Source server problems after automatic failover

Question

I'm running tests on RepliStor to determine whether this application can be used in our company's server farm.

I have 4 virtual machines:

Windows 2003 + SP1: Domain controller
Windows 2003 + SP1 + Exchange + RepliStor (source server)
Windows 2003 + SP1 + Exchange + RepliStor (target server)
Windows XP + Outlook 2003 (client)

RepliStor: 6.2.0.469
RepliStorExchange: 1.0.2.

When I use the manual failover function, failover and failback work as they should, but when I test automatic failover the following happens:

I disable the NIC of the source server. After ~2 minutes the target server takes over the alias. Clients can connect to the target server now. (This works OK)

On the source server, however, the Exchange services aren't stopped and rep_srv.exe starts using 99% CPU. The workaround I've used before is to manually stop the Exchange services and use the delete data directory function to get the source server working again.

This time, however, when I disabled the source server NIC during a large mirroring operation (the data dir is ~330 MB), rep_srv.exe is taking up so many resources that the clean data directory command (rep_srv cmd deletedatadir) no longer responds. I also cannot connect the client to the service.

I have two questions:
- Is this the normal operation for RepliStor failover? In my opinion, the semi-crashing of the source server after losing its network connection, instead of a clean service shutdown, seems rather dirty.

- What should I do to fix my source server? Will manually deleting the files in the data dir do the trick?

My configuration: ESX 3 server with 6 GB RAM (512 MB assigned to each virtual machine), 2 x AMD Opteron Dual Core 2600 MHz

Message was edited by:
cleaudevink

dramjass · Answer

The RepliStorExchange module you are using is now the old method of performing failover with RepliStor and Exchange.

With RepliStor 6.2, there is a version 2 of the RepliStorExchange module. This version provides a much better and more natural method of failing over Exchange.

How long did you leave the Source server before performing the DeleteDataDir?
Does the Target site get Blocked, or are you trying to 'clean' things up immediately afterwards?

Typically, you would not want that much data in the Kernel Cache on a failover. If you have that much, it means ~330 MB in changes have not been replayed on the Target, which is now the new production machine.

Although the test you are doing is interesting for seeing what happens upon a failover, it is not consistent with the way RepliStor handles the I/O on a machine. If you have 330 MB in the Kernel Cache/Logs, it means either that your bandwidth is not sufficient for what you are trying to accomplish, or that you simply overloaded the box with "not so practical" I/O and then did the failover.
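
To put a rough number on the bandwidth point, here is a minimal back-of-the-envelope sketch in Python. It is not part of RepliStor; the function name `drain_seconds` and its parameters are made up for illustration, and it assumes the link is used only for replication, with no compression or protocol overhead. It estimates how long a backlog of unreplayed changes would take to reach the Target at a given link speed.

```python
def drain_seconds(backlog_mb: float,
                  link_mbit_per_s: float,
                  new_change_mb_per_s: float = 0.0) -> float:
    """Estimate the seconds needed to ship a replication backlog.

    backlog_mb          -- unreplayed changes waiting on the Source, in MB
    link_mbit_per_s     -- usable link speed in megabits per second
    new_change_mb_per_s -- rate at which new changes keep arriving, in MB/s

    Returns float('inf') when changes arrive at least as fast as the
    link can send them, i.e. the backlog never drains.
    """
    link_mb_per_s = link_mbit_per_s / 8.0          # megabits -> megabytes
    net_mb_per_s = link_mb_per_s - new_change_mb_per_s
    if net_mb_per_s <= 0:
        return float("inf")
    return backlog_mb / net_mb_per_s


if __name__ == "__main__":
    # The ~330 MB backlog from this thread on two hypothetical links.
    print(f"{drain_seconds(330, 100):.1f} s on a 100 Mbit/s link")  # 26.4 s
    print(f"{drain_seconds(330, 10):.1f} s on a 10 Mbit/s link")    # 264.0 s
```

If the estimate comes out longer than you can tolerate before a failover, either the link is too small for the change rate or the test load is heavier than anything production would generate.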
