11 Posts

April 12th, 2010 05:00

Determine cause of high kernel logcount?

Is there an easy way to determine a possible cause (such as which specification) of a high log count?

We replaced one of our old (2 mirrored) fileservers with a new Windows Storage Server 2008 machine. The initial sync with the old machine worked fine. The final incremental sync seems to have _not_ finished this morning, since I'm now looking at a climbing kernel log count, which was a rare event with our old machine (768MB kernel cache; the new machine has 3GB kernel cache).

I suspect that one of the specifications is causing this, but I wonder if there is an easy way to determine what is causing the backlog (it was around 200 when I looked this morning, now we're at 835, and I have seen it at 900+ this afternoon). It is sending/receiving in excess of 30MBytes/sec at times, which we consider 'normal' for the old machine, but there must be something that is keeping this log count high.

Short description of our setup: 2 hosts with around 9 specifications (all use circular mirroring; reflect protection is checked on ALL specifications, just verified that). The old host is a 4GB Windows 2003 machine (32-bit, RepliStor 6.2.1.3); the new host is a 12GB Windows Storage Server 2008 machine (64-bit OS and RepliStor 6.4.0.34).

151 Posts

April 12th, 2010 06:00

Walter, open the KernelCache.rdf or even one of the OC$.rdf files in the Data Directory.  Once open, it will show you the pending operations that are to be sent to the Target.

Is the operation on the same file?

Circular Mirroring may be having an issue, in which case you would be replicating the same data over and over.
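One way to check whether the same file keeps reappearing is to tally the pending operations per path. This is only a minimal sketch: it assumes the pending operations have been copied out of the viewer as plain text, one `OPERATION<TAB>PATH` pair per line. That text layout and the function name are hypothetical and not something RepliStor is known to produce.

```python
from collections import Counter
from pathlib import PureWindowsPath


def tally_pending(lines):
    """Count pending operations per file path and per file extension.

    Assumes each line looks like 'OPERATION<TAB>PATH' (a hypothetical
    text export; the real .rdf layout is not described in this thread).
    """
    per_path = Counter()
    per_ext = Counter()
    for line in lines:
        line = line.strip()
        if not line or "\t" not in line:
            continue  # skip blanks and lines that do not match the assumed layout
        _operation, path = line.split("\t", 1)
        key = path.lower()  # Windows paths are case-insensitive
        per_path[key] += 1
        per_ext[PureWindowsPath(key).suffix] += 1
    return per_path, per_ext
```

Calling `per_path.most_common(10)` on the result would list the paths with the most queued operations. A single file dominating that list would be consistent with the same data being replicated repeatedly.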

If anything, perform the following:

- Disable the associated Specs on both sides.
- Perform a DeleteDataDir.
- Enable the Specs.
- Re-run the Inc Sync.

106 Posts

April 12th, 2010 07:00

There are a couple of things that come to mind. 1) There was a change in the way RS queues data to be sent to the Target node, which is part of 6.4.0.34. In the past, RS would queue all the Sync copies to occur before the mirroring updates took place. In the most recent versions there was a change in the biasing of the queued data. This caused RS to send file updates to the Target before the entire specification was fully in-sync.

That might be part of the problem; however, there was also a deadlock found in the RS code that could be affecting the sync in your case. Version 6.4, SP1 was released last Friday, April 9th. The deadlock I mentioned is fixed in that version.

I would recommend you install 6.4, SP1 as a first step.

2) I would not recommend you use 3GB of RAM for the Kernel Cache. Here is my reasoning: By allocating 3GB of memory to RepliStor you give up that memory for use elsewhere in the machine. Once the specifications are in-sync I would expect the Kernel Cache used to drop down close to zero. That means that the memory should be idle or very little used all the time except during a sync. In order to size the Kernel Cache correctly you could just, as a test, enable all the specs but not sync them for a while. Then just monitor the Kernel Cache and see how high it gets doing nothing but Mirroring data, not syncing. If it stays down below 10-20% then you could drop the 3GB down to 1GB or less. This would provide enough Kernel Memory for normal day-to-day Mirroring and release the additional memory for use elsewhere.
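The sizing test above can be written down as a small check. This is a sketch under stated assumptions: the usage percentages are read off the Kernel Cache bar by hand during mirroring-only operation, and the function name, the 20% default threshold and the 1024MB reduced size are illustrative choices taken from the rule of thumb, not RepliStor settings.

```python
def suggest_cache_mb(usage_percent_samples, current_mb,
                     threshold_percent=20.0, reduced_mb=1024):
    """Suggest a Kernel Cache size from mirroring-only usage samples.

    Applies the rule of thumb: if peak usage stays under the threshold
    (10-20% was suggested), a smaller cache should be enough.
    """
    if not usage_percent_samples:
        raise ValueError("need at least one usage sample")
    peak = max(usage_percent_samples)
    if peak < threshold_percent and current_mb > reduced_mb:
        return reduced_mb  # cache is mostly idle, so release the rest
    return current_mb  # usage is too high (or cache already small): keep it
```

For example, `suggest_cache_mb([3, 8, 12], 3072)` returns the reduced size, while any sample at or above the threshold leaves the current size unchanged.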

From what you are saying, even 3GB of memory is not enough during an initial sync.  Since 4.095GB is the absolute maximum amount of memory RS will allow, you are close to that value already and still overflowing into Kernel Logs.

My recommendation would be to drop the Kernel Cache setting back down to about 1GB of RAM and perform Incremental Syncs with Attribute Compare Only turned on in order to get the initial Sync over with.  Since you have already been in-sync with the previous version of RS that is certainly a good alternative.  Once the sync is complete, then just let Mirroring keep it in-sync.

Circular Mirroring is not "generally" a good idea, but if you must use it you are doing the right thing by having Reflect Protect turned on. I would also suggest that Copy on Close be turned on. That provides better protection if two users are accessing the same file at the same time. The updates would be held in Queue until the last person closes the file. Any changes made by the first person may be lost if the last person makes changes to the same part of the file. That is the danger of using Circular Mirroring. There is no way for RS to tell which person's changes should be saved and which should be lost.

One last note: Are you seeing Forwarding Activity (Ops/Sec) as well as Mirroring Activity (Ops/Sec) currently? (See the RS GUI.) If you are seeing Mirroring Activity but no Forwarding Activity, you might be having trouble with the known deadlock. If Forwarding stops altogether for long periods then I would recommend you upgrade to 6.4, SP1 as soon as possible.
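The symptom described here (Mirroring Activity continuing while Forwarding Activity sits at zero) can be checked against a series of readings. This is a rough sketch that assumes the two Ops/Sec values are noted down from the GUI at regular intervals; the function name and the sample layout are hypothetical.

```python
def longest_forwarding_stall(samples):
    """Return the longest run of consecutive samples that show mirroring
    activity but zero forwarding activity.

    samples: iterable of (mirroring_ops_per_sec, forwarding_ops_per_sec)
    """
    longest = current = 0
    for mirroring, forwarding in samples:
        if mirroring > 0 and forwarding == 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0  # forwarding resumed, or nothing to forward
    return longest
```

Multiplying the result by the sampling interval gives the length of the longest stall. How long counts as "long periods" is a judgment call that this sketch does not make.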

11 Posts

April 12th, 2010 07:00

Incredible reply speeds here, thanks  @all

We're seeing mirroring & forwarding activity all day so it doesn't appear to be the deadlock issue (yet) but I was planning to install the 6.4sp1 anyway.

You're probably right about the kernel cache, although the 3GB does seem to work better when starting large sync jobs (no real surprise). We chose to do so since our observations -albeit with (much) earlier versions of RepliStor- showed that these 'backlog' files take about 10 times longer to process than the kernel cache. We don't _really_ care that much about the 3GB (mainly because this machine has 3 times as much memory available as the old one).

In our setup we need circular mirroring (replistor is one of the few that seems to support this rather well). And I noticed with the newer gui that the 'copy on close' automatically gets selected and grayed out when I check 'reflect protection' so some engineer already decided that this was the best/safest way to do it.

11 Posts

April 12th, 2010 07:00

I already used loglook earlier and noticed an awful lot of operations on *.pst files, but these have always been there, so I'm not sure why that would be an issue all of a sudden.

Performing a delete data dir is going to hurt a bit, since the 9 specifications account for 2.5TB of data and even an incremental sync/attr. only takes "some time". We usually only do this in (long) weekends -if needed at all- but since I'm planning to upgrade to the 6.4SP1 update of last week I'll try the "delete data dir" as well and see if things improve after that.

11 Posts

April 12th, 2010 08:00

Both machines are now updated to 6.4.1.3. On the new WSS (x64) machine the kernel cache bar jumped around like mad (23-74-99-12-57-80-63-41-0%) after trimming it back to 1GB and performing a delete data directory with only mirroring running (no active sync jobs). By now I have triple-checked to see if I made it 1MB by accident, but it sure is 1024MB (KernelCache.rdf is also 1GB in size). I've never seen it jumpy like this, very weird.

I'm now starting to sync specifications, but even with a small one (around 10GB) the kernel log count is already steadily increasing, which is atypical for us.
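To put a number on how "jumpy" the bar is, one option is the mean absolute change between consecutive readings. This is an illustrative metric chosen for this sketch, not something RepliStor reports, and the function name is hypothetical.

```python
def mean_swing(percent_samples):
    """Mean absolute change between consecutive usage samples,
    in percentage points. A rough 'jumpiness' measure."""
    if len(percent_samples) < 2:
        return 0.0
    diffs = [abs(b - a) for a, b in zip(percent_samples, percent_samples[1:])]
    return sum(diffs) / len(diffs)


# Readings quoted in the post above
readings = [23, 74, 99, 12, 57, 80, 63, 41, 0]
print(mean_swing(readings))  # 38.875
```

Taking the same kind of readings on both servers and comparing the two values would make the difference in behaviour easier to describe than watching the bars.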

Going home now, will check & report later tonight to see if things improve.

11 Posts

April 12th, 2010 14:00

Ok, the new hardware/RepliStor works fine. Right now everything has been synchronized both ways (9 specs, 2.5TB, 3.8 million files/folders); this was unthinkable in the past.

Synchronisation went fine, the kernel cache usage bar also seems back to normal, and all counters are now back to zero. We'll see how things go tomorrow when everybody has finished their coffee and starts working again.

11 Posts

April 13th, 2010 00:00

The incredibly 'jumpy' kernel cache issue is back. It is really odd behaviour that we never saw before (with 6.2.1.3). The kernel log count is already at 21 and climbing with just 173 user sessions and 586 open files (which is usual for this time of day). The old server, which has a kernel cache of only 512MB instead of 1024MB, doesn't seem to suffer from this behaviour.

If we get back to zero levels I'll try to capture a movie of it and upload it somewhere.

11 Posts

April 13th, 2010 03:00

Here's a screencap of the issue http://www.youtube.com/watch?v=JTTSgKxVVW4

On the left is the new WSS2008 with RepliStor 6.4.1.3 (both x64) with 1024MB kernel cache. On the right is Windows 2003 with RepliStor 6.4.1.3 (both 32-bit) with 512MB kernel cache. During capture there was only mirror activity. The 2003 server currently has about 25% fewer users and 50% fewer open files than the WSS2008, but that has always been the case.

106 Posts

April 13th, 2010 06:00

Walter,

I looked at the video capture and it appears that everything is working pretty well. I did notice that the kernel cache goes up and down more than on the other node. However, I also noticed that the Mirroring and Forwarding activity is consistently 3-4 times higher than on the other node as well. This could point to a couple of possibilities. First, the node is, I believe, Windows 2008, and due to some new OS internal operations it may just be more "busy". There are a number of features in Windows 2008 that simply were not there or were optional in Windows 2003. For example, Access Based Enumeration is on by default in Windows 2008, where it was very much optional in Windows 2003. Secondly, there is also probably more file indexing going on in Windows 2008.

There may also be some tuning differences in a WSS that account for some additional file operations that RepliStor needs to replicate to the Target node. Have you turned on Single Instance Storage, for example? This new feature may be very helpful in conserving disk space but would be potentially more I/O intensive.

I guess what I am saying is that, at this point, I don't see anything that would indicate that RepliStor is doing anything out of the ordinary.  The thing I would be greatly concerned about is if the Mirroring Activity in Ops/sec were high but the Forwarding Activity Ops/sec are very low or zero.

I would just monitor now and make sure the Mirroring and Forwarding continue to process normally.
