4 Operator


1.8K Posts

August 15th, 2007 19:00

"The system has 10 drives -- 2 in a RAID 1 array on the first controller channel (for OS and programs), and the other 8 in a RAID 5 array on the second channel (for data storage) and with 32k stripe size".
 
You should have placed the RAID 1 and RAID 5 drives divided equally over the two channels, though I do not think this is the major bottleneck. Did you benchmark the 32k stripe size against the 64k default?
 
Do you have write-back enabled?
 
"The network has been eliminated as a bottleneck (usage is less than 1% during the backup), as has the backup server."
The 1% utilization figure means little on its own. Did you check actual network throughput with a NIC utility such as NetIQ Qcheck? Most of the time it is the network or the OS, not the array.
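If Qcheck is not handy, a raw-TCP throughput check can be scripted. The sketch below runs against loopback so it is self-contained; for a real test you would point the client at the backup server's address (the 64 KB buffer and 16 MB transfer size here are arbitrary choices):

```python
import socket
import threading
import time

PAYLOAD = b"\x00" * 65536   # 64 KB per send
TOTAL_MB = 16               # total data to push for the test

def sink(server_sock, done):
    """Accept one connection and drain everything sent to it."""
    conn, _ = server_sock.accept()
    while conn.recv(65536):
        pass
    conn.close()
    done.set()

def throughput_mbps(host="127.0.0.1"):
    """Measure approximate TCP throughput to `host` in megabits/sec."""
    srv = socket.socket()
    srv.bind((host, 0))          # ephemeral port
    srv.listen(1)
    done = threading.Event()
    threading.Thread(target=sink, args=(srv, done), daemon=True).start()

    cli = socket.socket()
    cli.connect((host, srv.getsockname()[1]))
    start = time.time()
    for _ in range(TOTAL_MB * 1024 * 1024 // len(PAYLOAD)):
        cli.sendall(PAYLOAD)
    cli.close()
    done.wait()
    elapsed = time.time() - start
    srv.close()
    return TOTAL_MB * 8 / elapsed   # MB -> megabits

if __name__ == "__main__":
    print("approx throughput: %.0f Mbps" % throughput_mbps())
```

Loopback numbers will be far higher than wire numbers; the point is comparing two hosts over the actual network path.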
 
Did you check your cables? If you have a managed switch, check all ports between the server and the backup unit for errors. If there are none or very few, try disabling flow control on the machines involved and on the switch.

5 Posts

August 15th, 2007 20:00

pcmeiners,

Thank you for the response. Unfortunately, the way the software works, the data has to all be on one volume, so we had to create the almost 1TB array and needed 15k rpm drives. When the server was built, 146GB 15k's were the largest available, hence the excessive number of drives (would have purchased the 300GB's if they were 15k at the time).

The volume is set to Adaptive Read-Ahead (Read Policy), Write-Back (Write Policy), and Direct I/O as the cache policy. We were not able to benchmark the different stripe sizes using the production software.

I had not run QCheck, but I downloaded and installed it on the two servers, and it's reporting 800Mbps throughput and all response times at 1ms using maxed out testing values. Both servers are on the same gigabit switch, with no errors reported on any device. All cabling is CAT6 certified and individually tested with our Fluke CAT6 network tester.

I have not messed around with the flow control, but I will certainly look into it. I'm wondering if it is Windows 2003, but in reality, 5+ million files really isn't that much these days, so I would think it could handle it.

I am also running a defrag on the volume, although it looks like it may take >48 hours.

I have looked at the real-time filesystem calls using Filemon from SysInternals, and it was showing response times to be >1s in some cases, even when there were no other calls being made. But other times it was in the 1/1000th second area. The backup is processed, indexed, and run locally, with just the data stream being sent to the backup server, so the calls are coming from the local system and not the backup server. Processor and memory load are almost negligible while the backup is running (processor averaging 5%, memory only about 20MB used by the backup program, and >3GB free system memory)
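Filemon's per-call numbers can also be sanity-checked from a script. This sketch (a rough approximation, not a Filemon equivalent) times individual metadata calls, which is roughly the cost a file-by-file backup pays millions of times over; it builds a small throwaway tree so it runs anywhere:

```python
import os
import tempfile
import time

def sample_stat_latency(root, limit=10000):
    """Walk a tree, timing each os.stat() call.

    Returns (average_ms, worst_ms) over at most `limit` files."""
    timings = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            t0 = time.perf_counter()
            os.stat(os.path.join(dirpath, name))
            timings.append((time.perf_counter() - t0) * 1000.0)
            if len(timings) >= limit:
                return sum(timings) / len(timings), max(timings)
    return sum(timings) / len(timings), max(timings)

if __name__ == "__main__":
    # Build a small scratch tree so the sketch is self-contained.
    with tempfile.TemporaryDirectory() as root:
        for i in range(200):
            open(os.path.join(root, "f%03d" % i), "w").close()
        avg_ms, worst_ms = sample_stat_latency(root)
        print("avg %.3f ms, worst %.3f ms per stat()" % (avg_ms, worst_ms))
```

Even a fraction of a millisecond of extra per-file latency adds hours across 5+ million files, which is why the occasional >1s outliers in Filemon matter.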

If I have overlooked anything, please let me know! Thanks again!

Brian

4 Operator


1.8K Posts

August 16th, 2007 11:00

You do your homework, so I assume all your firmware and drivers are up to date, and all drives are at the same firmware revision. Pleasure to see a poster provide good info.

I really do not think it is the RAID subsystem, as the PERC 4e/Di is a very fast adapter. What is the disk allocation unit size? Do your other machines also use a 32k stripe size? Any idea what the RAID card's CPU utilization is, normally and during backup?

 

Again, it is not the cause of your major bottleneck. You can divide any array over multiple channels; it has nothing to do with the creation of volumes. The point is that 8 drives on one channel are beyond the SCSI saturation point, whereas if you had divided both arrays over the two channels, they would not be. As set up, the server likely does not lose any speed unless highly stressed, so it is not critical, but consider it next time.
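That saturation point is easy to sanity-check with rough numbers. Both figures below are assumptions for illustration: roughly 320 MB/s of usable bandwidth on a U320 channel, and roughly 70 MB/s sustained sequential per 15k drive.

```python
# Back-of-envelope check on SCSI channel headroom.
# Assumed figures: ~320 MB/s per U320 channel, ~70 MB/s per 15k drive.
U320_MBPS = 320
PER_DRIVE_MBPS = 70

def channel_demand(drives_on_channel, per_drive=PER_DRIVE_MBPS):
    """Aggregate sequential bandwidth the drives could demand of one channel."""
    return drives_on_channel * per_drive

if __name__ == "__main__":
    print("8 drives on one channel: %d MB/s vs %d MB/s bus"
          % (channel_demand(8), U320_MBPS))
    print("5 drives per channel:    %d MB/s vs %d MB/s bus"
          % (channel_demand(5), U320_MBPS))
```

With all 8 data drives on one channel the potential demand is well past the bus ceiling; splitting the arrays across both channels spreads that load, which is pcmeiners' point, even if random backup I/O rarely hits those sequential peaks.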

 

"800Mbps", OK. Since you have no errors, you should try disabling flow control on the machines and the switch; this will generally give you more network throughput, though again it is not the major bottleneck. If toggling it off does nothing, disable it on all machines and leave it enabled on the switches; it is redundant to have it enabled at both the NICs and the switch. There were a few posts about slowdowns with the TCP offload engine parameters enabled (something to look into).

Registry hacks:

Disable Windows last-access file timestamps:

http://www.windowsnetworking.com/nt/registry/rtips71.shtml

 

Disable 8.3 name creation:

http://www.jsifaq.com/SF/Tips/Tip.aspx?id=0026

 

This hack can cause installation issues with older software, such as Veritas 8.5 and 8.6.

Back up the original registry keys, and document them, before changing the above.
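For reference, both tweaks above come down to two DWORD values under the same NTFS key; a .reg fragment like this captures them (export the key first, as noted above):

```reg
Windows Registry Editor Version 5.00

; Stop NTFS updating last-access timestamps on every file touch
; and stop it generating short 8.3 names for new files
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
"NtfsDisableLastAccessUpdate"=dword:00000001
"NtfsDisable8dot3NameCreation"=dword:00000001
```

Both take effect for files created or touched after a reboot; existing 8.3 names are not removed.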

 

Any Windows 2000 machines involved? If so, turn off SMB signing, if security on your network allows. I disable it at all my clients with mixed Win2000 and Win2003 environments.

http://support.microsoft.com/default.aspx?scid=kb;en-us;Q321169
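The KB above walks through the exact settings. On the server side, SMB signing is controlled by two DWORD values (shown below; verify the matching workstation-side values against the KB before touching a production box):

```reg
Windows Registry Editor Version 5.00

; Server-side SMB signing off (see the linked KB for the client-side
; LanmanWorkstation\Parameters counterparts)
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters]
"EnableSecuritySignature"=dword:00000000
"RequireSecuritySignature"=dword:00000000
```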

 

You might want to try Process Monitor:

http://www.microsoft.com/technet/sysinternals/utilities/processmonitor.mspx

 

I have never used it myself, but there is also Diskmon:

http://www.microsoft.com/technet/sysinternals/utilities/diskmon.mspx

Not critical, and it sounds impractical at your site, but at most of my clients I am able to do a boot-time defrag every couple of months with Diskeeper. It speeds up programs which remain resident during a normal defrag, such as SQL, Exchange, etc.

 

“I'm wondering if it is Windows 2003, but in reality, 5+ million files really isn't that much these days”

 

Same opinion. Can you disable the AV scanning during the backup only?

I am running out of ideas, anyone else have any?



Message Edited by pcmeiners on 08-16-2007 07:59 AM

777 Posts

August 17th, 2007 19:00

 Hi guys,
 
  You might want to look at your IRQ assignments and see if the RAID controller and the NIC are sharing the same IRQ. BTW, IRQs listed in Device Manager above IRQ 15 are "fake" IRQs, used when a device is sharing a physical IRQ with another device. The service routine has to poll each device sharing the IRQ to see which one raised it, and with high-usage devices you can get another interrupt before that polling completes; the first device polled ends up stealing the requests of the other device(s), which then have to wait for the IRQ request stack to get back to them. In the worst cases you even get drive requests timing out and drives dropping offline.
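The starvation pattern described above can be shown with a toy model (purely illustrative; this is not how Windows actually dispatches interrupts):

```python
import collections

def service_shared_irq(pending, poll_order):
    """One pass of an interrupt service routine that polls devices in a
    fixed order and services the first one with work queued. Devices
    later in the poll order starve while earlier ones stay busy."""
    for dev in poll_order:
        if pending[dev]:
            pending[dev] -= 1
            return dev
    return None

if __name__ == "__main__":
    # NIC fires constantly; the RAID controller has 5 requests queued behind it.
    pending = collections.Counter(nic=50, raid=5)
    serviced = [service_shared_irq(pending, ["nic", "raid"]) for _ in range(50)]
    print("RAID requests serviced in first 50 passes:", serviced.count("raid"))
```

As long as the device polled first keeps work pending, the other never gets serviced, which is the timeout/drop-offline scenario in the worst case.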
  Try disabling the serial and parallel ports if you are not using them.
 
Regards,
Dell-GaryS

4 Operator


1.8K Posts

August 18th, 2007 16:00

Good point, Gary

5 Posts

October 26th, 2007 13:00

Sorry for the long delay. We have been extremely busy and have done more testing.

Most of the performance was related to the backup software. We did a comparison test vs. NTBackup and NTBackup was performing at the levels of performance we were expecting. We are working with the software vendor for optimizations.

However, this past weekend, we updated the firmware across the board on the server. We thought we were up-to-date, but turns out we were a few months behind. BTW, Dell, your driver/firmware notification system DOES NOT WORK. The only notices I've received were for some network drivers, and I signed up for everything. So please, fix it so I don't have to check the site every week.

In any case, we upgraded to BIOS A06 from A05, to BMC 1.7x from 1.5x (sorry, can't remember the exact numbers), to PERC 4e/Di 522D from 522A, and applied the drive firmware updates from Dell's updating ISO.

And now performance is absolutely terrible. Our backups are running 5X slower (only processes about 1.2 million files in 13 hours, whereas it used to be able to do the 5.5 million in that time). So, we haven't had a full backup since last week.
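Putting the file counts quoted above into numbers, the regression works out to roughly a 4.6x drop in files per hour:

```python
def files_per_hour(files, hours):
    """Throughput of a file-by-file backup in files per hour."""
    return files / hours

if __name__ == "__main__":
    before = files_per_hour(5_500_000, 13)   # pre-update: full job fit in 13 hrs
    after = files_per_hour(1_200_000, 13)    # post-update: 1.2M files in 13 hrs
    print("before: %.0f files/hr, after: %.0f files/hr, slowdown %.1fx"
          % (before, after, before / after))
```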

I'm opening a ticket with Dell, but sometimes they aren't too helpful on things like this, especially if the system has to remain up in the meantime.

I have run consistency checks on all of the RAID volumes, chkdsk on all drives, and there are no anomalies or errors. We did also apply about 10 Windows updates, but none should be related, and did not affect performance on other systems with smaller arrays (backups take the same amount of time on those systems).

Otherwise, nothing has changed on the system since before the updates, so I'm pretty sure the performance issues are related to the firmware updates.

Also, both the NIC and the PERC are running "fake" high IRQs, so I'll see if I can free up some lower IRQs and give the PERC one of those.

Thank you for all of your help and suggestions!

777 Posts

October 30th, 2007 15:00

Try updating your PERC firmware to 5A2D (link here).

It's a critical update to fix a problem that could be related to your performance issue. Read the release notes.

 

Regards,

Dell-GaryS

5 Posts

October 30th, 2007 18:00

GaryS,

Thank you. I tried this last night, the consistency check ran, and performance is the same. I just learned of the new firmware revision yesterday (it only came out on the 25th). I was hopeful but so far it has not helped.

Thank you!

777 Posts

October 30th, 2007 20:00

One other thing: check to see if the firmware updates turned on your Patrol Read. (Patrol Read is a near-constant background consistency check on your RAID array.)
 
Dell-GaryS

4 Operator


1.8K Posts

October 30th, 2007 20:00

"Otherwise, nothing has changed on the system since before the updates, so I'm pretty sure the performance issues are related to the firmware updates."
 
You did say you updated the NIC driver at the same time; double-check all the settings. Try 100/Full and "auto detect", and again try turning off flow control, at both the NIC and the switch port, on this server and the backup machine. Any possibility of a temporary direct cable connection to the backup machine for testing? Any chance of running a disk benchmark such as HD Tach? I don't consider it a great benchmark, but the results would highlight whether you have an array issue; it is safe, and I have run it on many servers. And the obvious: since the RAID firmware flash, have you checked that the controller settings have not changed?


Message Edited by pcmeiners on 10-30-2007 04:44 PM

5 Posts

October 31st, 2007 15:00

pcmeiners and GaryS,

Thank you for your feedback. I just got off the phone with my Dell support rep (VERY helpful and knowledgeable). He's been working on this for a few days now, and he just heard from the PERC developers that 522D introduced an algorithm change that affects RAID 5 volumes with more than 5 disks. Apparently this algorithm (supposedly the parity calculations) is slower than the older one, but safer for the data. I never had any data corruption with 522A, so I have reverted to 522A (my Dell rep said that if we did not have problems before, we should hopefully be okay), and it is running a background initialization on the volume right now (he said it is updating and rewriting all of the parity data, but nothing is being lost). Once that completes, I'll run a consistency check and then a backup speed test. Hopefully I'll be back to the old performance levels (which weren't stellar, but that was the software, as I mentioned in an older post).

The firmware developers have heard of these performance issues in some cases, so I'm hoping that their next release (after 5A2D, since performance was not improved with 5A2D) will have both the safety AND performance.

I'll post an update once I'm able to do some performance testing.

Thanks again!

4 Operator


1.8K Posts

October 31st, 2007 16:00

"Apparently this algorithm (supposedly parity calculations) is slower than the older algorithm,"
On a backup, parity creation is a small issue, as most I/O is reads, so there must be more to this.
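For anyone following along, the reason reads sidestep parity: RAID 5 parity is just the XOR of the data blocks in a stripe, computed on writes and only read back to rebuild a missing block. A minimal sketch:

```python
def xor_parity(blocks):
    """RAID 5 parity is the byte-wise XOR of the data blocks in a stripe."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]  # toy 3-disk stripe
    p = xor_parity(stripe)
    # Reading an intact stripe needs no parity math at all; parity is only
    # used to rebuild a lost block: XOR of the survivors recovers it.
    rebuilt = xor_parity([stripe[0], stripe[2], p])
    print("rebuilt block matches:", rebuilt == stripe[1])
```

So a slower parity algorithm should mostly tax writes, which is why a read-heavy backup slowing 5x suggests something else changed in the firmware as well.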
 
Anyway at least you can roll back to the older firmware. Glad you have some resolution to this issue.

9 Posts

November 27th, 2007 15:00

This might be overkill, but you might try an I/O monitoring tool like this one to help isolate these issues:   http://www.vmware.com/appliances/directory/1084
 
It's on the VMware site, but doesn't just do VMware.