Large RAID 5 array slow performance (Unsolved)
TXMRS — August 15th, 2007 14:00
Hello,
I have a PowerEdge 2800 with dual 3.8GHz Xeons, 4GB RAM, PERC 4e/Di controller (with 256MB RAM), all with the latest firmware and drivers, running Windows Server 2003 SP1.
The system has 10 drives -- 2 in a RAID 1 array on the first controller channel (for the OS and programs), and the other 8 in a RAID 5 array on the second channel (for data storage), with a 32k stripe size. All drives are 15k U320 SCSI.
Currently there are about 5.4 million files on the RAID 5 array, totaling about 500GB. Most of them are small files under 32k (which is why we chose the smaller stripe size).
We use disk-to-disk backup over the network and get dismal performance backing everything up. Larger files back up and copy at "reasonable" speeds, but when backing up all of the small files, we see only about 20-150 files per second on incremental backups. Backups take about 15 hours for the 500GB with antivirus disabled (even longer with it enabled). On a similar server with a smaller array and a more varied file mix, we can back up 100GB in about an hour (with antivirus *enabled*).
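For scale, the quoted numbers work out as follows (a quick sketch using the round figures from the post):

```python
# Back-of-envelope check of the figures quoted above.
total_files = 5_400_000
total_bytes = 500 * 1024**3      # ~500GB
seconds = 15 * 3600              # ~15-hour backup window

files_per_sec = total_files / seconds
mb_per_sec = total_bytes / seconds / 1024**2

print("%.0f files/s, %.1f MB/s" % (files_per_sec, mb_per_sec))
# ~100 files/s, ~9.5 MB/s
```

So the array is moving under 10MB/s, far below both gigabit wire speed (~120MB/s) and the sequential bandwidth of eight 15k spindles, which points at per-file overhead rather than raw throughput as the bottleneck.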
Our backup provider says that their software has no problem handling large datasets (their largest customer is backing up 40 million files in one job), and the server load is negligible (at idle) while the backup is running.
Most of the data is located in a 23x23 directory structure (so 23 directories, each with 23 sub-directories, and the files evenly distributed), so each sub-directory contains around 10,000 files.
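The layout above works out to roughly 10,000 files per leaf directory, as stated (a minimal check):

```python
# Check the per-directory file count implied by the 23x23 layout.
top_dirs, sub_dirs = 23, 23
total_files = 5_400_000

leaf_dirs = top_dirs * sub_dirs            # 529 leaf directories
files_per_leaf = total_files / leaf_dirs   # ~10,208 files each

print(leaf_dirs, round(files_per_leaf))
```

Directories of this size are well within NTFS limits, though with that many similarly named entries per directory, 8.3 short-name generation can add measurable overhead to file creation and enumeration.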
Does anybody have any insight into why we are seeing such dismal performance? Local file copy is pretty slow, but access times are fast, so I don't think it's a hardware limitation of the disk drives. I have looked for some sort of limiting factor in the OS, but have found none.
The network has been eliminated as a bottleneck (usage is less than 1% during the backup), as has the backup server. All of our other backup jobs run at very fast speeds, even on slower/older hardware, so we believe it to be a problem with the array or filesystem.
Thanks in advance for your help!
Brian
pcmeiners (Operator) — August 15th, 2007 19:00
TXMRS — August 15th, 2007 20:00
Thank you for the response. Unfortunately, the way the software works, all of the data has to be on one volume, so we had to create the almost-1TB array, and we needed 15k rpm drives. When the server was built, 146GB 15k drives were the largest available, hence the excessive number of drives (we would have bought 300GB drives had they been available at 15k at the time).
The volume is set to Adaptive Read-Ahead (read policy), Write-Back (write policy), and Direct I/O (cache policy). We were not able to benchmark the different stripe sizes using the production software.
I had not run QCheck, but I downloaded and installed it on the two servers, and it reports 800Mbps throughput and all response times at 1ms with maxed-out test settings. Both servers are on the same gigabit switch, with no errors reported on any device. All cabling is CAT6 certified and individually tested with our Fluke CAT6 network tester.
I have not touched the flow control settings, but I will certainly look into them. I'm wondering if it is Windows 2003, but in reality, 5+ million files really isn't that much these days, so I would think it could handle it.
I am also running a defrag on the volume, although it looks like it may take more than 48 hours.
I have looked at the real-time filesystem calls using Filemon from Sysinternals, and it showed response times of more than 1s in some cases, even when no other calls were being made; at other times they were in the millisecond range. The backup is processed, indexed, and run locally, with just the data stream being sent to the backup server, so the calls are coming from the local system, not the backup server. Processor and memory load are almost negligible while the backup is running (processor averaging 5%, only about 20MB of memory used by the backup program, and more than 3GB of free system memory).
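The per-file metadata pattern can be reproduced outside the backup software. A minimal sketch (the `scan_tree` helper and the synthetic tree are illustrative, not part of the actual backup product) that stats every file the way an incremental change-detection pass would:

```python
import os
import tempfile
import time

def scan_tree(root):
    """Walk a directory tree and stat() every file, returning
    (file_count, elapsed_seconds). This is roughly the per-file
    metadata work an incremental backup's scan does."""
    count = 0
    start = time.perf_counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            os.stat(os.path.join(dirpath, name))
            count += 1
    return count, time.perf_counter() - start

# Demo on a small synthetic tree (23 directories x 10 files here,
# not the real 5.4 million files).
root = tempfile.mkdtemp()
for d in range(23):
    sub = os.path.join(root, "dir%02d" % d)
    os.makedirs(sub)
    for f in range(10):
        open(os.path.join(sub, "file%d.dat" % f), "w").close()

count, elapsed = scan_tree(root)
print("stat'ed %d files in %.3fs" % (count, elapsed))
```

At the observed ~100 files/s, each file is costing about 10ms -- on the order of a few random disk I/Os per file, which is consistent with a seek-bound workload rather than a bandwidth problem.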
If I have overlooked anything, please let me know! Thanks again!
Brian
pcmeiners (Operator) — August 16th, 2007 11:00
You do your homework, so I assume all your firmware and drivers are up to date and all drives are at the same firmware revision. It's a pleasure to see a poster provide good info.
I really do not think it is the RAID subsystem, as the PERC 4e/Di is a very fast adapter. What is the disk allocation unit size? Do your other machines use a 32k stripe size? Any idea of the RAID card (CPU) utilization, normally and during backup?
Again, this is not the cause of your major bottleneck, but you can divide any array over multiple channels; it has nothing to do with the creation of volumes. The point is that 8 drives on one channel are beyond the SCSI saturation point, whereas if you had divided both arrays across the two channels, they would not be. As set up, the server likely does not lose any speed unless highly stressed, so it's not critical... but consider it next time.
"800Mbps" -- OK. Since you have no errors, you could try disabling flow control on the machines and the switch; this will generally give you more network throughput, though again this is not your major bottleneck. If toggling it off does nothing, disable it on all machines and leave it enabled on the switches; it is redundant to have it enabled at both the NICs and the switch. There were also a few posts about slowdowns with the TCP offload engine parameters enabled (something to look into).
Registry hacks:
Disable Windows file timestamps (last-access updates):
http://www.windowsnetworking.com/nt/registry/rtips71.shtml
Disable 8.3 name creation:
http://www.jsifaq.com/SF/Tips/Tip.aspx?id=0026
This hack can cause installation issues with older software, such as Veritas 8.5 and 8.6.
Back up the original registry keys, and document them, before changing the above.
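For reference, both of the tweaks above live under the same filesystem key; a .reg fragment like the following sets them (a sketch -- export and back up the key first, as noted):

```
Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\FileSystem]
"NtfsDisableLastAccessUpdate"=dword:00000001
"NtfsDisable8dot3NameCreation"=dword:00000001
```

A reboot is needed for these to take effect, and existing files keep any 8.3 short names they already have; only new file creation is affected.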
Are any Windows 2000 machines involved? If so, turn off SMB signing, if security on your network allows. I disable it at all my clients with mixed Win2000/Win2003 environments.
http://support.microsoft.com/default.aspx?scid=kb;en-us;Q321169
You might want to try Process Monitor:
http://www.microsoft.com/technet/sysinternals/utilities/processmonitor.mspx
I have never used it, but there is also Diskmon:
http://www.microsoft.com/technet/sysinternals/utilities/diskmon.mspx
Not critical, and it sounds impractical at your site... but at most of my clients I am able to do a boot-time defrag every couple of months with Diskeeper. It speeds up programs that remain resident during a normal defrag, such as SQL, Exchange, etc.
“I'm wondering if it is Windows 2003, but in reality, 5+ million files really isn't that much these days”
Same opinion. Can you disable the AV scanning during the backup only?
I am running out of ideas, anyone else have any?
DELL-GaryS — August 17th, 2007 19:00
pcmeiners (Operator) — August 18th, 2007 16:00
TXMRS — October 26th, 2007 13:00
Most of the performance problem was related to the backup software. We ran a comparison test against NTBackup, and NTBackup performed at the levels we were expecting. We are working with the software vendor on optimizations.
However, this past weekend, we updated the firmware across the board on the server. We thought we were up-to-date, but turns out we were a few months behind. BTW, Dell, your driver/firmware notification system DOES NOT WORK. The only notices I've received were for some network drivers, and I signed up for everything. So please, fix it so I don't have to check the site every week.
In any case, we upgraded to BIOS A06 from A05, to BMC 1.7x from 1.5x (sorry, can't remember the exact numbers), to PERC 4e/Di 522D from 522A, and applied the drive firmware updates from Dell's updating ISO.
And now performance is absolutely terrible. Our backups are running 5X slower (they only process about 1.2 million files in 13 hours, whereas they used to do the full 5.5 million in that time). So we haven't had a full backup since last week.
I'm opening a ticket with Dell, but sometimes they aren't too helpful on things like this, especially if the system has to remain up in the meantime.
I have run consistency checks on all of the RAID volumes and chkdsk on all drives, and there are no anomalies or errors. We did also apply about 10 Windows updates, but none should be related, and they did not affect performance on other systems with smaller arrays (backups take the same amount of time on those systems).
Otherwise, nothing has changed on the system since before the updates, so I'm pretty sure the performance issues are related to the firmware updates.
Also, both the NIC and the PERC are running "fake" high IRQs, so I'll see if I can free up some lower IRQs and give the PERC one of those.
Thank you for all of your help and suggestions!
DELL-GaryS — October 30th, 2007 15:00
It's a critical update to fix a problem that could be related to your performance issue. Read the release notes.
Regards,
Dell-GaryS
TXMRS — October 30th, 2007 18:00
Thank you. I tried this last night; the consistency check ran, but performance is the same. I only learned of the new firmware revision yesterday (it came out on the 25th). I was hopeful, but so far it has not helped.
Thank you!
DELL-GaryS — October 30th, 2007 20:00
pcmeiners (Operator) — October 30th, 2007 20:00
TXMRS — October 31st, 2007 15:00
Thank you for your feedback. I just got off the phone with my Dell support rep (VERY helpful and knowledgeable). He has been working on this for a few days, and he just heard from the PERC developers that 522D introduced an algorithm change that affects RAID 5 volumes with more than 5 disks. Apparently the new algorithm (supposedly for parity calculations) is slower than the old one, but safer for the data. I never had any data corruption with 522A, so I have reverted to 522A (my Dell rep said that if we did not have problems before, we should hopefully be okay), and it is running a background initialization on the volume right now (he said it is updating and rewriting all of the parity data, but nothing is being lost). Once that completes, I'll run a consistency check and then a backup speed test. Hopefully I'll be back to the old performance levels (which weren't stellar, but that was the software, as I mentioned in an earlier post).
The firmware developers have heard of these performance issues in some cases, so I'm hoping that their next release (after 522D, since performance was not improved with 522D) will have both the safety AND the performance.
I'll post an update once I'm able to do some performance testing.
Thanks again!
pcmeiners (Operator) — October 31st, 2007 16:00
gtrplyr — November 27th, 2007 15:00