PowerEdge: Why do hard disks fail
Summary: This article explains in detail the different reasons hard drives can fail.
Instructions
Table of Contents
- Firmware Corruption and Damage to the firmware zone
- Electronic Failure
- Mechanical Failure
- Logical Failure
- Media Errors
- SCSI/SAS Environment
Firmware Corruption and Damage to the firmware zone
When the firmware of a hard disk becomes corrupted or unreadable, the computer is often unable to interact correctly with the hard disk.
Electronic Failure
Electronic failure usually relates to problems on the controller board of the hard disk. For example, the server may suffer a power spike or electrical surge that knocks out the controller board on the hard disk, making the disk undetectable to the controller BIOS.
Mechanical Failure
Mechanical failure can often (especially if not acted on early) lead to a partial and sometimes total loss of data. Mechanical failure comes in various guises such as read/write head failure and motor problems. One of the most common mechanical failures is a head crash. Varying in severity, a head crash occurs when the read/write heads of the hard disk come into contact, momentarily or continuously, with the platters of the hard disk.
Head crashes have a range of causes, including physical shock (such as dropping the disk on the floor), movement of the computer, static electricity, power surges, and mechanical read/write head failure.
Logical Failure
Logical errors can be both the easiest and the most difficult problems to deal with. They range from simple things, such as an invalid entry in a file allocation table, to truly horrific problems, such as the corruption and loss of the file system on a severely fragmented drive.
Logical errors differ from the electrical and mechanical problems above in that there is usually nothing 'physically' wrong with the disk, only with the information stored on it.
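As a rough illustration of an invalid entry in a file allocation table, the sketch below follows a chain of clusters through a simplified table and reports an entry that points outside the table or loops back on itself. The table layout, the END_OF_CHAIN marker, and the follow_chain function are assumptions made for this example and do not represent a real FAT parser.

```python
# Hypothetical, simplified allocation table: fat[n] holds the number of the
# cluster that follows cluster n. Index 0 is treated as reserved.
END_OF_CHAIN = -1  # assumed marker for the last cluster of a file


def follow_chain(fat, start):
    """Return the clusters of a file in order, or raise ValueError if the chain is invalid."""
    chain = []
    seen = set()
    cluster = start
    while cluster != END_OF_CHAIN:
        # An entry must point at a usable slot inside the table.
        if cluster <= 0 or cluster >= len(fat):
            raise ValueError(f"invalid entry: {cluster}")
        # Visiting a cluster twice means the chain loops and never ends.
        if cluster in seen:
            raise ValueError(f"chain loops back to cluster {cluster}")
        seen.add(cluster)
        chain.append(cluster)
        cluster = fat[cluster]
    return chain


healthy = [0, 2, 3, END_OF_CHAIN, 0]
damaged = [0, 2, 9, END_OF_CHAIN, 0]  # entry 2 points at a cluster that does not exist

print(follow_chain(healthy, 1))
try:
    follow_chain(damaged, 1)
except ValueError as err:
    print("logical error:", err)
```

In both tables the underlying storage is intact. Only one stored value differs, which is why such a fault can be trivial to repair in one case and very hard to untangle in another.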
Media Errors
Bad sectors are areas of the hard disk that become unreadable. Every type of hard disk is prone to developing bad sectors 'naturally', and all hard disk drives develop them eventually. Sectors that go bad are marked by the hard disk and not used any further, but if data resides on a sector that goes bad, that data or file can no longer be accessed correctly. Bad sectors do not always develop 'naturally', however: harsh operating conditions (such as high temperatures, vibration, and so on) can cause hard disks to develop many bad sectors quickly.
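The behavior described above can be sketched with a toy model. It assumes that the disk redirects a sector it has marked bad to a spare sector; the spare pool, the ToyDisk class, and its method names are assumptions made for this illustration and do not describe any particular drive's firmware.

```python
class ToyDisk:
    """Toy model of a disk that marks bad sectors and stops using them."""

    def __init__(self, sector_count, spare_count):
        self.store = {}   # physical sector -> data
        self.remap = {}   # logical sector -> spare physical sector
        self.spares = list(range(sector_count, sector_count + spare_count))
        self.lost = set()  # logical sectors whose old data is unreadable
        self.dead = set()  # bad logical sectors with no spare available

    def _physical(self, sector):
        return self.remap.get(sector, sector)

    def mark_bad(self, sector):
        # Whatever was stored on the failing sector can no longer be read.
        if self.store.pop(self._physical(sector), None) is not None:
            self.lost.add(sector)
        if self.spares:
            # Future use of this sector goes to a spare instead.
            self.remap[sector] = self.spares.pop(0)
        else:
            self.remap.pop(sector, None)
            self.dead.add(sector)

    def write(self, sector, data):
        if sector in self.dead:
            raise OSError(f"sector {sector} is bad and no spare is left")
        self.store[self._physical(sector)] = data
        self.lost.discard(sector)

    def read(self, sector):
        if sector in self.lost or sector in self.dead:
            raise OSError(f"sector {sector} is unreadable")
        return self.store.get(self._physical(sector))


disk = ToyDisk(sector_count=4, spare_count=1)
disk.write(2, "payroll")
disk.mark_bad(2)             # the sector holding "payroll" goes bad
try:
    disk.read(2)
except OSError as err:
    print(err)               # the old data is gone
disk.write(2, "payroll v2")  # new writes land on the spare
print(disk.read(2))
```

The model also shows why many bad sectors appearing quickly is a concern: once the assumed spare pool is empty, a newly failed sector can no longer be replaced at all.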
SCSI/SAS Environment
SCSI hard disks are often regarded as high-performance drives. They spin faster than their IDE/SATA counterparts, so data transfer speeds are often higher. Because of this, SCSI drives are often found in servers that have to provide high data throughput. However, this performance often comes at a price, as mechanical failures are more likely on these drives.
The most common cause of multiple disk failure in this environment is poor signal quality across the SCSI bus. Poor signal quality results in SCSI protocol overhead as the protocol tries to recover from these problems (timeouts and bus resets). As the system becomes busier and demand for data increases, the corrective actions of the SCSI protocol increase and the SCSI bus comes closer to saturation. This overhead eventually limits the bandwidth available for normal device communications. If the problem is left uncorrected, one or more SCSI devices may not be able to respond to the RAID controller in a timely manner, resulting in the RAID controller marking the hard disk drive offline. These types of signal problems can be caused by improper installation of the RAID controller in a PCI slot, poor cable connections, poor seating of the disks against the SCSI backplane, improper installation or seating of backplane daughtercards, and improper SCSI bus termination.
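The arithmetic behind this effect can be sketched with a very rough model in which each command that needs recovery consumes extra bus time. All names and numbers below (capacity, error rate, recovery cost) are hypothetical values chosen for illustration, not measurements of any real SCSI bus.

```python
def bus_utilization(commands_per_sec, capacity_per_sec, error_rate, recovery_cost):
    """Fraction of bus time consumed when recovery work is counted as extra commands.

    error_rate:    assumed fraction of commands that trigger recovery (0.0 to 1.0)
    recovery_cost: assumed bus time of one recovery, in units of one normal command
    """
    effective_load = commands_per_sec * (1 + error_rate * recovery_cost)
    return effective_load / capacity_per_sec


def is_saturated(utilization):
    # At or above 1.0 the bus cannot keep up, so responses start to arrive late.
    return utilization >= 1.0


CAPACITY = 1000  # hypothetical commands per second the bus can carry

for load in (400, 700):
    clean = bus_utilization(load, CAPACITY, error_rate=0.0, recovery_cost=10)
    noisy = bus_utilization(load, CAPACITY, error_rate=0.05, recovery_cost=10)
    print(load, round(clean, 2), round(noisy, 2), is_saturated(noisy))
```

With these assumed numbers, the noisy bus still copes at the lighter load but passes saturation at the heavier one, while the clean bus handles both. This matches the pattern described above, where the problem only shows up as the system becomes busier.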
Combinations of these failure types are also possible.
All technicians and customers should read and understand the maintenance best practices in order to maximize uptime and help prevent data loss as a result of hard disk failure.