Failure Modes and Effects Analysis for Oracle 9i RAC on Linux using Dell Best Practices - KB Article - 133176



Contingency Planning

Figure 1 - A Failure Tolerant Solution designed using Dell Best Practices

Mission-critical solutions must allow for the possibility of component failure. Because the probability that some component of a complex solution will fail over its lifetime is non-negligible, a design that can withstand component failures is a requirement for many enterprises, and planning for such failures in advance is equally important. Following Dell Best Practices allows the administrators in an enterprise to create solutions that can withstand component failures. An example Oracle 9i database solution is shown in Figure 1. Designed in accordance with Dell Best Practices and combined with a mechanism for backup and recovery of the storage subsystem, this solution removes all single points of failure.



Disaster Recovery

For enterprise users, a well-defined disaster recovery plan is critical. Such a plan allows administrators to restore database service in the least amount of time, and with the most complete database content, after a serious and unanticipated production service disruption. The following is an overview of some of the backup and recovery methodologies available.

Tape Backup

An inexpensive backup methodology is tape backup. Using third-party tools such as LEGATO NetWorker™ software or Veritas BackupExec™, regular backups can be written to a tape drive, and the tapes can be stored off-site. If the data becomes corrupted, the tapes can be used to restore the database to a previous point in time. For a detailed analysis of two different tape backup solutions for a 9i RAC database using LEGATO NetWorker, see the white paper Backing Up and Restoring an Oracle Database.
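
As a rough illustration of how such backups are often scripted, the following sketch assumes that Oracle Recovery Manager (RMAN) is used together with the media management layer supplied by the tape backup software; the channel name is a placeholder and the exact commands should be adapted to your environment:

# Illustrative sketch only -- back up the database to tape through the media
# management (sbt) channel provided by the tape backup software.
rman target / <<EOF
RUN {
  ALLOCATE CHANNEL t1 DEVICE TYPE sbt;   # tape channel via the media management layer
  BACKUP DATABASE PLUS ARCHIVELOG;       # full database backup plus archived redo logs
  RELEASE CHANNEL t1;
}
EOF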

Synchronous Mirroring

EMC provides applications such as MirrorView™ that perform synchronous mirroring of data between different SANs. The secondary SAN always holds a copy of the production data from the primary site, so if the primary site fails, the secondary system can be brought online very quickly. Such a setup allows administrators to maintain a hot backup of the database system.

EMC's SnapView™ allows administrators to capture point-in-time snapshots of the production data. In the case of a SAN failure due to a failed disk drive, the SAN can then be quickly restored to its pre-failure state after a simple disk replacement.

For details on creating a remote backup of the SAN, see the white paper Using SnapView and MirrorView for Remote Backup.



Component Failures and their Effects

In spite of all the testing performed, it is not realistic to assume that a working component in a system will last forever. All components of such complex solutions, although designed to adhere to strict quality standards, can be expected to fail over long periods of time. The following sections detail the expected effect of such component failures on the Oracle 9i database solution, and how to minimize their impact.

Server Component Failure

  • Internal Disks - If the Oracle database node is running on a single internal disk and that disk fails, the administrator has to replace the internal disk and install the Oracle software afresh on this node. After the software has been installed on the system, consult the section of the Dell Deployment Guide that details how to add a newly deployed node to the cluster.

    Best Practices, however, dictate that the internal disks hosting the operating system and the database software be configured in at least RAID 1, and that administrators keep a few spare disks on hand. Since all supported Dell platforms have hot-plug hard drives, if a single disk in such a RAID configuration fails, it can easily be removed and replaced on a live system without any impact on cluster operation. For details on how to rebuild such a failed disk, check the documentation provided with your PERC card or the internal ROMB included in your system; the virtual disk status can also be checked from the operating system, as shown in the monitoring sketch after this list.

  • CPU - If the CPU in a single-CPU node fails, the node shuts down and the Oracle cluster software removes the failed node from the cluster; the CPU has to be replaced immediately. If the node has multiple CPUs and a single CPU fails, the node also shuts down, but to allow it to operate temporarily until a replacement CPU is procured, the failed CPU can be removed and the system restarted. The only negative effect of running with one fewer CPU is lower performance on that node.

    Once the CPU has been replaced, start the database on this node (see the instance restart sketch after this list). The node will join the cluster without any intervention by the administrator.

  • Server Node - If a node fails, the Oracle database software removes it from the cluster, and the performance of the entire cluster is reduced because of the missing node. A replacement node should be procured and installed as soon as possible. To add this node to the cluster, consult the section of the Dell Deployment Guide on adding a new node to a cluster.

    The health of individual nodes can be monitored using Dell SNMP-based tools such as OpenManage Server Administrator; a brief command-line monitoring sketch follows this list. For more information about automating the monitoring, see the PowerSolutions paper Using Server Administrator to Automate the Monitoring of Server Health.
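
As an illustration of restarting the database instance on a repaired node, the following minimal sketch assumes the Oracle 9i srvctl utility and uses hypothetical database and instance names (orcl and orcl2); substitute the names from your own configuration:

# Minimal sketch -- start the local instance on the repaired node and confirm that
# all instances in the cluster are running ("orcl" and "orcl2" are placeholders).
srvctl start instance -d orcl -i orcl2
srvctl status database -d orcl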
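
For local, scriptable health checks, the Server Administrator command-line interface can also be used. The following is a sketch only; command availability and exact syntax depend on the OpenManage Server Administrator version installed:

# Sketch only -- check overall system health and internal RAID status with the
# OpenManage Server Administrator CLI (omreport); syntax varies by OMSA version.
omreport system summary                 # overall system summary
omreport chassis                        # chassis health (fans, power supplies, temperatures)
omreport storage vdisk controller=0     # state of the RAID 1 virtual disk on the PERC/ROMB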

Cluster Interconnect Failures

  • Private Network Interface Card (NIC) - If the private interface fails, the database software regards that node as failed and removes it from the cluster, lowering the performance of the entire cluster. To recover from such a failure, remove the NIC from the system (or disable it in the BIOS if it is on the motherboard) and insert a new NIC. If the node is running OCFS, also stop the cfs service, delete the last line of /etc/ocfs.conf, re-run ocfs_uid_gen -c, and start the cfs and ocmstart services (these steps are sketched after this list). After confirming that the node can communicate with all the other nodes in the cluster, start the database on the node; the Oracle software will let the node rejoin the cluster with little administrator intervention.

    To avoid the negative repercussions of an interface failure, Dell Best Practices recommend using NIC teaming, which provides a redundant interface for the network connections. If a single interface fails, the other interface offers enough redundancy that the failure does not adversely affect the functioning of the cluster, because the failover is practically instantaneous. To configure NIC teaming, consult the section “Configuring Interconnect Redundancy” in the Dell Deployment Guide; a sample Linux bonding configuration is also sketched after this list.

  • Private Network Switch - If the entire private network fails, all the nodes go down, and users on the external network cannot access the database at all. To recover from such a failure, replace the switch, confirm that all the nodes can communicate with each other, and then start the database on all the nodes. The nodes will rejoin the cluster on their own.

    As a safeguard against such failures, NIC teaming can be used in conjunction with two redundant switches. This creates two separate private networks, so the failure of one network does not cause the database to be shut down on all the nodes. If a switch in such a setup fails, it can take the other switch up to 90 seconds to take over; since the cluster manager timeout is set to 200 seconds on Dell installs, the transition requires no intervention by the administrator.
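
The OCFS-related recovery steps named in the private NIC item above might look roughly like the following. This is a sketch based only on the steps listed in that item; service names and file locations should be verified against your own OCFS installation:

# Sketch of the OCFS steps named above -- verify service names and paths for your install.
service cfs stop                        # stop the cluster file system service
# Edit /etc/ocfs.conf with a text editor and delete its last line, then regenerate
# the node identifier for the new interface:
ocfs_uid_gen -c
service cfs start                       # restart the cluster file system service
service ocmstart start                  # restart the cluster manager service (name as referenced above)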
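
For reference, NIC teaming on Linux is typically implemented with the bonding driver. The fragment below is illustrative only, with hypothetical interface names and addresses; the authoritative procedure is the “Configuring Interconnect Redundancy” section of the Dell Deployment Guide:

# Illustrative bonding setup (hypothetical names and addresses).
# /etc/modules.conf -- load the bonding driver for the teamed interface:
#   alias bond0 bonding
# /etc/sysconfig/network-scripts/ifcfg-bond0 -- the teamed private interface:
#   DEVICE=bond0
#   IPADDR=192.168.1.1
#   NETMASK=255.255.255.0
#   ONBOOT=yes
#   BOOTPROTO=none
# /etc/sysconfig/network-scripts/ifcfg-eth1 -- first slave (repeat for the second private NIC):
#   DEVICE=eth1
#   MASTER=bond0
#   SLAVE=yes
#   ONBOOT=yes
#   BOOTPROTO=none
service network restart                 # bring the bonded interface up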

LAN Connection

External Network Interface Card: If the external interface fails, external users lose connectivity to that node. The node is not removed from the cluster, because it continues to respond to the other nodes on the private interface, but the performance of the cluster is degraded as the remaining nodes take over the users previously routed to this node. To recover from such a failure, shut down the node and replace the failed interface (or disable it in the BIOS if it is an onboard interface and replace it with an add-in interface card). After switching the system back on, ensure that it can communicate with all the other nodes on the external network. Once the database on this node is started, the node will rejoin the cluster.

To guard against external network failures as well, Dell recommends using NIC teaming and redundant external switches; a brief connectivity check is sketched below.
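
After replacing a failed external interface, a quick way to confirm that clients can reach the node again is to check the public interface and the Oracle listener. The following is a sketch only, using a hypothetical interface and host name:

# Sketch -- confirm public connectivity and the Oracle listener on the repaired node.
ifconfig eth0                           # verify the public interface is up (interface name may differ)
ping -c 3 dbnode2                       # reach another cluster node over the public network (hypothetical name)
lsnrctl status                          # confirm the Oracle listener is accepting connections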

Fiber Channel Switch Fabric

Host Bus Adapter (HBA): If the HBA in a node fails, the node can no longer see the external fiber channel storage system, which causes the node to be taken out of the cluster. Replace the HBA on this node and, using the SAN management tools, reconfigure the HBA to join the same storage group it was in previously. Once the node can see the shared storage, restarting the database on the node causes it to rejoin the cluster.

To avoid such a failure, Dell Best Practices recommend that the database solution be deployed with at least two HBAs in conjunction with EMC's PowerPath software, which allows redundant pathways to be created from the node to the SAN. Under normal conditions the multiple pathways provide I/O load distribution; in the event of a single HBA failure, the remaining pathway keeps the database from shutting down, and the system continues to run, albeit at lower performance, until the HBA is replaced. To install PowerPath on a functional database, consult the knowledge base article titled “Migrating an OCFS database to OCFS with PowerPath”. A brief path-verification sketch follows this paragraph.
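
The following PowerPath commands can be used to confirm that all paths to the SAN are healthy again after an HBA or switch replacement. This is a sketch only, and the output format varies by PowerPath version:

# Sketch -- verify that PowerPath sees all paths to the SAN after the repair.
powermt display dev=all                 # list managed devices and the state of each path
powermt restore                         # retest and restore any paths marked dead
powermt display paths                   # per-port summary of available path counts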

Fiber Channel Switch: A fiber channel switch failure blocks all the pathways from all the nodes to the SAN, which causes the database to be shut down on all the nodes. To recover from such a failure, replace the fiber channel switch and, after ensuring that each node can see the shared storage, restart the database on all the nodes.

To mitigate the risk of such a failure, database administrators can install PowerPath on the cluster and, using two or more HBAs in each node and two fiber channel switches, create at least two separate paths from each node to the SAN. Even if one of the switches fails, the other path keeps the database functional, with some performance degradation, until the switch is replaced. To install PowerPath on a functional database, consult the knowledge base article titled “Migrating an OCFS database to OCFS with PowerPath”.

Storage Subsystem: In the case of a storage subsystem failure, the potential for data loss is very high. Such a failure causes the entire database to shut down, because none of the nodes can access the storage.

If the storage system failure is caused by a component that can easily be replaced, such as a power supply or a storage processor, there may be little or no data loss; once the affected component is replaced, the storage system can be brought back up easily. But if some of the disks are affected, actual data can be lost. Dell Best Practices recommend configuring the storage groups in at least a RAID 5 configuration. This affords some data protection, but a backup solution is still required to ensure that nothing is lost in the event of a failure. A tape backup solution can be employed to make regular backups of the SAN, and if an enterprise cannot afford to have the database offline even for a very short time, the SAN can be mirrored to a hot backup. A minimal restore sketch follows.
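
If the database files have to be recovered from a tape backup after the storage is repaired, the broad shape of an RMAN restore is sketched below. This assumes backups such as the RMAN example in the Tape Backup section and is illustrative only; an actual recovery should follow Oracle's documented procedures:

# Sketch -- restore and recover the database from the most recent backups after
# the storage subsystem has been repaired; run from one node of the cluster.
rman target / <<EOF
STARTUP MOUNT;
RESTORE DATABASE;                       # pull the datafiles back through the media manager
RECOVER DATABASE;                       # apply archived redo logs
ALTER DATABASE OPEN;
EOF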



Introduction

Triple-nine, or 99.9%, availability means that a system is down for about 8 hours and 46 minutes in a single year. Depending on the system, such downtime can cost a business enterprise up to 5 million dollars (Standish Group, 2001). Oracle 9i RAC solutions are often used for business-critical databases where high availability and predictability are crucial; proper planning helps prevent failures from occurring and allows failover to proceed optimally, preserving overall system uptime. In this paper the authors document the behavior of a database cluster in case of component failure, as well as the best practices to prevent failures and to recover if the cluster fails.
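
To see where the 8 hours and 46 minutes figure comes from: an average year contains about 8,766 hours (365.25 days × 24 hours), and the 0.1% of that time allowed as downtime at 99.9% availability is roughly 8.77 hours, or about 8 hours and 46 minutes.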

Introduction to Oracle Clusters

The Oracle cluster consists of up to eight Dell servers, a shared storage system (a Storage Area Network – SAN – or Dell PowerVault SCSI storage), the Red Hat Linux Advanced Server 2.1 operating system, and the Oracle 9i Database software. If a SAN is used as the storage system, a fiber channel network, using a Brocade fiber channel switch, connects it to the servers. For SCSI storage, a SCSI cable connection between the node and the storage unit suffices.

An Oracle RAC cluster requires a private and a public network. The private network uses the gigabit Ethernet cards on the servers, which are connected via a Dell PowerConnect switch. Primary communication between the clustered nodes takes place over the private network, while clients and other application servers access the database over the public network.

Dell works with Oracle to develop the installation routines for this solution. The solution thus installed is a tightly bound configuration, and users should be careful when installing other software (including drivers and utilities) to ensure that it is compatible with the kernel version and the other software already installed.

Multi-Corporation Development & Testing Effort

Development of the solution's components is carried out by each vendor independently, and each vendor subjects its components to an extensive test cycle before Dell integrates them into the Oracle solution. The base operating system is maintained and tested by Red Hat. Dell and EMC develop the hardware components, including the servers and the storage. The 9i RAC software is developed by Oracle.

Solution Testing by Dell

In addition to developing some of the hardware pieces and the installation routines for the Oracle 9i software, Dell Computer Corporation runs the entire integrated solution through a comprehensive test cycle. To provide test coverage for all the supported platforms, Dell tests many different permutations of the database setup, covering the functioning of the different platforms with the different back-end storage systems to ensure that the entire solution is fully functional. The testing methodologies followed by Dell ensure that the interconnects work properly and that there are no contentions between the components of the database solution. Extended stress tests, which simulate a TPC-C benchmark load, are also performed on the solution before it is released to external customers.



 


Quick Tips content is self-published by the Dell Support Professionals who resolve issues daily. In order to achieve a speedy publication, Quick Tips may represent only partial solutions or work-arounds that are still in development or pending further proof of successfully resolving an issue. As such Quick Tips have not been reviewed, validated or approved by Dell and should be used with appropriate caution. Dell shall not be liable for any loss, including but not limited to loss of data, loss of profit or loss of revenue, which customers may incur by following any procedure or advice set out in the Quick Tips.

Article ID: SLN108952

Last modified: 11/14/2010 12:00 AM

