7 Posts

August 29th, 2005 12:00

Thanks,

I've been into Dellmgr and disabled the alarm.  The system ID is B4JT931.  The server has had all the recent Novell patches installed as well as firmware updates from Dell.  The server had been abending and/or hanging until the installation of NW6.5 SP3 and related patches about 30 days ago, so there is a lengthy abend.log.

I'm not sure about Array Manager.  I only recently began working with this server and am unfamiliar with Dell-specific tools, as I have primarily used HP in the past.  I do see a dellmon.nlm.  What is Array Manager, and can I download it if it's not installed?

Thanks

2 Intern

 • 

188 Posts

August 29th, 2005 12:00

If you loaded the Dell PEDGE3.HAM, it should have copied DELLMGR.NLM into SYS:SYSTEM. When you run this module from the console, it looks like the CTRL+M PERC BIOS. You can disable the alarm there.

As for the mega_HAM_Timeout error, this is going to take some troubleshooting. It is possible that it is a PERC failure, but I think at this point it is unlikely.

I would start by making sure all software has the latest patches installed. Check to see whether an ABEND.LOG exists in SYS:SYSTEM.

Also, you can pull a log from the PERC card itself. It can be pulled using Array Manager if you have it installed and configured, or Dell technical support can provide you with a tool called TTY that will allow you to pull the PERC log from DOS.

Please post your service tag number and let me know if your server is abending and if you have Array Manager installed.

Carrie

2 Intern

 • 

188 Posts

August 29th, 2005 12:00

Looks like your server has a PERC4dc. Check to see what firmware you have installed. You should be able to find it in DELLMGR in Object - Adapter - Other Adapter information.

Also, all the third party software needs to be patched, which includes your backup software.

As for OpenManage, take a look toward the bottom of the autoexec.ncf and see whether there are any comments for Dell software. If there are, copy the lines here and I'll tell you what you have installed. I would also like to know if you have Symantec Antivirus installed.
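If the autoexec.ncf is long, a quick scan of a copy of it can save some scrolling. The following is only a rough Python sketch, assuming the file has been copied to a workstation and read into a string; the function name and the "dell" keyword are illustrative choices, not part of any Dell or Novell tool.

```python
def dell_related_lines(ncf_text, keyword="dell"):
    """Return (line_number, line) pairs whose text mentions the keyword.

    The match is case-insensitive, so it catches comments as well as
    LOAD statements for modules such as DELLMON.NLM or DELLMGR.NLM.
    """
    hits = []
    for number, line in enumerate(ncf_text.splitlines(), start=1):
        if keyword.lower() in line.lower():
            hits.append((number, line.strip()))
    return hits
```

An empty result would suggest that nothing Dell-related is loaded from that file, although modules loaded from other NCF files would not show up here.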

Carrie

Message Edited by DELL-Carrie on 08-29-2005 09:14 AM

7 Posts

August 29th, 2005 13:00

I believe all 3rd party software has been patched, including my tape backup software (Arcserve 9.1).  The PERC firmware was patched to: 351S
 
There is nothing Dell-specific loading.  I had to manually load dellmon and dellmgr after the drive failure.
 
Thanks for your help.

2 Intern

 • 

188 Posts

August 29th, 2005 17:00

I am about to email you a file to pull the PERC log. You will need to down the server to DOS.

Carrie

7 Posts

August 30th, 2005 13:00

Carrie,
 
I have not received an e-mail yet, but I have tried to use the nwttylog.nlm to extract the PERC's log.  I was able to see entries in the log that started from my last reboot on 08/08,  but unfortunately, this tool did not extract any of the recent log entries.  For whatever reason, the log seems to be fine until it gets to PR log entries on 08/16, prints a "€" character, and then the log duplicates again from the beginning multiple times.
 
I did find the following Novell TID: http://support.novell.com/cgi-bin/search/searchtid.cgi?10096720.htm that was last updated this month, and indicates that the controller's cache memory module is defective.
 
Even though one of the drives was deactivated, it did not show any media errors, and I was able to reactivate and rebuild the offline drive.
 
What do you think?
 
-Thanks
 
 
 
Excerpt from NWTTYLOG.NLM Output:
 
TTY History for HA(0) -Bus 0x01 Device 0x02
                                        € 
 
T0: LSI Logic MegaRAID i960 firmware loaded
T0: Firmware version 351S build on May 31 2005 at 21:05:31
T0: Board is type 1000/1960/1028/0520
 
T0: Can_flush=0
T0: DRAM SIZE=64 MB
T0: DRAM_ALT sig invalid from previous boot
T0: FLUSH_ON_SYSTEM_RESET=0
 
T0: (C) LSI Logic Corporation 2002
T0: MegaRAID Series 520 firmware version 351S
T0: WAIT FOR BIOS....
T0: BIOS UP!
T0: Enabling data cache
T0: EepromInit: Family=14, SN=d50a1d010000
T0: EepromInit: Board SN not programmed
 
T0: Environment data:
T0: VALIDATION=None
 
T0: MFC data:
T0:     Vendor/DeviceID=1000/1960
T0:     SubVendor/SubDevice=1028/0520
T0:     OEM=Dell, clusterDisable=0, flexRaidDisable=0
T0:     rebuildRate=30, stripeSize=128, flushTime=4
T0:     cachedIo=1, writeBack=1, readAhead=0
T0:     channelBase='0', smartMode=6, alarmDisable=0
T0:     fastInitDisable=0, coercion=128M, disablePredictiveFail=0
T0:     disableWebBios=1, disableCtrlM=0, writeThroughWhenBatteryBad=1
T0:     zcrConfig=Undefined, keepSafteFailedStatus=0, autoHotSpareRestore=0
T0:     variableChkConRate=0, enableNvramDiskMgmtChange=0
T0:     dirtyLedShowsDriveActivity=0, disableConsChkRestoration=0
T0:     biosContinueOnError=0, biosAutoConfig=Prompt
T0:     disableRandomDriveDeletion=0
 
T0: DISK_CACHE_ADDR=d0d2c800
T0: MEM_END_ADDR=d3fffff0
T0: Found MPT LVD 1030(1/5/0) at fbff0004/0, mapped to 8bff0000
T0: Total LSI MPT Chips found 1
T0: LSI_InitMPT : start_index 0 totalLSIMPTChips 1
T0:  Verifying Image Signatures...VERIFIED
T0:  Verifying image check sum... VERIFIED
T0: The FW version being loaded is MPTFW-01.03.35.00-IT
T0: NextImageHeaderOffset=9c70, ExtImageSize=818
T0: FW download complete... Expecting LSI FW to start excute and come to ready state
T2: MISM CHN_STATE_MPT_GET_FW_FEAT chip 0
T2: PRESENT SCSI_ID = 0
T2: Changing scsiId to  7
T2: Check IOC FACTS chip 0
T2: MISM CHN_STATE_MPT_OPERATIONAL chip 0
T2: MISM: Reply frame size 60 start addr d0525320
T2: fe reply free frames posted
T2: MISM CHN_STATE_MPT_INIT_BUS_RST chip 0
T3: MPT_Poll: chip 0 CHN_STATE_MPT_INIT_BUS_RST
T5: DISM: Queued!
T5: MPT_ProcessIo Reply Fr 2 EVENT_NOTIFICATION
T5: MPI_EVENT_EVENT_CHANGE
T5: MPT_SetIocPageParameters: After Write CoalescingDepth=1, Timeout=0, Flags=1
T8: DISM_ProcessPprState: DomainVal done on all disks
T9: DISM: Complete!!!
 
T9: Physical device info:
T9: ID  NVRState  Vendor    Product           Rev    6   7  56
T9: --  --------  --------  ----------------  ----  --  --  --
T9: 00  Online    SEAGATE   ST336607LW        DS08  01  3e  0f
T9: 01  Online    SEAGATE   ST336607LW        DS08  01  3e  0f
T9: 02  Online    SEAGATE   ST336607LW        DS08  01  3e  0f
 
T9: battery init: battery backup circuit is not mounted
T9: TBBU: No TBBU h/w
T9: Verifying config struct at Addr e0001400
T9: NVRAM checksum OK - reading configuration
T9: DISK_CACHE_ADDR=d0d2c800
T9: MEM_END_ADDR=d3fffff0
T9: Memory End d3fffff0
T9: Total memory available for disk cache: 32d33f0
T9: Total Number of Cache Lines 811
T9: SS 128: mrs=3  lc=811 ldc=1  ps=1 cm=ff ba=0 LDs: 0
T9: LD  0: L=5  SS=128  Size=8776000  NL=811  Status=2  DT=251  BT=512
T9:        span 0: sBlk=00000000, nBlk=043bb000, dev=00-01-02
T9: can_flush = 0
T9: No Reconst:Checking drive info
T9: MIGRATE: 40LD or 8ld new drive  ch 0 tgt 0
T9: REF drive found at ch 0 tgt 0
T9: Attempting to perform drive roaming
T9: NOT Flushing Cache
T9: RMW: NVRAM structure valid - checking for active RMWs
T9: Inside  SymRequestPoolInit
T9: Memory Pages USed=100
T9: Memory Pages USed=164
T9: RequestQ ADDR=d0210000
T9: exit   SymRequestPoolInit
T9: inside   SymMsgFifoInit
T9: InboundFreeQEnd=d0200000
T9: exit   SymMsgFifoInit
T10: inside   SymMsgFifoInit
T10: exit  SymMsgFifoInit
T10: Memory Pages USed=34a
T10: Memory Pages USed=34e
T10: Memory Pages USed=34f
T10: Memory Pages USed=374
T10: found 1 logical drives
T10: logDrv=0  Size=8776000
08/08 19:22:31: Time established at T68
08/08 19:22:31: BIOS CALL FOR DRV ROAMING : 55
08/08 19:22:31: drive roaming not done
08/08 20:00:00: prDiskStart: starting Patrol Read on PD=00
08/08 20:00:00: prDiskStart: starting Patrol Read on PD=01
08/08 20:00:00: prDiskStart: starting Patrol Read on PD=02
08/08 20:00:00: Next PR scheduled to start at 08/09  0:00:00
08/08 21:16:56: prCallback: PR completed for pd=01
08/08 21:16:56: prDiskDone: finishing PR on PD=01
08/08 21:17:16: prCallback: PR completed for pd=02
08/08 21:17:16: prDiskDone: finishing PR on PD=02
08/08 21:17:24: prCallback: PR completed for pd=00
08/08 21:17:24: prDiskDone: finishing PR on PD=00
08/08 21:17:24: PR cycle complete
08/09  0:00:00: prDiskStart: starting Patrol Read on PD=00
08/09  0:00:00: prDiskStart: starting Patrol Read on PD=01
08/09  0:00:00: prDiskStart: starting Patrol Read on PD=02
08/09  0:00:00: Next PR scheduled to start at 08/09  4:00:00
08/09  1:35:37: prCallback: PR completed for pd=01
08/09  1:35:37: prDiskDone: finishing PR on PD=01
08/09  1:35:49: prCallback: PR completed for pd=00
08/09  1:35:49: prDiskDone: finishing PR on PD=00
08/09  1:36:44: prCallback: PR completed for pd=02
08/09  1:36:44: prDiskDone: finishing PR on PD=02
08/09  1:36:44: PR cycle complete
08/09  4:00:00: prDiskStart: starting Patrol Read on PD=00
08/09  4:00:00: prDiskStart: starting Patrol Read on PD=01
08/09  4:00:00: prDiskStart: starting Patrol Read on PD=02
08/09  4:00:00: Next PR scheduled to start at 08/09  8:00:00
 
(REMOVED FOR BREVITY)
 
08/16 20:00:00: prDiskStart: starting Patrol Read on PD=00
08/16 20:00:00: prDiskStart: starting Patrol Read on PD=01
08/16 20:00:00: prDiskStart: starting Patrol Read on PD=02
08/16 20:00:00: Next PR scheduled to start at 08/17  0:00:00
08/16 22:44:16: prCallback: PR completed for pd=01
08/16 22:44:16: prDiskDone: finishing PR on PD=01
08/16 22:49:32: prCallback: PR completed for pd=02
08/16 22:49:32: prDiskDone: finishing PR on PD=02
08/16 23:31:04: prCallback: PR completed for pd=00
08/16 23:31:04: prDiskDone:      € 
 
AND LOG REPEATS
 
T0: LSI Logic MegaRAID i960 firmware loaded
T0: Firmware version 351S build on May 31 2005 at 21:05:31
T0: Board is type 1000/1960/1028/0520 ......
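For anyone comparing a dump like this, here is a minimal Python sketch (not a Dell or Novell tool) that splits a saved NWTTYLOG.NLM output into passes and reports the last timestamp in each one, which makes it easier to see where the log wraps. It assumes the output was saved to a text file and that every pass begins with the "firmware loaded" banner shown in the excerpt above; the function names are made up for illustration.

```python
import re

# Assumption: each pass of the dump begins with this boot banner line.
BANNER = "firmware loaded"
# Matches entries such as "08/09  0:00:00: prDiskStart: ..."
STAMP = re.compile(r"^(\d{2}/\d{2})\s+(\d{1,2}:\d{2}:\d{2}):")


def split_passes(log_text, banner=BANNER):
    """Return (preamble, passes); each pass is a list of lines starting at a banner."""
    preamble, passes = [], []
    for line in log_text.splitlines():
        if banner in line:
            passes.append([line])
        elif passes:
            passes[-1].append(line)
        else:
            preamble.append(line)
    return preamble, passes


def last_timestamp(pass_lines):
    """Return the last 'MM/DD HH:MM:SS' stamp in a pass, or None if there is none."""
    for line in reversed(pass_lines):
        match = STAMP.match(line.strip())
        if match:
            return f"{match.group(1)} {match.group(2)}"
    return None
```

Running split_passes on the saved text and calling last_timestamp on each pass shows whether every repeat stops at the same entry, which would point to the dump wrapping rather than to new events being logged.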

2 Intern

 • 

188 Posts

August 30th, 2005 14:00

The tty log looks fine. I hadn't seen the Novell TID you posted.

Please call into Dell technical support and ask them to replace the raid card. Give the Dell tech a link to this forum post. Please let me know if replacing the PERC resolves the error.

Carrie

2 Intern

 • 

188 Posts

August 30th, 2005 15:00

You should get the replacement PERC card today. Let me know if this fixes the error.

Carrie

7 Posts

August 30th, 2005 15:00

I have.  After 3 hrs of telephone time, they tell me this error is a software error, not a hardware problem.  After much wrangling, they agreed to send me a card, which I just received.  I'll let you know the outcome.

Any idea why the log didn't have any of the more recent entries in it?

Thanks again for your assistance.  I really do appreciate it.

12 Posts

September 21st, 2005 11:00

If you look a little further down in this forum you will see that I posted the identical error message.  I wrangled with Dell for a couple of days, lost a ton of data, and then they replaced the card.  It's amazing how the error went away after the card was replaced.  Note that the software is the same, but Dell always blames the software.

10 Posts

September 21st, 2005 20:00

I've had it on 2 servers myself, and I posted it here before.  The tech I got (lucky me) knew what he was doing, and from those logs was able to determine "without a doubt", according to him, that this error indicates bad RAM on the RAID controller.  It's not the controller itself.  But both times, they just replaced the controller.

There must have been a bad production run, as these 2 servers were identical and purchased at the same time.  Both had the problem.

1 Message

September 22nd, 2005 07:00

Hi Gman5, has your issue been resolved after replacement of the PERC controller? I currently have a customer who is facing the same issue as yours.

12 Posts

September 22nd, 2005 08:00

I recommend that you replace the controllers immediately.  The problem is bad cache on the controller.  Looks like Dell might have had a bad production run.  When you get your new controllers double-check that the firmware is the latest version (the ones shipped to me were a few revisions behind).  If you continue running your servers with these HAM errors you are likely to experience loss of, or corruption of, data.

7 Posts

September 22nd, 2005 12:00

Replacing the PERC controller did resolve my issues, although I had to argue with support for hours to convince them to send me a replacement.  The only reason I found the error was that one drive in my array went offline, and in the course of working on it, I found this error, as well as another post here with the same issue.  Novell also has a TID on this error.
 
Luckily, Dell did send me a replacement controller, and just a few days later, I did have a drive failure that could not be corrected without replacement of the drive.