Saturday, August 16, 2014

Bug 16562733 : NODE EVICTION DUE TO FAILED IO OF VOTING DISK FROM CELL SERVER

For the past couple of days, the database servers in our Exadata environment had been rebooting intermittently. The cause was cssd crashing after a voting disk went offline.


We could see a VF (voting file) I/O error being reported, after which the voting disk went offline and triggered a node reboot. The reboots occurred at random times.



ocssd.log (inblrdrdbadm01)

~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
2014-08-05 16:27:19.041: [ SKGFD][1107020096]ERROR: -10(OSS Operation ioerror failed with error 12 [Network error]
)
2014-08-05 16:27:19.041: [ CSSD][1107020096](:CSSNM00060:)clssnmvReadBlocks: read failed at offset 16 of o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02
2014-08-05 16:27:19.041: [ CSSD][1107020096]clssnmSetupReadLease: status 1
...............
2014-08-05 16:27:26.021: [ CSSD][1097800000]clssnmvStatusBlkInit: myinfo nodename inblrdrdbadm01, uniqueness 1407150297
2014-08-05 16:27:26.021: [ CSSD][1097800000]clssnmvDiskAvailabilityChange: voting file o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02 now online
2014-08-05 16:27:26.022: [ SKGFD][1107020096]ERROR: -10(OSS Operation oss_open failed with error 5 [Failed to connect to a cell]
)
2014-08-05 16:27:26.022: [ CSSD][1107020096]clssnmvGetDiskHandle: Unable to open disk o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02
2014-08-05 16:27:26.022: [ CSSD][1107020096]clssnmvWorkerThread:failed to open o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02
2014-08-05 16:27:26.022: [ CSSD][1107020096]###################################
2014-08-05 16:27:26.022: [ CSSD][1107020096]clssscExit: CSSD signal 11 in thread clssnmvWorkerThread
2014-08-05 16:27:26.022: [ CSSD][1107020096]###################################
2014-08-05 16:27:26.022: [ CSSD][1107020096](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2014-08-05 16:27:26.022: [ CSSD][1107020096] 
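
The pattern in the ocssd.log excerpt above (a voting file read or open failure followed by a CSSD signal in clssnmvWorkerThread) can be searched for across nodes with a small script. The following is only a minimal sketch: the function name find_voting_disk_crash and the SAMPLE list are hypothetical, and the regular expressions are derived solely from the messages shown in this post, so they may not match other Grid Infrastructure versions.

```python
import re

# Message patterns copied from the ocssd.log excerpt in this post.
# Assumption: other versions may word these messages differently.
READ_FAIL = re.compile(r"clssnmvReadBlocks: read failed at offset \d+ of (\S+)")
OPEN_FAIL = re.compile(r"clssnmvGetDiskHandle: Unable to open disk (\S+)")
SIGNAL = re.compile(r"clssscExit: CSSD signal (\d+) in thread (\w+)")
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):")


def find_voting_disk_crash(lines):
    """Return one event per CSSD signal that follows a voting file I/O failure."""
    failed_disks = []  # voting files that failed since the last reported signal
    events = []
    for line in lines:
        ts = TIMESTAMP.match(line)
        stamp = ts.group(1) if ts else None

        failure = READ_FAIL.search(line) or OPEN_FAIL.search(line)
        if failure:
            disk = failure.group(1)
            if disk not in failed_disks:
                failed_disks.append(disk)
            continue

        signal = SIGNAL.search(line)
        if signal and failed_disks:
            events.append({
                "time": stamp,
                "signal": int(signal.group(1)),
                "thread": signal.group(2),
                "disks": list(failed_disks),
            })
            failed_disks = []
    return events


# Hypothetical sample built from three lines of the excerpt above.
SAMPLE = [
    "2014-08-05 16:27:19.041: [ CSSD][1107020096](:CSSNM00060:)clssnmvReadBlocks: "
    "read failed at offset 16 of o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02",
    "2014-08-05 16:27:26.022: [ CSSD][1107020096]clssnmvGetDiskHandle: "
    "Unable to open disk o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02",
    "2014-08-05 16:27:26.022: [ CSSD][1107020096]clssscExit: "
    "CSSD signal 11 in thread clssnmvWorkerThread",
]

if __name__ == "__main__":
    for event in find_voting_disk_crash(SAMPLE):
        print(event)
```

To scan a real log, pass an open file object instead of SAMPLE; the path to ocssd.log depends on the Grid home and host name and is not assumed here.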


cssdOUT.log 
~~~~~~~~~~~~ 
08/04/14 16:27:26: CSSD starting
08/04/14 16:34:56: CSSD starting
08/05/14 16:27:26: CSSD handling signal 11
08/05/14 16:27:26: Dumping CSSD state and exiting 



Oracle Support helped us diagnose the issue further and pointed to the following call stack.


----- Call Stack Trace ----- 

calling              call     entry                argument values in hex
location             type     point                (? means dubious value)
-------------------- -------- -------------------- ----------------------------
clssscExit()+740     call     kgdsdst()            000000000 ? 000000000 ?
                                                   041FB6D28 ? 000000001 ?
                                                   7FE200000001 ? 000000003 ?
s0clsssc_sighandler  call     clssscExit()         7FE27C1D87C0 ? 000000002 ?
()+611                                             041FB6D28 ? 000000001 ?
                                                   7FE200000001 ? 000000003 ?
__sighandler()       call     s0clsssc_sighandler  00000000B ? 000000002 ?
                              ()                   041FB6D28 ? 000000001 ?
                                                   7FE200000001 ? 000000003 ?
clsfInitIO()+40      signal   __sighandler()       7FE27C1FA3B0 ? 7FE27CCFF990 ?
                                                   000000000 ? 000000004 ?
                                                   000000001 ? 7FE280129A00 ?
clssnmvReadBlocks()  call     clsfInitIO()         7FE27C1FA3B0 ? 7FE27CCFF990 ?
+1136                                              000000000 ? 000000004 ?
                                                   7FE200000001 ? 7FE280129A00 ?
clssnmvVoteDiskVali  call     clssnmvReadBlocks()  7FE27C1D87C0 ? 000E2A730 ?
dation()+130                                       000000000 ? 000000004 ?
                                                   000000004 ? 7FE280129A00 ?
clssnmvWorkerThread  call     clssnmvVoteDiskVali  7FE27C1D87C0 ? 7FE27C025FC0 ?
()+1183                       dation()             000000000 ? 000E2A730 ?
                                                   000000004 ? 7FE280129A00 ?
clssscthrdmain()+25  call     clssnmvWorkerThread  7FE27C1D87C0 ? 7FE27C025FC0 ?
3                             ()                   000000000 ? 000E2A730 ?
                                                   000000004 ? 7FE280129A00 ?
start_thread()+221   call     clssscthrdmain()     7FE27C1D87C0 ? 7FE27C025FC0 ?
                                                   7FE27C025FC0 ? 000E2A73


The same call stack was reported at the time of every node reboot.

This is due to Bug 16562733: CSSD crash / node eviction due to failed IO against the voting disk.

Rediscovery information 
~~~~~~~~~~~~~~~~~~~~~~~~ 
cssd may crash when a voting disk open fails 

Rediscovery Notes:
cssd crashes with a stack like:
clssscExit()+740
s0clsssc_sighandler
__sighandler()
clsfInitIO()+40
clssnmvReadBlocks()
clssnmvVoteDiskValidation()+130
clssnmvWorkerThread
clssscthrdmain()+25
start_thread()+221 
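
To check whether a stack from another incident matches the Rediscovery Notes stack above, the frame names can be compared in order. This is only a rough sketch: the helper names frames_from_stack and matches_signature are hypothetical, and the input is assumed to list one frame per line (as in the Rediscovery Notes), not the raw column-wrapped trace where long function names are split across lines.

```python
import re

# Frame names copied from the Rediscovery Notes stack above, offsets dropped.
SIGNATURE = [
    "clssscExit",
    "s0clsssc_sighandler",
    "__sighandler",
    "clsfInitIO",
    "clssnmvReadBlocks",
    "clssnmvVoteDiskValidation",
    "clssnmvWorkerThread",
    "clssscthrdmain",
    "start_thread",
]


def frames_from_stack(text):
    """Extract function names from lines such as 'clssscExit()+740'."""
    frames = []
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z_]\w*)", line)
        if match:
            frames.append(match.group(1))
    return frames


def matches_signature(frames, signature=SIGNATURE):
    """True if every signature frame occurs in `frames` in the same order.

    Extra frames between signature frames are tolerated (subsequence match).
    """
    remaining = iter(frames)
    # `name in remaining` advances the iterator, which enforces the ordering.
    return all(name in remaining for name in signature)


if __name__ == "__main__":
    stack = "\n".join([
        "clssscExit()+740", "s0clsssc_sighandler", "__sighandler()",
        "clsfInitIO()+40", "clssnmvReadBlocks()",
        "clssnmvVoteDiskValidation()+130", "clssnmvWorkerThread",
        "clssscthrdmain()+25", "start_thread()+221",
    ])
    print(matches_signature(frames_from_stack(stack)))
```

A match only suggests the same failure path; confirming the bug still requires Oracle Support.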


Workaround: 
None 

As planned, we applied 11.2.0.3 Bundle Patch 24 (Jul-2014 PSU), which also fixed this issue.

Bug 16562733 is fixed in 11.2.0.3 BP 24 as part of Grid PSU 11.2.0.3.9.
