Recently, I got a chance to perform an Exadata patching activity on our X3-2 quarter-rack Exadata box (2 compute nodes + 3 cell nodes) in coordination with Oracle, which consisted of:
1) Upgrade the image of the DB servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion (the most up-to-date release at the time of writing).
2) Apply Bundle Patch 24 (JUL 2014 - 11.2.0.3.24), the QCPE and QDPE, on the RAC Oracle Homes:
Patch description: "QUARTERLY CRS PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) (18906063)"
Patch description: "QUARTERLY DATABASE PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) : (18707883)"
3) Run catbundle in the RAC databases running from the patched homes.
4) Upgrade the image of the cell (storage) servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion.
5) Apply the patch for the InfiniBand (IB) switches.
Obviously, these five steps took a lot of planning and prerequisite checks.
The reason we went for the image upgrade is that we had hit the following bug, which is resolved in the 11.2.3.3.1 image.
One of our disks was showing the status below.
Issue:
CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=harddisk
20:0 R7K4ND normal
20:1 R7LPXD normal
20:2 R7P2ND normal
20:3 R7ESGD normal
20:4 R7H27D warning - poor performance
20:5 R7PK9D normal
20:6 R7GWJD normal
20:7 R7PL2D normal
20:8 R7DN1D normal
20:9 R7EASD normal
20:10 R748SD normal
20:11 R6X83D normal
[root@inblrdrceladm01 ~]# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_DR_CD_00_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_01_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_02_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_03_inblrdrceladm01 active DROPPED Yes
DATA_DR_CD_04_inblrdrceladm01 proactive failure DROPPED Yes
DATA_DR_CD_05_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_02_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_03_inblrdrceladm01 active DROPPED Yes
DBFS_DG_CD_04_inblrdrceladm01 proactive failure DROPPED Yes
DBFS_DG_CD_05_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_00_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_01_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_02_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_03_inblrdrceladm01 active DROPPED Yes
RECO_DR_CD_04_inblrdrceladm01 proactive failure DROPPED Yes
RECO_DR_CD_05_inblrdrceladm01 active ONLINE Yes
CellCLI> list cell detail
cellsrvStatus: stopped
msStatus: running
rsStatus: running
Because of this, cellsrv was stopping automatically on the first cell node.
Cause (found after raising an SR with Oracle):
We were hitting Bug:17021128 : NIWOT "CHIP PAUSED" CAUSES HIGH SERVICE TIME ON ALL DRIVES
Per the SR, this affects storage servers where the LSI MegaRAID firmware is below 12.12.0-0178. It has been observed primarily on systems running Exadata Storage Software 11.2.3.2.0 or 11.2.3.2.1 where the LSI MegaRAID firmware is 12.1.2.0-0140, and Oracle identified that we were on this firmware version.
Further evidence of this are the "Chip 0 Paused" messages in the MegaCli firmware log, found during the SR investigation:
06/26/14 8:34:06: [9e]= 1 [a0]= f [a2]= 9
06/26/14 8:37:59: DM: Chip 0 Paused
06/26/14 8:37:59: Chip <0> Slots: Cur=[133]
06/26/14 8:37:59: [87]= 3 [89]= b [8b]= f [8e]= f [90]=14
06/26/14 8:41:11: DM: Chip 0 Paused
06/26/14 8:41:11: Chip <0> Slots: Cur=[69]
06/26/14 8:41:11: [47]= 1 [4a]= c [4c]=17 [4e]= e [50]= e
06/26/14 8:41:16: DM: Chip 0 Paused
06/26/14 8:41:16: Chip <0> Slots: Cur=[74]
06/26/14 8:41:16: [4c]= 1 [4e]= e [50]= e
06/26/14 8:43:23: DM: Chip 0 Paused
06/26/14 8:43:23: Chip <0> Slots: Cur=[201]
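To confirm the firmware level on a cell node yourself, something along these lines can be used (a sketch; /opt/MegaRAID/MegaCli/MegaCli64 is the usual location of the LSI utility on Exadata cells):
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -i 'FW Package'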
To resolve this issue, the following was recommended:
Install LSI MegaRAID firmware version 12.12.0-0178, which is included in image 11.2.3.2.2 or 11.2.3.3.0.
Solution: (remember, this is the rolling method, so there was no need to bring down EBS, Siebel, Hyperion, or any other application)
1) Upgrade the image of the DB servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion
./dbnodeupdate.sh -u -l /u01/patches/YUM/p18876946_112331_Linux-x86-64.zip -v (verify prerequisites)
./dbnodeupdate.sh -u -l /u01/patches/YUM/p18876946_112331_Linux-x86-64.zip -b (take the backup)
./dbnodeupdate.sh -u -l /u01/patches/YUM/p18876946_112331_Linux-x86-64.zip -n (run the update; backup already taken above)
./dbnodeupdate.sh -c (completion step, after patching and node reboot)
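Once the -c completion step has run, the new image can be confirmed on each DB node (a quick sanity check, not a mandatory step from the runbook):
# imageinfo -version
# imagehistory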
2) Apply Bundle Patch 24 (JUL 2014 - 11.2.0.3.24) on the RAC Oracle Homes (manual method)
On the ASM/Grid Home first, followed by the remaining RDBMS RAC Homes. (We have 11 RDBMS Homes for different applications.) The high-level flow per home was (an opatch sketch follows after these steps):
Stop the EM agents (if running)
Check for conflicting patches and roll them back, if any
Run the prepatch script
Apply QUARTERLY CRS PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) (18906063) and QUARTERLY DATABASE PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) (18707883 - required only on the DB Homes, not on the Grid Home)
Run the postpatch script
Re-apply the conflicting patches that were rolled back earlier
Start CRS with rootcrs.pl -patch
crsctl check crs
Start Both EM Agents
crsctl stat res -t
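As a rough illustration of the per-home flow above, the opatch part looked broadly like the following. This is only a sketch: /u01/patches/BP24 is an illustrative path, and the Grid Home additionally needs the root/rootcrs.pl steps from the patch README.
cd /u01/patches/BP24/18906063 (run as the Oracle Home owner, from the unzipped patch directory)
$ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -ph ./
$ORACLE_HOME/OPatch/opatch apply -local
$ORACLE_HOME/OPatch/opatch lsinventory | grep -i 18906063 (confirm the patch is registered in the inventory)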
3) Run catbundle in the RAC databases running from the patched homes.
su - oracle
. oraenv
export ORACLE_SID=<instance_name>
cd $ORACLE_HOME
sqlplus "/ as sysdba"
Check for invalid objects
======================================
column comp_name format a40
column version format a12
column status format a15
select comp_name,version,status from dba_registry;
---------------------------------------------------
column owner format a15
column object_name format a40
column object_type format a20
select owner, object_name, object_type from dba_objects where status='INVALID' order by object_type,owner,object_name;
If there are a lot of invalid objects, the next command lists only those whose owner contains SYS
-----------------------------------------------------------------------------------------------------------
select owner, object_name, object_type from dba_objects where status='INVALID' and owner like '%SYS%' order by object_type,owner,object_name;
---------------------------------------------------
@?/rdbms/admin/utlprp.sql 16
select comp_name,version,status from dba_registry;
------------------------------------------------
select capture_name from dba_capture where capture_name not like 'OGG$%';
select apply_name from dba_apply where apply_name not like 'OGG$%';
-------------------------------------------------------
select capture_name from dba_capture where capture_name not like 'OGG$%';
exec dbms_capture_adm.stop_capture('capture_name');
select apply_name from dba_apply where apply_name not like 'OGG$%';
exec dbms_apply_adm.stop_apply('apply_name');
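When a database has many Streams processes, a small PL/SQL block can stop them all in one go instead of issuing the calls one by one (a sketch, not part of the original runbook):
begin
  for c in (select capture_name from dba_capture where capture_name not like 'OGG$%') loop
    dbms_capture_adm.stop_capture(capture_name => c.capture_name);
  end loop;
  for a in (select apply_name from dba_apply where apply_name not like 'OGG$%') loop
    dbms_apply_adm.stop_apply(apply_name => a.apply_name);
  end loop;
end;
/
The same loop with start_capture / start_apply brings them back after catbundle, mirroring the statements further below.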
-------------------------------------------------------------
@?/rdbms/admin/catbundle.sql exa apply
@?/rdbms/admin/utlprp.sql 16
-------------------------------------------------------------
set lines 200
column owner format a15
column object_name format a40
column object_type format a20
col comp_name for a60
select comp_name,version,status from dba_registry;
select owner, object_name, object_type from dba_objects where status='INVALID' order by object_type,owner,object_name;
If there are a lot of invalid objects, the next command lists only those whose owner contains SYS
select owner, object_name, object_type from dba_objects where status='INVALID' and owner like '%SYS%' order by object_type,owner,object_name;
-----------------------------------------------------------
select capture_name from dba_capture where capture_name not like 'OGG$%';
exec dbms_capture_adm.start_capture('capture_name');
select apply_name from dba_apply where apply_name not like 'OGG$%';
exec dbms_apply_adm.start_apply('apply_name');
--------------------------------------------------------------
Check that the apply finished successfully
set lines 200
col ACTION_TIME for a40
col COMMENTS for a40
select * from dba_registry_history;
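A slightly more targeted variant, if only the Exadata bundle rows are of interest (a sketch; bundles applied with 'catbundle.sql exa apply' are recorded with bundle_series EXA):
select action_time, action, version, id, bundle_series, comments from dba_registry_history where bundle_series='EXA' order by action_time;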
4) Upgrade Image of Cell (Storage) Servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
#./patchmgr -cells cell_group -patch_check_prereq -rolling
The output of the above command should be clean for each cell node.
Check repair times for all mounted disk groups in the Oracle ASM instance and adjust if needed
========================================================================
su - oracle
. oraenv <<EOF
+ASM1
EOF
sqlplus / as sysasm
select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';
If the repair time is not 3.6 hours then note the value and the diskgroup names. Replace <diskgroup_name> in the following statement to adjust.
alter diskgroup <diskgroup_name> set attribute 'disk_repair_time'='3.6h'; ### set it to the higher side
Repeat the above statement for each diskgroup
2) Increase the ASM rebalance power via the asm_power_limit parameter (see the sketch below)
3) Check that no operation is in progress in V$ASM_OPERATION before starting the cell patching activity
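For items 2) and 3), a minimal sketch on the ASM instance (the power value 8 is only an example; pick what your I/O headroom allows):
alter system set asm_power_limit=8 scope=both sid='*';
select group_number, operation, state, power, est_minutes from v$asm_operation;
-- no rows selected means no rebalance/resync is running, so it is safe to start the cell patching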
Cell Patching in Rolling Upgrade (Initiate from DB Node, root user)
========================================================================
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ pwd
/u01/patches/CELL/patch_11.2.3.3.1.140708
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat cell_group
inblrdrceladm01
inblrdrceladm02
inblrdrceladm03
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat dbs_group
inblrdrdbadm01
inblrdrdbadm02
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat inblrdrceladm01
inblrdrceladm01
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat inblrdrceladm02
inblrdrceladm02
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat inblrdrceladm03
inblrdrceladm03
Cleanup space from any previous runs
==================================================
The -reset_force command is only needed the first time the cells are patched to this release.
It is not necessary for subsequent cell patching, even after rolling back the patch.
#./patchmgr -cells cell_group -reset_force (cell_group contains the cell server hostnames; alternatively, pass a file containing a single hostname)
OR
#./patchmgr -cells inblrdrceladm01 -reset_force (likewise for the inblrdrceladm02 / inblrdrceladm03 files)
-------------------------------------------------------------------------------------------------
Always use the -cleanup option before retrying a failed or halted run of the patchmgr utility.
#./patchmgr -cells cell_group -cleanup
OR
#./patchmgr -cells inblrdrceladm01 -cleanup (likewise for the inblrdrceladm02 / inblrdrceladm03 files)
Run the prerequisites check (the output should be clean)
=================================================================
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
#./patchmgr -cells cell_group -patch_check_prereq -rolling
OR
#./patchmgr -cells inblrdrceladm01 -patch_check_prereq -rolling (likewise for the inblrdrceladm02 / inblrdrceladm03 files)
Patch the cell nodes (in rolling upgrade)
===========================================
# nohup ./patchmgr -cells inblrdrceladm01 -patch -rolling & [likewise for the inblrdrceladm02 / inblrdrceladm03 files, but only after confirming with cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome that all grid disks on the previously patched cell are ONLINE, not resyncing]
SUCCESS: DONE: Execute plugin check for Patch Check Prereq.
1 of 5 :Working: DO: Initiate patch on cells. Cells will remain up. Up to 5 minutes ...
2 of 5 :Working: DO: Waiting to finish pre-reboot patch actions. Cells will remain up. Up to 45 minutes
3-5 of 5 :Working: DO: Finalize patch and check final status on cells. Cells will reboot.
Monitor the patch progress
===================================
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
tail -f nohup.out
Cleanup space
==================
#./patchmgr -cells cell_group -cleanup
OR
#./patchmgr -cells inblrdrceladm01 -cleanup (likewise for the inblrdrceladm02 / inblrdrceladm03 files)
Post Checks
=================
#imageinfo -version
#imageinfo -status
#uname -r
#imagehistory
#uptime
#dcli -l root -g /opt/oracle.SupportTools/onecommand/cell_group cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome|more
(The next 5 lines are all one command and should not return any output. If output is returned then disks are still resyncing.)
dcli -g cell_group -l root \
"cat /root/attempted_deactivated_by_patch_griddisks.txt | grep -v \
ACTIVATE | while read line; do str=\`cellcli -e list griddisk where \
name = \$line attributes name, status, asmmodestatus\`; echo \$str | \
grep -v \"active ONLINE\"; done"
Do not start patching another cell if you see disks resyncing; all disks should be ONLINE.
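A simple way to wait for the resync before moving to the next cell is a small polling loop from the DB node (a sketch; adjust the cell name per run):
while dcli -l root -c inblrdrceladm01 "cellcli -e list griddisk attributes name,asmmodestatus" | grep -v ONLINE; do
    echo "grid disks still resyncing on inblrdrceladm01, waiting ..."
    sleep 60
done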
Change disk_repair_time back to original value
==================================================================
su - oracle
. oraenv <<EOF
+ASM1
EOF
sqlplus / as sysasm
select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';
If the repair time is not 3.6 hours then note the value and the diskgroup names. Replace <diskgroup_name> in the following statement to adjust.
alter diskgroup <diskgroup_name> set attribute 'disk_repair_time'='<original value>';
Repeat the above statement for each diskgroup
exit
5) Apply patch for Infiniband (IB) Switch
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
vi ibswitches.lst
One switch per line. Spine switch listed first as below:
inblrdrsw-iba0
inblrdrsw-ibb0
./patchmgr -ibswitches ibswitches.lst -upgrade -ibswitch_precheck (prerequisite check)
./patchmgr -ibswitches ibswitches.lst -upgrade (actual upgrade)
The output should show SUCCESS. If there are errors, correct them and run the upgrade command again.
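After the upgrade, the firmware level reported by each switch can be double-checked (a sketch, assuming root SSH access to the switches and the standard switch 'version' command):
ssh root@inblrdrsw-iba0 version
ssh root@inblrdrsw-ibb0 version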
############################ Post Activities and Checks #############################
CellCLI> ALTER PHYSICALDISK 20:4 reenable force;
Physical disk 20:4 was reenabled.
CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=harddisk
20:0 R7K4ND normal
20:1 R7LPXD normal
20:2 R7P2ND normal
20:3 R7ESGD normal
20:4 R7H27D normal --- issue resolved
20:5 R7PK9D normal
20:6 R7GWJD normal
20:7 R7PL2D normal
20:8 R7DN1D normal
20:9 R7EASD normal
20:10 R748SD normal
20:11 R6X83D normal
Run the below command on ASM1
SQL> alter diskgroup DATA_DR add disk 'o/192.168.10.3/DATA_DR_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DATA_DR_CD_03_inblrdrceladm01' force;
Run the below commands on ASM2
SQL> alter diskgroup DBFS_DG add disk 'o/192.168.10.3/DBFS_DG_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DBFS_DG_CD_03_inblrdrceladm01' force;
SQL> alter diskgroup RECO_DR add disk 'o/192.168.10.3/RECO_DR_CD_03_inblrdrceladm01' force,'o/192.168.10.3/RECO_DR_CD_04_inblrdrceladm01' force;
[oracle@inblrdrdbadm01 ~]$ . ASM.env
[oracle@inblrdrdbadm01 ~]$
[oracle@inblrdrdbadm01 ~]$ sqlplus "/as sysasm"
SQL> alter diskgroup DATA_DR add disk 'o/192.168.10.3/DATA_DR_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DATA_DR_CD_03_inblrdrceladm01' force;
Diskgroup altered.
SQL> exit
[oracle@inblrdrdbadm01 ~]$ ssh inblrdrdbadm02
Last login: Thu Aug 14 11:13:52 2014 from inblrdrdbadm01.tajhotels.com
[oracle@inblrdrdbadm02 ~]$ . ASM.env
[oracle@inblrdrdbadm02 ~]$ sqlplus "/as sysasm"
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 14 19:24:13 2014
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> alter diskgroup DBFS_DG add disk 'o/192.168.10.3/DBFS_DG_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DBFS_DG_CD_03_inblrdrceladm01' force;
Diskgroup altered.
SQL> alter diskgroup RECO_DR add disk 'o/192.168.10.3/RECO_DR_CD_03_inblrdrceladm01' force,'o/192.168.10.3/RECO_DR_CD_04_inblrdrceladm01' force;
Diskgroup altered.
[root@inblrdrceladm01 ~]# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_DR_CD_00_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_01_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_02_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_03_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_04_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_05_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_02_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_03_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_04_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_05_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_00_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_01_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_02_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_03_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_04_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_05_inblrdrceladm01 active ONLINE Yes
# imageinfo
Kernel version: 2.6.39-400.128.17.el5uek #1 SMP Tue May 27 13:20:24 PDT 2014 x86_64
Cell version: OSS_11.2.3.3.1_LINUX.X64_140708
Cell rpm version: cell-11.2.3.3.1_LINUX.X64_140708-1
Active image version: 11.2.3.3.1.140708
Active image activated: 2014-08-14 13:03:50 +0530
Active image status: success
Active system partition on device: /dev/md6
Active software partition on device: /dev/md8
# imagehistory
Version : 11.2.3.2.1.130109
Image activation date : 2013-09-24 13:49:36 +0530
Imaging mode : fresh
Imaging status : success
Version : 11.2.3.3.1.140708
Image activation date : 2014-08-14 13:03:50 +0530
Imaging mode : out of partition upgrade
Imaging status : success
Finally, this marathon activity completed successfully in approximately 18.5 hours, mainly because of the rolling fashion and the 11 RDBMS Homes that needed BP24 followed by catbundle. In non-rolling fashion it would have taken roughly a third of that, about 6 hours, but at the cost of downtime, which wasn't possible. :)
Thanks & Have a Happy Reading - Manish