Tuesday, September 16, 2014

Oracle Exadata Patching

Recently I got a chance to perform an Exadata patching activity on our X3-2 quarter rack (2 compute nodes + 3 cell nodes) in coordination with Oracle. The activity consisted of:

1) Upgrade the image of the DB (compute) servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion; 11.2.3.3.1 is the most up-to-date image at the moment.

2) Apply Bundle Patch 24 (JUL 2014 - 11.2.0.3.24), i.e. the QCPE and QDPE patches, on the RAC Oracle Homes:

Patch description:  "QUARTERLY CRS PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) (18906063)"
Patch description:  "QUARTERLY DATABASE PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) : (18707883)"

3) Run catbundle in the RAC databases running from the patched Homes.

4) Upgrade the image of the cell (storage) servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion.

5) Apply the patch for the InfiniBand (IB) switches.

Obviously, these 5 steps took a lot of planning and prerequisite checks.
We went for the image upgrade because we had hit the following bug, which is resolved in the 11.2.3.3.1 image.

One of our disks was showing the status below.

Issue:

CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=harddisk
         20:0    R7K4ND  normal
         20:1    R7LPXD  normal
         20:2    R7P2ND  normal
         20:3    R7ESGD  normal
         20:4    R7H27D  warning - poor performance
         20:5    R7PK9D  normal
         20:6    R7GWJD  normal
         20:7    R7PL2D  normal
         20:8    R7DN1D  normal
         20:9    R7EASD  normal
         20:10   R748SD  normal
         20:11   R6X83D  normal

[root@inblrdrceladm01 ~]# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
         DATA_DR_CD_00_inblrdrceladm01   active                  ONLINE          Yes
         DATA_DR_CD_01_inblrdrceladm01   active                  ONLINE          Yes
         DATA_DR_CD_02_inblrdrceladm01   active                  ONLINE          Yes
         DATA_DR_CD_03_inblrdrceladm01   active                  DROPPED         Yes 
         DATA_DR_CD_04_inblrdrceladm01   proactive failure       DROPPED         Yes
         DATA_DR_CD_05_inblrdrceladm01   active                  ONLINE          Yes
         DBFS_DG_CD_02_inblrdrceladm01   active                  ONLINE          Yes
         DBFS_DG_CD_03_inblrdrceladm01   active                  DROPPED         Yes
         DBFS_DG_CD_04_inblrdrceladm01   proactive failure       DROPPED         Yes
         DBFS_DG_CD_05_inblrdrceladm01   active                  ONLINE          Yes
         RECO_DR_CD_00_inblrdrceladm01   active                  ONLINE          Yes
         RECO_DR_CD_01_inblrdrceladm01   active                  ONLINE          Yes
         RECO_DR_CD_02_inblrdrceladm01   active                  ONLINE          Yes
         RECO_DR_CD_03_inblrdrceladm01   active                  DROPPED         Yes
         RECO_DR_CD_04_inblrdrceladm01   proactive failure       DROPPED         Yes
         RECO_DR_CD_05_inblrdrceladm01   active                  ONLINE          Yes

CellCLI> list cell detail

         cellsrvStatus:          stopped
         msStatus:               running
         rsStatus:               running

Because of this, the cellsrv service kept stopping automatically on the first cell node.
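As a stopgap, cellsrv can be started again from CellCLI with the standard command below, although with this bug in play it may not stay up until the firmware is fixed:

CellCLI> ALTER CELL STARTUP SERVICES CELLSRV
CellCLI> LIST CELL DETAIL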

Cause (which we found after raising an SR with Oracle Support):

We were hitting Bug 17021128: NIWOT "CHIP PAUSED" CAUSES HIGH SERVICE TIME ON ALL DRIVES.

This affects storage servers where the LSI MegaRAID firmware is below 12.12.0-0178. It has been observed primarily on systems running Exadata Storage Server Software 11.2.3.2.0 or 11.2.3.2.1 with LSI MegaRAID firmware 12.12.0-0140, which is the firmware version our cells were on.

Further evidence of this is the "Chip 0 Paused" messages in the MegaCli firmware log, found during the SR investigation:

06/26/14 8:34:06: [9e]= 1 [a0]= f [a2]= 9
06/26/14 8:37:59: DM: Chip 0 Paused
06/26/14 8:37:59: Chip <0> Slots: Cur=[133]
06/26/14 8:37:59: [87]= 3 [89]= b [8b]= f [8e]= f [90]=14
06/26/14 8:41:11: DM: Chip 0 Paused
06/26/14 8:41:11: Chip <0> Slots: Cur=[69]
06/26/14 8:41:11: [47]= 1 [4a]= c [4c]=17 [4e]= e [50]= e
06/26/14 8:41:16: DM: Chip 0 Paused
06/26/14 8:41:16: Chip <0> Slots: Cur=[74]
06/26/14 8:41:16: [4c]= 1 [4e]= e [50]= e
06/26/14 8:43:23: DM: Chip 0 Paused
06/26/14 8:43:23: Chip <0> Slots: Cur=[201]

To resolve this issue, the following is recommended: 

Install LSI MegaRAID firmware version 12.12.0-0178, which is included in Exadata Storage Server Software 11.2.3.2.2 and 11.2.3.3.0 (and later images; we went to 11.2.3.3.1).
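Before planning the upgrade, the current firmware level on a cell can be confirmed with the MegaCli utility shipped on the storage servers (the path below is the usual Exadata location; look for the "FW Package Build" line in the output):

# /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -i "FW Package"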

Solution (remember, this is the rolling method, so there is no need to bring down EBS, Siebel, Hyperion or any other application):

1) Upgrade the image of the DB servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion (dbnodeupdate.sh is run as root on each compute node, one node at a time)

./dbnodeupdate.sh -u -l /u01/patches/YUM/p18876946_112331_Linux-x86-64.zip -v   (prerequisite check only)

./dbnodeupdate.sh -u -l /u01/patches/YUM/p18876946_112331_Linux-x86-64.zip -b   (for backup)

./dbnodeupdate.sh -u -l /u01/patches/YUM/p18876946_112331_Linux-x86-64.zip -n   (for execution; -n skips the backup, which was already taken with -b)

./dbnodeupdate.sh -c   (completion steps, after patching and the node reboot)
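Once -c completes and the node is back up, it is worth confirming the new image before moving on to the second compute node; for example (dbs_group is the host-list file shown later in this post):

# imageinfo -version
# dcli -g dbs_group -l root imageinfo -version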

2) Apply Bundle Patch 24 (JUL 2014 - 11.2.0.3.24) on the RAC Oracle Homes (manual method)

Patch the ASM/Grid Home first, followed by the rest of the RDBMS RAC Homes. (We have 11 RDBMS Homes for different applications.) The per-node flow is listed below, with a hedged command-level sketch after the list.

Stop the EM agents (if running)

Check for conflicting patches and roll them back, if any

Run the prepatch script

Apply QUARTERLY CRS PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) (18906063) & QUARTERLY DATABASE PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) (18707883 - required only on the DB Homes, not on the Grid Home)

Run the postpatch script

Re-apply any patches that were rolled back for conflicts

Start CRS with rootcrs.pl -patch

crsctl check crs

Start both EM agents

crsctl stat res -t
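Below is a minimal, hedged sketch of that flow for a single node. The staging path /u01/patches/BP24 is an illustrative assumption, and the prepatch/postpatch scripts plus the exact opatch sub-directories to apply come from the patch README, so treat this as an outline rather than a copy-paste procedure.

$ emctl stop agent                                    (as the Home owner, if an agent is running)

$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/patches/BP24/18906063
$ $ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/patches/BP24/18707883

# $GRID_HOME/crs/install/rootcrs.pl -unlock           (as root, prepares the Grid Home for patching)
$ $GRID_HOME/OPatch/opatch napply /u01/patches/BP24/18906063 -oh $GRID_HOME -local

$ $ORACLE_HOME/OPatch/opatch napply /u01/patches/BP24/18906063 -oh $ORACLE_HOME -local
$ $ORACLE_HOME/OPatch/opatch napply /u01/patches/BP24/18707883 -oh $ORACLE_HOME -local   (DB Homes only)

# $GRID_HOME/crs/install/rootcrs.pl -patch            (as root, re-locks the Grid Home and restarts the stack)
# crsctl check crs
# crsctl stat res -t

$ emctl start agent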

3) Run catbundle in the RAC databases running from the patched Homes.

su - oracle
. oraenv 
export ORACLE_SID=<instance_name>
cd $ORACLE_HOME
sqlplus "/ as sysdba"

Check for invalid objects
======================================
column comp_name format a40
column version format a12
column status format a15
select comp_name,version,status from dba_registry;
---------------------------------------------------
column owner format a15
column object_name format a40
column object_type format a20
select owner, object_name, object_type from dba_objects where status='INVALID' order by object_type,owner,object_name;

If there are a lot of invalid objects, the next command lists only those whose owner contains SYS:
-----------------------------------------------------------------------------------------------------------
select owner, object_name, object_type from dba_objects where status='INVALID' and owner like '%SYS%' order by object_type,owner,object_name;

---------------------------------------------------
@?/rdbms/admin/utlprp.sql 16
select comp_name,version,status from dba_registry;

------------------------------------------------
Check for any Streams capture and apply processes (excluding the GoldenGate-created ones):

select capture_name from dba_capture where capture_name not like 'OGG$%';

select apply_name from dba_apply where apply_name not like 'OGG$%';

-------------------------------------------------------
Stop them before running catbundle (substitute the names returned by the queries above):

select capture_name from dba_capture where capture_name not like 'OGG$%';
exec dbms_capture_adm.stop_capture('<capture_name>');

select apply_name from dba_apply where apply_name not like 'OGG$%';
exec dbms_apply_adm.stop_apply('<apply_name>');
-------------------------------------------------------------
@?/rdbms/admin/catbundle.sql exa apply

@?/rdbms/admin/utlprp.sql 16
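catbundle writes its logs under $ORACLE_BASE/cfgtoollogs/catbundle using the standard catbundle_EXA_<SID>_APPLY_<timestamp>.log naming; a quick scan for errors after the run, for example:

ls -ltr $ORACLE_BASE/cfgtoollogs/catbundle/
grep -i "ORA-\|error" $ORACLE_BASE/cfgtoollogs/catbundle/catbundle_EXA_*_APPLY_*.log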

-------------------------------------------------------------
set lines 200
column owner format a15
column object_name format a40
column object_type format a20
col comp_name for a60
select comp_name,version,status from dba_registry;

select owner, object_name, object_type from dba_objects where status='INVALID' order by object_type,owner,object_name;

If there are a lot of invalid objects, the next command lists only those whose owner contains SYS:

select owner, object_name, object_type from dba_objects where status='INVALID' and owner like '%SYS%' order by object_type,owner,object_name;

-----------------------------------------------------------
Restart the capture and apply processes that were stopped earlier:

select capture_name from dba_capture where capture_name not like 'OGG$%';
exec dbms_capture_adm.start_capture('<capture_name>');

select apply_name from dba_apply where apply_name not like 'OGG$%';
exec dbms_apply_adm.start_apply('<apply_name>');

--------------------------------------------------------------
Check that the apply finished successfully

set lines 200
col ACTION_TIME for a40
col COMMENTS for a40
select * from dba_registry_history;

4) Upgrade Image of Cell (Storage) Servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion

cd /u01/patches/CELL/patch_11.2.3.3.1.140708
#./patchmgr -cells cell_group -patch_check_prereq -rolling

The output of the above command should be clean for each cell node.

Check repair times for all mounted disk groups in the Oracle ASM instance and adjust if needed
========================================================================
su - oracle
. oraenv <<EOF 
+ASM1
EOF
sqlplus / as sysasm
select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';

If the repair time is not 3.6 hours then note the value and the diskgroup names. Replace <diskgroup_name> in the following statement to adjust.

alter diskgroup <diskgroup_name> set attribute 'disk_repair_time'='3.6h';   ### use a generous value (at least 3.6h) for the duration of patching

Repeat the above statement for each diskgroup.

Two more pre-checks before starting the cell patching activity (a quick example of both follows):
2) Increase the ASM rebalance power via the asm_power_limit parameter, if required
3) Check that no rebalance operation is currently running in V$ASM_OPERATION
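For example, from the ASM instance (the power value of 4 is only an illustration; pick what suits your environment and rebalance window):

SQL> show parameter asm_power_limit
SQL> alter system set asm_power_limit=4 scope=both sid='*';
SQL> select * from gv$asm_operation;     -- should return no rows before patching a cell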

Cell Patching in Rolling Upgrade  (Initiate from DB Node, root user)
========================================================================

[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ pwd
/u01/patches/CELL/patch_11.2.3.3.1.140708

[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat cell_group
inblrdrceladm01
inblrdrceladm02
inblrdrceladm03

[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat dbs_group
inblrdrdbadm01
inblrdrdbadm02

[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat inblrdrceladm01
inblrdrceladm01

[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat inblrdrceladm02
inblrdrceladm02

[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat inblrdrceladm03
inblrdrceladm03

Cleanup space from any previous runs
==================================================
The -reset_force command is only needed the first time the cells are patched to this release.
It is not necessary for subsequent cell patching, even after rolling back the patch.

#./patchmgr -cells cell_group -reset_force   (cell_group contains the cell server hostnames; alternatively, pass a file containing a single hostname)

OR

#./patchmgr -cells inblrdrceladm01 -reset_force  (likewise for the inblrdrceladm02 / inblrdrceladm03 files)

-------------------------------------------------------------------------------------------------
Always use the -cleanup option before retrying a failed or halted run of the patchmgr utility.

#./patchmgr -cells cell_group -cleanup

OR

#./patchmgr -cells inblrdrceladm01 -cleanup    (likewise for the inblrdrceladm02 / inblrdrceladm03 files)

Run the prerequisites check   (the output should be clean)
=================================================================
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
#./patchmgr -cells cell_group -patch_check_prereq -rolling

OR

#./patchmgr -cells inblrdrceladm01 -patch_check_prereq -rolling    (likewise for the inblrdrceladm02 / inblrdrceladm03 files)

Patch the cell nodes (in rolling upgrade)
===========================================

# nohup ./patchmgr -cells inblrdrceladm01 -patch -rolling &

(Repeat for the inblrdrceladm02 / inblrdrceladm03 files, but only after checking that cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome shows every grid disk ONLINE, not still resyncing, on the cell node just patched.)

SUCCESS: DONE: Execute plugin check for Patch Check Prereq.
1 of 5 :Working: DO: Initiate patch on cells. Cells will remain up. Up to 5 minutes ...
2 of 5 :Working: DO: Waiting to finish pre-reboot patch actions. Cells will remain up. Up to 45 minutes
3-5 of 5 :Working: DO: Finalize patch and check final status on cells. Cells will reboot.

Monitor the patch progress
===================================
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
tail -f nohup.out

Cleanup space
==================
#./patchmgr -cells cell_group -cleanup

OR

#./patchmgr -cells inblrdrceladm01 -cleanup    (likewise for the inblrdrceladm02 / inblrdrceladm03 files)

Post Checks
=================
#imageinfo -version       
#imageinfo -status       
#uname -r     
#imagehistory
#uptime

#dcli -l root -g /opt/oracle.SupportTools/onecommand/cell_group cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome|more

(The next 5 lines are all one command and should not return any output. If output is returned then disks are still resyncing.)

dcli -g cell_group -l root \
"cat /root/attempted_deactivated_by_patch_griddisks.txt | grep -v \
ACTIVATE | while read line; do str=\`cellcli -e list griddisk where \
name = \$line attributes name, status, asmmodestatus\`; echo \$str | \
grep -v \"active ONLINE\"; done" 

Do not start patching the next cell while disks are still resyncing; all grid disks should be ONLINE.

Change disk_repair_time back to original value 
==================================================================
su - oracle
. oraenv <<EOF 
+ASM1
EOF
sqlplus / as sysasm
select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';
Restore the value noted earlier for each diskgroup. Replace <diskgroup_name> and <original value> in the following statement.
alter diskgroup <diskgroup_name> set attribute 'disk_repair_time'='<original value>';
Repeat the above statement for each diskgroup.
exit

5) Apply patch for Infiniband (IB) Switch

cd /u01/patches/CELL/patch_11.2.3.3.1.140708

vi ibswitches.lst

List one switch per line, with the spine switch first, as below:
inblrdrsw-iba0
inblrdrsw-ibb0

./patchmgr -ibswitches ibswitches.lst -upgrade -ibswitch_precheck (Pre-requisites Check)

./patchmgr -ibswitches ibswitches.lst -upgrade (Actual Upgrade)


The output should show SUCCESS. If there are errors, then correct the errors and run the upgrade command again.
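Once the upgrade reports SUCCESS, the firmware level on each switch can be double-checked over ssh (the version command on the switch CLI prints the SUN DCS firmware release; if it is not available on your switch image, the patchmgr output itself reports the final version):

# ssh root@inblrdrsw-iba0 version
# ssh root@inblrdrsw-ibb0 version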

############################ Post Activities and Checks #############################

CellCLI> ALTER PHYSICALDISK 20:4 reenable force; 
Physical disk 20:4 was reenabled. 

CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=harddisk 
20:0 R7K4ND normal 
20:1 R7LPXD normal 
20:2 R7P2ND normal 
20:3 R7ESGD normal 
20:4 R7H27D normal    --- issue resolved
20:5 R7PK9D normal 
20:6 R7GWJD normal 
20:7 R7PL2D normal 
20:8 R7DN1D normal 
20:9 R7EASD normal 
20:10 R748SD normal 
20:11 R6X83D normal 

Run the below command on ASM1 

SQL> alter diskgroup DATA_DR add disk 'o/192.168.10.3/DATA_DR_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DATA_DR_CD_03_inblrdrceladm01' force; 

Run the below command on ASM2 

SQL> alter diskgroup DBFS_DG add disk 'o/192.168.10.3/DBFS_DG_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DBFS_DG_CD_03_inblrdrceladm01' force; 
SQL> alter diskgroup RECO_DR add disk 'o/192.168.10.3/RECO_DR_CD_03_inblrdrceladm01' force,'o/192.168.10.3/RECO_DR_CD_04_inblrdrceladm01' force; 

[oracle@inblrdrdbadm01 ~]$ . ASM.env
[oracle@inblrdrdbadm01 ~]$
[oracle@inblrdrdbadm01 ~]$ sqlplus "/as sysasm"

SQL> alter diskgroup DATA_DR add disk 'o/192.168.10.3/DATA_DR_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DATA_DR_CD_03_inblrdrceladm01' force;

Diskgroup altered.

SQL> exit

[oracle@inblrdrdbadm01 ~]$ ssh inblrdrdbadm02
Last login: Thu Aug 14 11:13:52 2014 from inblrdrdbadm01.tajhotels.com
[oracle@inblrdrdbadm02 ~]$ . ASM.env
[oracle@inblrdrdbadm02 ~]$ sqlplus "/as sysasm"

SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 14 19:24:13 2014

Copyright (c) 1982, 2011, Oracle.  All rights reserved.


Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options

SQL> alter diskgroup DBFS_DG add disk 'o/192.168.10.3/DBFS_DG_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DBFS_DG_CD_03_inblrdrceladm01' force;

Diskgroup altered.

SQL> alter diskgroup RECO_DR add disk 'o/192.168.10.3/RECO_DR_CD_03_inblrdrceladm01' force,'o/192.168.10.3/RECO_DR_CD_04_inblrdrceladm01' force;

Diskgroup altered.
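Adding the disks back with FORCE kicks off an ASM rebalance; it can be watched from either ASM instance until no rows are returned, for example:

SQL> select inst_id, operation, state, power, sofar, est_work, est_minutes from gv$asm_operation;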

[root@inblrdrceladm01 ~]# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
         DATA_DR_CD_00_inblrdrceladm01   active  ONLINE  Yes
         DATA_DR_CD_01_inblrdrceladm01   active  ONLINE  Yes
         DATA_DR_CD_02_inblrdrceladm01   active  ONLINE  Yes
         DATA_DR_CD_03_inblrdrceladm01   active  ONLINE  Yes
         DATA_DR_CD_04_inblrdrceladm01   active  ONLINE  Yes
         DATA_DR_CD_05_inblrdrceladm01   active  ONLINE  Yes
         DBFS_DG_CD_02_inblrdrceladm01   active  ONLINE  Yes
         DBFS_DG_CD_03_inblrdrceladm01   active  ONLINE  Yes
         DBFS_DG_CD_04_inblrdrceladm01   active  ONLINE  Yes
         DBFS_DG_CD_05_inblrdrceladm01   active  ONLINE  Yes
         RECO_DR_CD_00_inblrdrceladm01   active  ONLINE  Yes
         RECO_DR_CD_01_inblrdrceladm01   active  ONLINE  Yes
         RECO_DR_CD_02_inblrdrceladm01   active  ONLINE  Yes
         RECO_DR_CD_03_inblrdrceladm01   active  ONLINE  Yes
         RECO_DR_CD_04_inblrdrceladm01   active  ONLINE  Yes
         RECO_DR_CD_05_inblrdrceladm01   active  ONLINE  Yes

# imageinfo

Kernel version: 2.6.39-400.128.17.el5uek #1 SMP Tue May 27 13:20:24 PDT 2014 x86_64
Cell version: OSS_11.2.3.3.1_LINUX.X64_140708
Cell rpm version: cell-11.2.3.3.1_LINUX.X64_140708-1

Active image version: 11.2.3.3.1.140708
Active image activated: 2014-08-14 13:03:50 +0530
Active image status: success
Active system partition on device: /dev/md6
Active software partition on device: /dev/md8

# imagehistory
Version                              : 11.2.3.2.1.130109
Image activation date                : 2013-09-24 13:49:36 +0530
Imaging mode                         : fresh
Imaging status                       : success

Version                              : 11.2.3.3.1.140708
Image activation date                : 2014-08-14 13:03:50 +0530
Imaging mode                         : out of partition upgrade
Imaging status                       : success


Finally, this marathon activity completed successfully in approximately 18.5 hours, since it was done in rolling fashion and we had 11 RDBMS Homes to patch with BP24, each followed by catbundle. In non-rolling fashion it would have taken about a third of that, roughly 6 hours, but at the cost of downtime, which wasn't possible. :)

Thanks & happy reading - Manish
