Recently, I got a chance to perform an Exadata Patching Activity on our X3-2 Quarter Rack Exadata Box (2 Compute Nodes + 3 Cell Nodes) in co-ordination with Oracle, which consisted of:
1) Upgrade the image of the DB Servers from 11.2.3.2.1 to 11.2.3.3.1 (the most up-to-date image at the time of writing) in rolling fashion.
2) Apply Bundle Patch 24 (JUL 2014 - 11.2.0.3.24) for QCPE & QDPE On RAC Oracle Homes
Patch description: "QUARTERLY CRS PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) (18906063)"
Patch description: "QUARTERLY DATABASE PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) : (18707883)"
3) Run catbundle in the RAC databases running from the patched Homes.
4) Upgrade Image of Cell (Storage) Servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion
5) Apply patch for Infiniband (IB) Switch
Naturally, these 5 steps took a lot of planning and prerequisite checks.
The reason we went for the image upgrade is that we had hit the following bug, which is resolved in the 11.2.3.3.1 image.
One of our disks was showing the status below.
Issue:
CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=harddisk
20:0 R7K4ND normal
20:1 R7LPXD normal
20:2 R7P2ND normal
20:3 R7ESGD normal
20:4 R7H27D warning - poor performance
20:5 R7PK9D normal
20:6 R7GWJD normal
20:7 R7PL2D normal
20:8 R7DN1D normal
20:9 R7EASD normal
20:10 R748SD normal
20:11 R6X83D normal
[root@inblrdrceladm01 ~]# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_DR_CD_00_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_01_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_02_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_03_inblrdrceladm01 active DROPPED Yes
DATA_DR_CD_04_inblrdrceladm01 proactive failure DROPPED Yes
DATA_DR_CD_05_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_02_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_03_inblrdrceladm01 active DROPPED Yes
DBFS_DG_CD_04_inblrdrceladm01 proactive failure DROPPED Yes
DBFS_DG_CD_05_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_00_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_01_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_02_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_03_inblrdrceladm01 active DROPPED Yes
RECO_DR_CD_04_inblrdrceladm01 proactive failure DROPPED Yes
RECO_DR_CD_05_inblrdrceladm01 active ONLINE Yes
CellCLI> list cell detail
cellsrvStatus: stopped
msStatus: running
rsStatus: running
cellsrv was stopping automatically on the 1st cell node because of this disk issue.
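To see why cellsrv kept stopping, the cell's alert history is a good first place to look (a generic check; the exact alert text will vary per system):
CellCLI> list alerthistory
(look for critical alerts referencing the affected physical disk and high service times)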
Cause (found after raising an SR with Oracle Support):
We were hitting Bug:17021128 : NIWOT "CHIP PAUSED" CAUSES HIGH SERVICE TIME ON ALL DRIVES
The bug affects storage servers where the LSI MegaRaid firmware is below 12.12.0-0178. It has been observed primarily on systems running Exadata Storage Software 11.2.3.2.0 or 11.2.3.2.1 with LSI MegaRaid firmware 12.1.2.0-0140, which is the firmware version we were on.
Further evidence of this was the "DM: Chip 0 Paused" messages in the MegaCli firmware log, found during the investigation:
06/26/14 8:34:06: [9e]= 1 [a0]= f [a2]= 9
06/26/14 8:37:59: DM: Chip 0 Paused
06/26/14 8:37:59: Chip <0> Slots: Cur=[133]
06/26/14 8:37:59: [87]= 3 [89]= b [8b]= f [8e]= f [90]=14
06/26/14 8:41:11: DM: Chip 0 Paused
06/26/14 8:41:11: Chip <0> Slots: Cur=[69]
06/26/14 8:41:11: [47]= 1 [4a]= c [4c]=17 [4e]= e [50]= e
06/26/14 8:41:16: DM: Chip 0 Paused
06/26/14 8:41:16: Chip <0> Slots: Cur=[74]
06/26/14 8:41:16: [4c]= 1 [4e]= e [50]= e
06/26/14 8:43:23: DM: Chip 0 Paused
06/26/14 8:43:23: Chip <0> Slots: Cur=[201]
To resolve this issue, the following is recommended:
Install LSI MegaRaid firmware version 12.12.0-0178, which is included in Exadata Storage Software 11.2.3.2.2 and 11.2.3.3.0 onwards.
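The running MegaRaid firmware level on a cell can be confirmed directly with MegaCli before and after the upgrade (a quick check we found handy; /opt/MegaRAID/MegaCli/MegaCli64 is the path as typically shipped on the cells):
# /opt/MegaRAID/MegaCli/MegaCli64 -AdpAllInfo -aALL | grep -i "FW Package Build"
(should report 12.12.0-0178 after moving to the new image, per the recommendation above)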
Solution (remember this is the rolling method, so there is no need to bring down EBS, Siebel, Hyperion or any other application):
1) Upgrade Image of DB Servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion
./dbnodeupdate.sh -u -l /u01/patches/YUM/p18876946_112331_Linux-x86-64.zip -v (for pre-requisites verification only)
./dbnodeupdate.sh -u -l /u01/patches/YUM/p18876946_112331_Linux-x86-64.zip -b (for backup)
./dbnodeupdate.sh -u -l /u01/patches/YUM/p18876946_112331_Linux-x86-64.zip -n (for execution)
./dbnodeupdate.sh -c (after patching and node reboot)
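Once a DB node is updated and rebooted, a quick way to confirm the new image across both compute nodes (a minimal check, assuming root ssh equivalence and the dbs_group file shown later in this post):
# dcli -g dbs_group -l root "imageinfo -version" (every DB node should report 11.2.3.3.1.140708)
# dcli -g dbs_group -l root "imageinfo -status" (should report success)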
2) Apply Bundle Patch 24 (JUL 2014 - 11.2.0.3.24) On RAC Oracle Homes (manual method)
On the ASM/Grid Home first, followed by the rest of the RDBMS RAC Homes. (We have 11 RDBMS Homes for different applications.)
Stop the EM agents (if running)
Check for conflicting patches and roll them back, if any
Run the prepatch script
Apply QUARTERLY CRS PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) (18906063) & QUARTERLY DATABASE PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24) (18707883 - required only on the DB Homes, not on the Grid Home)
Run the postpatch script
Re-apply the previously rolled-back conflicting patches (compatible versions), if any
Start CRS with rootcrs.pl -patch
crsctl check crs
Start Both EM Agents
crsctl stat res -t
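As an illustration of the conflict check and the per-home apply (a sketch only; the authoritative steps are in the patch READMEs, and the staging path /u01/patches/BP24 used here is hypothetical):
$ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/patches/BP24/18906063 -oh $ORACLE_HOME
$ORACLE_HOME/OPatch/opatch prereq CheckConflictAgainstOHWithDetail -phBaseDir /u01/patches/BP24/18707883 -oh $ORACLE_HOME
cd /u01/patches/BP24/18906063
$ORACLE_HOME/OPatch/opatch apply -local (one node at a time, to stay rolling)
$ORACLE_HOME/OPatch/opatch lsinventory | grep -i -e 18906063 -e 18707883 (verify after apply)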
3) Run Catbundle in above patch applied RAC Databases.
su - oracle
. oraenv
export ORACLE_SID=<instance_name>
cd $ORACLE_HOME
sqlplus "/ as sysdba"
Check for invalid objects
======================================
column comp_name format a40
column version format a12
column status format a15
select comp_name,version,status from dba_registry;
---------------------------------------------------
column owner format a15
column object_name format a40
column object_type format a20
select owner, object_name, object_type from dba_objects where status='INVALID' order by object_type,owner,object_name;
If there are a lot of invalids, this next command will list only the invalids containing SYS in the owner
-----------------------------------------------------------------------------------------------------------
select owner, object_name, object_type from dba_objects where status='INVALID' and owner like '%SYS%' order by object_type,owner,object_name;
---------------------------------------------------
@?/rdbms/admin/utlprp.sql 16
select comp_name,version,status from dba_registry;
------------------------------------------------
select capture_name from dba_capture where capture_name not like 'OGG$%';
select apply_name from dba_apply where apply_name not like 'OGG$%';
-------------------------------------------------------
select capture_name from dba_capture where capture_name not like 'OGG$%';
exec dbms_capture_adm.stop_capture('capture_name');
select apply_name from dba_apply where apply_name not like 'OGG$%';
exec dbms_apply_adm.stop_apply('apply_name');
-------------------------------------------------------------
@?/rdbms/admin/catbundle.sql exa apply
@?/rdbms/admin/utlprp.sql 16
-------------------------------------------------------------
set lines 200
column owner format a15
column object_name format a40
column object_type format a20
col comp_name for a60
select comp_name,version,status from dba_registry;
select owner, object_name, object_type from dba_objects where status='INVALID' order by object_type,owner,object_name;
If there are a lot of invalids, this next command will list only the invalids containing SYS in the owner
select owner, object_name, object_type from dba_objects where status='INVALID' and owner like '%SYS%' order by object_type,owner,object_name;
-----------------------------------------------------------
select capture_name from dba_capture where capture_name not like 'OGG$%';
exec dbms_capture_adm.start_capture('capture_name');
select apply_name from dba_apply where apply_name not like 'OGG$%';
exec dbms_apply_adm.start_apply('apply_name');
--------------------------------------------------------------
Check that the apply finished successfully
set lines 200
col ACTION_TIME for a40
col COMMENTS for a40
select * from dba_registry_history;
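To confirm the bundle registration specifically, the history can also be filtered on the bundle columns (the expected values noted below are assumptions based on this being Exadata BP24):
col BUNDLE_SERIES for a15
select action_time, action, version, id, bundle_series, comments from dba_registry_history order by action_time;
(the latest row should show ACTION = 'APPLY' with BUNDLE_SERIES = 'EXA' and ID = 24)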
4) Upgrade Image of Cell (Storage) Servers from 11.2.3.2.1 to 11.2.3.3.1 in rolling fashion
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
#./patchmgr -cells cell_group -patch_check_prereq -rolling
The output of the above command should be clean for each Cell Node.
Check repair times for all mounted disk groups in the Oracle ASM instance and adjust if needed
========================================================================
su - oracle
. oraenv <<EOF
+ASM1
EOF
sqlplus / as sysasm
select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';
If the repair time is not 3.6 hours, note the current value and the diskgroup names, then replace <diskgroup_name> in the following statement to adjust it.
alter diskgroup <diskgroup_name> set attribute 'disk_repair_time'='3.6h'; ### set it to the higher side for the patching window
Repeat the above statement for each diskgroup (a generator query for all diskgroups is sketched just after this checklist).
2) Increase the ASM rebalance power via the asm_power_limit parameter
3) Check that no operation is in progress in V$ASM_OPERATION before starting the Cell Patching Activity
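A SQL sketch of these three pre-checks, run as sysasm on +ASM1 (the 8.5h repair time and power limit of 4 are example values only; adjust to your environment):
-- generates one ALTER statement per diskgroup:
select 'alter diskgroup ' || dg.name || ' set attribute ''disk_repair_time''=''8.5h'';' from v$asm_diskgroup dg;
-- raise the rebalance power for the patching window (example value):
alter system set asm_power_limit = 4 scope = memory sid = '*';
-- should return no rows before cell patching starts:
select * from gv$asm_operation;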
Cell Patching in Rolling Upgrade (Initiate from DB Node, root user)
========================================================================
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ pwd
/u01/patches/CELL/patch_11.2.3.3.1.140708
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat cell_group
inblrdrceladm01
inblrdrceladm02
inblrdrceladm03
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat dbs_group
inblrdrdbadm01
inblrdrdbadm02
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat inblrdrceladm01
inblrdrceladm01
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat inblrdrceladm02
inblrdrceladm02
[oracle@inblrdrdbadm01 patch_11.2.3.3.1.140708]$ cat inblrdrceladm03
inblrdrceladm03
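patchmgr is driven from the DB node as root and needs root ssh equivalence to every cell; a quick way to confirm it against the group file above (a minimal check):
# dcli -g cell_group -l root hostname (should return each cell hostname without prompting for a password)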
Cleanup space from any previous runs
==================================================
The -reset_force command is only needed the first time the cells are patched to this release.
It is not necessary to use the command for subsequent cell patching, even after rolling back the patch.
#./patchmgr -cells cell_group -reset_force (cell_group consist of cell servers hostname, or you can give single hostname file name)
OR
#./patchmgr -cells inblrdrceladm01 -reset_force (Same way inblrdrceladm02 /inblrdrceladm03 file)
-------------------------------------------------------------------------------------------------
Always use the -cleanup option before retrying a failed or halted run of the patchmgr utility.
#./patchmgr -cells cell_group -cleanup
OR
#./patchmgr -cells inblrdrceladm01 -cleanup (Same way inblrdrceladm02 /inblrdrceladm03 file)
Run prerequisites check (The output Should be Clean)
=================================================================
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
#./patchmgr -cells cell_group -patch_check_prereq -rolling
OR
#./patchmgr -cells inblrdrceladm01 -patch_check_prereq -rolling (Same way inblrdrceladm02 /inblrdrceladm03 file)
Patch the cell nodes (in rolling upgrade)
===========================================
# nohup ./patchmgr -cells inblrdrceladm01 -patch -rolling & (same way for the inblrdrceladm02 / inblrdrceladm03 files, but only after checking with cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome that all grid disks on the previously patched cell are ONLINE, not still resyncing)
SUCCESS: DONE: Execute plugin check for Patch Check Prereq.
1 of 5 :Working: DO: Initiate patch on cells. Cells will remain up. Up to 5 minutes ...
2 of 5 :Working: DO: Waiting to finish pre-reboot patch actions. Cells will remain up. Up to 45 minutes
3-5 of 5 :Working: DO: Finalize patch and check final status on cells. Cells will reboot.
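Between cells, a small watch loop saves the manual re-checking described above (a sketch only, run as root from the DB node; change the cell name for each run):
# loop until every grid disk on the just-patched cell reports asmmodestatus ONLINE
while ssh root@inblrdrceladm01 "cellcli -e list griddisk attributes name,asmmodestatus" | grep -v ONLINE | grep -q . ; do
  echo "Grid disks still resyncing on inblrdrceladm01, re-checking in 60 seconds..."
  sleep 60
done
echo "All grid disks ONLINE - safe to start patching the next cell"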
Monitor the patch progress
===================================
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
tail -f nohup.out
Cleanup space
==================
#./patchmgr -cells cell_group -cleanup
OR
#./patchmgr -cells inblrdrceladm01 -cleanup (Same way inblrdrceladm02 /inblrdrceladm03 file)
Post Checks
=================
#imageinfo -version
#imageinfo -status
#uname -r
#imagehistory
#uptime
#dcli -l root -g /opt/oracle.SupportTools/onecommand/cell_group cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome|more
(The next 5 lines are all one command and should not return any output. If output is returned then disks are still resyncing.)
dcli -g cell_group -l root \
"cat /root/attempted_deactivated_by_patch_griddisks.txt | grep -v \
ACTIVATE | while read line; do str=\`cellcli -e list griddisk where \
name = \$line attributes name, status, asmmodestatus\`; echo \$str | \
grep -v \"active ONLINE\"; done"
Do not start patching the next cell while disks are still resyncing; all grid disks should be ONLINE first.
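The per-cell image checks above can also be run across all cells in one shot from the DB node (a convenience only, using the same cell_group file):
# dcli -g cell_group -l root "imageinfo -version; imageinfo -status" (every cell should report 11.2.3.3.1.140708 and success)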
Change disk_repair_time back to original value
==================================================================
su - oracle
. oraenv <<EOF
+ASM1
EOF
sqlplus / as sysasm
select dg.name,a.value from v$asm_diskgroup dg, v$asm_attribute a where dg.group_number=a.group_number and a.name='disk_repair_time';
If the repair time is not 3.6 hours then note the value and the diskgroup names. Replace <diskgroup_name> in the following statement to adjust.
alter diskgroup <diskgroup_name> set attribute 'disk_repair_time'='<original value>';
Repeat the above statement for each diskgroup
exit
5) Apply patch for Infiniband (IB) Switch
cd /u01/patches/CELL/patch_11.2.3.3.1.140708
vi ibswitches.lst
One switch per line. Spine switch listed first as below:
inblrdrsw-iba0
inblrdrsw-ibb0
./patchmgr -ibswitches ibswitches.lst -upgrade -ibswitch_precheck (Pre-requisites Check)
./patchmgr -ibswitches ibswitches.lst -upgrade (Actual Upgrade)
The output should show SUCCESS. If there are errors, correct them and run the upgrade command again.
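To confirm the firmware level on each switch after the upgrade, the switch's own version command can be used (a minimal check; repeat for every switch in ibswitches.lst):
# ssh root@inblrdrsw-iba0 version
(the reported switch firmware version should match the target level stated in the patch README)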
############################ Post Activities and Checks #############################
CellCLI> ALTER PHYSICALDISK 20:4 reenable force;
Physical disk 20:4 was reenabled.
CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=harddisk
20:0 R7K4ND normal
20:1 R7LPXD normal
20:2 R7P2ND normal
20:3 R7ESGD normal
20:4 R7H27D normal --- issue resolved
20:5 R7PK9D normal
20:6 R7GWJD normal
20:7 R7PL2D normal
20:8 R7DN1D normal
20:9 R7EASD normal
20:10 R748SD normal
20:11 R6X83D normal
Run the below command on ASM1
SQL> alter diskgroup DATA_DR add disk 'o/192.168.10.3/DATA_DR_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DATA_DR_CD_03_inblrdrceladm01' force;
Run the below command on ASM2
SQL> alter diskgroup DBFS_DG add disk 'o/192.168.10.3/DBFS_DG_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DBFS_DG_CD_03_inblrdrceladm01' force;
SQL> alter diskgroup RECO_DR add disk 'o/192.168.10.3/RECO_DR_CD_03_inblrdrceladm01' force,'o/192.168.10.3/RECO_DR_CD_04_inblrdrceladm01' force;
[oracle@inblrdrdbadm01 ~]$ . ASM.env
[oracle@inblrdrdbadm01 ~]$
[oracle@inblrdrdbadm01 ~]$ sqlplus "/as sysasm"
SQL> alter diskgroup DATA_DR add disk 'o/192.168.10.3/DATA_DR_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DATA_DR_CD_03_inblrdrceladm01' force;
Diskgroup altered.
SQL> exit
[oracle@inblrdrdbadm01 ~]$ ssh inblrdrdbadm02
Last login: Thu Aug 14 11:13:52 2014 from inblrdrdbadm01.tajhotels.com
[oracle@inblrdrdbadm02 ~]$ . ASM.env
[oracle@inblrdrdbadm02 ~]$ sqlplus "/as sysasm"
SQL*Plus: Release 11.2.0.3.0 Production on Thu Aug 14 19:24:13 2014
Copyright (c) 1982, 2011, Oracle. All rights reserved.
Connected to:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Real Application Clusters and Automatic Storage Management options
SQL> alter diskgroup DBFS_DG add disk 'o/192.168.10.3/DBFS_DG_CD_04_inblrdrceladm01' force,'o/192.168.10.3/DBFS_DG_CD_03_inblrdrceladm01' force;
Diskgroup altered.
SQL> alter diskgroup RECO_DR add disk 'o/192.168.10.3/RECO_DR_CD_03_inblrdrceladm01' force,'o/192.168.10.3/RECO_DR_CD_04_inblrdrceladm01' force;
Diskgroup altered.
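Force-adding the disks back kicks off an ASM rebalance; its progress can be watched from either ASM instance with a quick query (a sketch):
SQL> select inst_id, group_number, operation, state, power, sofar, est_work, est_minutes from gv$asm_operation;
(no rows returned means the rebalance has completed)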
[root@inblrdrceladm01 ~]# cellcli -e list griddisk attributes name,status,asmmodestatus,asmdeactivationoutcome
DATA_DR_CD_00_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_01_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_02_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_03_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_04_inblrdrceladm01 active ONLINE Yes
DATA_DR_CD_05_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_02_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_03_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_04_inblrdrceladm01 active ONLINE Yes
DBFS_DG_CD_05_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_00_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_01_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_02_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_03_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_04_inblrdrceladm01 active ONLINE Yes
RECO_DR_CD_05_inblrdrceladm01 active ONLINE Yes
# imageinfo
Kernel version: 2.6.39-400.128.17.el5uek #1 SMP Tue May 27 13:20:24 PDT 2014 x86_64
Cell version: OSS_11.2.3.3.1_LINUX.X64_140708
Cell rpm version: cell-11.2.3.3.1_LINUX.X64_140708-1
Active image version: 11.2.3.3.1.140708
Active image activated: 2014-08-14 13:03:50 +0530
Active image status: success
Active system partition on device: /dev/md6
Active software partition on device: /dev/md8
# imagehistory
Version : 11.2.3.2.1.130109
Image activation date : 2013-09-24 13:49:36 +0530
Imaging mode : fresh
Imaging status : success
Version : 11.2.3.3.1.140708
Image activation date : 2014-08-14 13:03:50 +0530
Imaging mode : out of partition upgrade
Imaging status : success
Finally, this marathon activity completed successfully in approximately 18.5 hours, owing to the rolling fashion and the 11 RDBMS Homes that needed BP24 followed by catbundle. In non-rolling fashion it would have taken about a third of that, roughly 6 hours, but at the cost of downtime, which wasn't possible for us. :)
Thanks & Have a Happy Reading - Manish