Saturday, February 14, 2015

Critical Oracle Database Recovery without Archive Logs - Redo Logs / Force Open Database in an Inconsistent State

Recently, we had a situation where we restored Production backup (Database is 11.2.0.3 RAC on Exadata/ASM) of 14th Jan, 2015 through Netvault tape backup due to some major issue in finance to get the status/correction of critical tables  (AP, XLA, GL Schemas) of our eBusiness Suite Database. 

We had successfully restored all datafiles through Netvault tape backup however we found that some archives were missing which were required for recovery due to which we unable to open database in resetlogs.

There was no way to generate required archive logs again or to take incremental backup from the needed SCN from production database to roll forward the clone database until the issue occurred.

But the above feature is still not there in Oracle Database that to take backup from SCN until SCN. We can  only create incremental backups from SCN until current time. (which we used to roll forward standby database using RMAN Incremental Backup)

Although there is an enhancement request to get this feature implemented in newer releases, it is still under consideration: Bug 20327651 CREATING INCREMENTAL BACKUP WITH STARTING SCN AND ENDING SCN


We had no other option but to open database in an inconsistent state. So to open database in inconsistent state, we have followed below steps. (PLEASE REMEMBER THAT THE  INSTRUCTIONS HERE ARE DESTRUCTIVE. YOU ARE STRONGLY ADVISED TO BACKUP THE DATABASE BEFORE PROCEEDING. IF YOU DO NOT DO THIS YOU MAY LOSE THE CHANCE TO TRY OTHER OPTIONS AND I'll NOT BE RESPONSIBLE FOR ANY DAMAGE)

IF THE STEPS BELOW DO ALLOW YOU TO OPEN YOUR DATABASE THEN IT IS ESSENTIAL THAT THE DATABASE BE REBUILT AS IT IS NO LONGER SUPPORTED. FAILURE TO DO SO MAY LEAD TO DATA DICTIONARY INCONSISTENCIES, INTERNAL ERRORS AND CORRUPTIONS. 

Steps to attempt to force the database open:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 


1) Backup the database while the database is closed.

2) Disable this database from EM if running. We also need to disable RAC to avoid Auto restart.

3) If your datafiles are from different points in time, it is best to try to
use system tablespace datafiles at a similar timestamp to the OLDEST files
you have online in the database. This reduces the chance that you will get
problems during the bootstrap phase of opening the database.

4) Edit your init.ora file to change undo_management and add two parameters.

* Change UNDO_MANAGEMENT=AUTO to

UNDO_MANAGEMENT=MANUAL

* Remove or comment out UNDO_TABLESPACE and UNDO_RETENTION.

* Add

CLUSTER_DATABASE=FALSE
JOB_QUEUE_PROCESSES=0
_ALLOW_RESETLOGS_CORRUPTION = TRUE
_CORRUPTED_ROLLBACK_SEGMENTS =(comma separated list of Automatic Undo segments)

Example:

_CORRUPTED_ROLLBACK_SEGMENTS = (_SYSSMU1$, _SYSSMU2$, _SYSSMU3$, _SYSSMU4$,
_SYSSMU5$, _SYSSMU6$, _SYSSMU7$, _SYSSMU8$, _SYSSMU9$, _SYSSMU10$)

Note, sometimes the alert log will tell you what Automatic Undo segments are in
use. Search the alert log for SYSS. If the alert log does not contain that
information then use _SYSSMU1$ through _SYSSMU10$ as shown in the example above.

From 11g onwards the names of the UNDO-segments are a little different and have an extension:

_SYSSMU< nn>_< mmmmm>$) ( example _SYSSMU10_1221075320$)

Be sure to include the additional numbers in the UNDO-segment name as well.

In UNIX you can issue this command on SYSTEM datafile(s) to get the undo segment names:

$ strings system01.dbf | grep _SYSSMU | cut -d $ -f 1 | sort -u

If files are in ASM, you can use RMAN command to copy them to filesystem before using strings command, e.g. :

RMAN> backup as copy datafile 1 format '/tmp/system01.dbf' ;

You should run above command on all SYSTEM datafiles as undo segments may reside in more than one SYSTEM datafile.
To list the name of SYSTEM datafiles, run below command in at least MOUNT mode:

SQL> select file#, f.name from v$datafile f, v$tablespace t where f.ts#=t.ts# and t.name='SYSTEM' order by file# ;

From the output of the strings command above, add a $ to end of each _SYSSMU undo segment name.

* If you only have a spfile available, you can from the closed, nomount or the
mount stage create an init.ora file as follows:

SQL> CREATE PFILE FROM SPFILE;

Do NOT edit the SPFILE.

5) Invoke SQL*Plus, startup mount, check that correct init.ora was used and
all datafiles are in the status of online or system.

$ sqlplus "/as sysdba"

SQL> startup mount pfile = (full path / file name to init.ora)
Confirm that the hidden parameters from step 3 were used:

SQL> show parameters corrupt

You should see both hidden parameters listed. If not, the wrong init.ora
may have been modified. Do not continue until "show parameters corrupt" shows
both hidden parameters.

SQL> show parameters undo

You should see undo management set to manual. If not, the wrong init.ora
may have been modified. Do not continue until "show parameters undo" shows
undo management as manual.

Check that all files you want to open with are listed as ONLINE or as SYSTEM.

SQL> select name, file#, status from v$datafile where status not in
('SYSTEM', 'ONLINE');

If any rows are returned from the query above, bring the file(s) online with:

SQL> ALTER DATABASE DATAFILE file# ONLINE;

and repeat until there are no files in an OFFLINE status. If any file remains or
changes into "recover" status after you try to online the file proceed to step 5.

6) Perform a fake incomplete recovery then open the database with resetlogs.

SQL> recover database using backup controlfile until cancel;

WHEN PROMPTED FOR AN ARCHIVELOG FILE TYPE cancel THEN PRESS ENTER.

SQL> ALTER DATABASE OPEN RESETLOGS;

7) If the database opens try selecting from a table. For example:

SQL> SELECT SYSDATE FROM DUAL; 


With the database open and functional you should attempt to export the database
IMMEDIATELY. Since database is unstable, don't try another shutdown/startup unless needed.
Once you have an export the database MUST be recreated from scratch.
This means dropping and deleting ALL datafiles and creating a new database from
scratch.

A database which has been opened in this way but not rebuilt will not be
supported by Oracle. Any delay in exporting the contents or any attempt to
use the system may cause irreparable damage.

NOTE: BE SURE TO REVERSE / REMOVE THE INIT.ORA PARAMETERS ADDED IN STEP 3
OTHERWISE YOU MAY ACCIDENTALLY CORRUPT ANY NEW DATABASE CREATED USING THE SAME
INIT.ORA FILE. 


In our case, OPEN RESETLOGS was failing with below,

SQL> ALTER DATABASE OPEN RESETLOGS;
ALTER DATABASE OPEN RESETLOGS
*
ERROR at line 1:
ORA-01092: ORACLE instance terminated. Disconnection forced
ORA-00704: bootstrap process failure
ORA-00704: bootstrap process failure
ORA-00604: error occurred at recursive SQL level 1
ORA-01555: snapshot too old: rollback segment number 135 with name
"_SYSSMU135_2708413105$" too small
Process ID: 67764
Session ID: 627 Serial number: 3 


Also RESETLOGS Clause made datafiles header updated and controlfile type from BACKUP to CURRENT, so we had to restore Controlfile from backup again along with CATALOGing all datafiles again.

Database was been forced open. All files were not in sync. File having block scn higher than database scn The issue was due to when forcing open the ora-1555 means one of the undo segments had a higher SCN. Thus needing the _MINIMUM_GIGA_SCN.  

===== Below is content from Metalink Note ID to identify how much to set _MINIMUM_GIGA_SCN ======


If no trace file is available for Ora-1555 please do the following to generate a trace file for 704 And ora-1555 :-

SQL>Startup mount ;

SQL>Alter session set tracefile_identifier=new1555 ;

SQL>Alter session set events '1555 trace name errorstack forever, level 3';

Alternatively you can set the event in the init.ora file (This would need database to be restarted and mounted for parameter to come into effect)

 Event = "1555 trace name errorstack level 3"

Now try open resetlogs

SQL>Alter database open resetlogs ;

It will error out with ORA-1092

Alert log would show 1555 error message.

Go to udump or trace directory

ls -lrt *new1555*

Case one undo segment name is reported in alert log

ORA-01555: snapshot too old: rollback segment number 11 with name "_SYSSMU11$" too small

In the above case the Ora-1555 was reported on _SYSSMU11$ which is undo segment number 11.

Hex value of same is 11 --> b

Search the trace file from Table or index block header dump whose transaction layer has an undo segment 11 been used in the ITL.Scroll up to get the Buffer header dump of that block
BH (0xacff8e48) file#: 1 rdba: 0x0040003e (1/62) class: 1 ba: 0xacf66000
set: 3 blksize: 8192 bsi: 0 set-flg: 2 pwbcnt: 0
dbwrid: 0 obj: 18 objn: 18 tsn: 0 afn: 1
hash: [d0e04c98,d0e04c98] lru: [acff8fd8,acff8db8]
ckptq: [NULL] fileq: [NULL] objq: [cc9a3d00,cc9a3d00]
use: [ce699910,ce699910] wait: [NULL]
st: CR md: EXCL tch: 0
cr: [scn: 0x0.4124e1e],[xid: 0x6.0.c28d],[uba: 0x820075.ea1.23],[cls: 0x0.46b5261],[sfl: 0x1]
flags:
Using State Objects
----------------------------------------
SO: 0xce6998d0, type: 24, owner: 0xd04439e8, flag: INIT/-/-/0x00
(buffer) (CR) PR: 0xd02fa378 FLG: 0x500000
class bit: (nil)
kcbbfbp: [BH: 0xacff8e48, LINK: 0xce699910]
where: kdswh02: kdsgrp, why: 0
buffer tsn: 0 rdba: 0x0040003e (1/62)
scn: 0x0000.046b527a seq: 0x00 flg: 0x00 tail: 0x527a0600
frmt: 0x02 chkval: 0x0000 type: 0x06=trans data
Hex dump of block: st=0, typ_found=1
Dump of memory from 0x00000000ACF66000 to 0x00000000ACF68000
0ACF67FF0 FFFF02C1 01FF8001 02C10280 527A0600 [..............zR]
.
.
.

Block header dump: 0x0040003e
Object id on Block? Y
seg/obj: 0x12 csc: 0x00.46b519e itc: 1 flg: - typ: 1 - DATA
fsl: 0 fnx: 0x0 ver: 0x01

Itl Xid Uba Flag Lck Scn/Fsc
0x01 0x000b.00b.00000e7a 0x00802042.00db.1a C--- 0 scn 0x0000.04624228

So here we see ITL allocated is 0x01.

Transaction identifier --> <undo no>.<slot>.<wrap> ---> 0x000b.00b.00000e7a

Undo segment no ---> 0x000b --> 11 in decimal .

This block belong to 0x0040003e --(1/62)

Find the SCN of this block from the BH(Buffer header) for 0x0040003e --(1/62)

So in this case its the one highlighted above in trace scn: 0x0000.046b527a

Now set _mimimum_giga_scn value based on this SCN

scn: 0x0000.046b527a 

Convert 0x0000 --> Decimal --> 0
Convert 046b527a --> Decimal -->74142330

Combine both these value and find the value for _minimum_giga_scn
Convert --> 074142330 /1024/1024/1024 =0.069050
Add + 2G to the above value and round it up

_minimum_giga_scn = 3  

Set this parameter in the pfile along with the other force open parameter

This parameter has been deprecated starting in CPUJAN2012 which is included in the next PSU versions or above:

10.2.0.4.11
10.2.0.5.6
11.1.0.7.10
11.2.0.2.5
11.2.0.3.1
 
SQL>Startup mount pfile=<> ;

SQL>recover database using backup controlfile until cancel ;
Cancel

SQL>Alter database open resetlogs ;

Try to shutdown and startup DB several times, sometimes the SCN increases in DB bounce are enough to bypass the issue.  ### in case if you still face issue while opening a database.

sql>shutdown; 
sql>startup; 

Do it many times (10-15) and lets see if we can bypass the issue... 
====================================================================

so you might be required to change compatible parameter as well followed by restoring controlfiles / datafiles again.

So key parameters are to force open database in an inconsistent state as below,

_minimum_giga_scn
adjust_scn event or _smu_debug_mode=268435456 (deprecated) ### Not Used
_allow_resetlogs_corruption
_corrupted_rollback_segments
compatible ### If required



If you still face any issue, while doing this critical recovery of database then contact me on my mail id: manish.nashikkar@hotmail.com or give me call on +91 - 9920083054 at support cost.

Thanks,
Manish Nashikkar

Thursday, February 12, 2015

Track and Trace E-Business Suite Concurrent Request - EBS Logs, O/S Process IDs, Locks

I often get asked to take a look at an Oracle eBusiness Suite concurrent request to see what it is doing, this can come from a few different angles:

What requests are currently running?
I have an operating system process that is taking too much CPU - what is it doing?
Can you tell me where the request is at? I've clicked on the log but it doesn't show anything!
My request is taking too long - can you check for blocking locks?
There are a number of strategies to track and trace where things are at for a running request, these include:

Monitor the currently running requests in fnd_concurrent_requests
Checking the v$sqlarea to see what SQL statement or PL/SQL is running
Tailing the concurrent request log / output file while it is being written to near realtime - prior to request completing
Checking for locks blocking the concurrent request

set pages 9999 feed on lines 150
col user_concurrent_program_name format a40 head PROGRAM trunc
col elapsed format 9999
col request_id format 9999999 head REQUEST
col user_name format a12
col oracle_process_id format a5 head OSPID
col inst_name format a10
col sql_text format a30
col outfile_tmp format a30
col logfile_tmp format a30

REM *********************
REM **** RAC VERSION ****
REM *********************
select /*+ ordered */ 
       fcp.user_concurrent_program_name
,      fcr.request_id
,      round(24*60*( sysdate - actual_start_date )) elapsed
,      fu.user_name
,      fcr.oracle_process_id
,      sess.sid
,      sess.serial#
,      inst.inst_name
,      sa.sql_text
,      cp.plsql_dir || '/' || cp.plsql_out outfile_tmp
,      cp.plsql_dir || '/' || cp.plsql_log logfile_tmp
from   apps.fnd_concurrent_requests fcr
,      apps.fnd_concurrent_programs_tl fcp
,      apps.fnd_concurrent_processes cp
,      apps.fnd_user fu
,      gv$process pro
,      gv$session sess
,      gv$sqlarea sa
,      sys.v_$active_instances inst
where  fcp.concurrent_program_id = fcr.concurrent_program_id
and    fcp.application_id = fcr.program_application_id
and    fcr.controlling_manager = cp.concurrent_process_id
and    fcr.requested_by = fu.user_id (+)
and    fcr.oracle_process_id = pro.spid (+)
and    pro.addr = sess.paddr (+)
and    pro.inst_id = sess.inst_id (+)
and    sess.sql_address = sa.address (+)
and    sess.sql_hash_value = sa.hash_value (+)
and    sess.inst_id = inst.inst_number (+)
and    fcr.phase_code = 'R' /* only running requests */
;

PROGRAM                                   REQUEST ELAPSED USER_NAME    OSPID        SID    SERIAL# INST_NAME  SQL_TEXT                       OUTFILE_TMP                    LOGFILE_TMP
---------------------------------------- -------- ------- ------------ ----- ---------- ---------- ---------- ------------------------------ ------------------------------ ------------------------------
Workflow Background Process               2960551       1 TCS-ADMIN     24814        130      29699 APPLPROD1  BEGIN WF_ENGINE.BACKGROUNDCONC /usr/tmp/o0068194.tmp          /usr/tmp/l0068194.tmp
                                                                                                              URRENT(:errbuf,:rc,:A0,:A1,:A2
                                                                                                              ,:A3,:A4,:A5); END;

1 row selected.

From the above we can see key information:

The running Concurrent Request Program Name and Request_ID
The database node operating system process id (OSPID) so we can monitor usage via top / iostat / vmstat
The SID / Serial in case we want to kill the session via alter system kill session '130,29699';
The instance name the database session is running on in a RAC environment
The currently running SQL text
The temporary files where concurrent request log/out is being written to via utl_file while running. These files are copied over to the Concurrent Tier $APPLCSF/$APPLOUT and $APPLLOG after completion of the request.
We can break out the above into a few queries and procedures to drill into specific information information from the core EBS tables and DBA v$ views

1. Get concurrent requests running, temp files and Operating System PID

col user_concurrent_program_name format a40 head PROGRAM trunc
col elapsed format 9999
col request_id format 9999999 head REQUEST
col user_name format a12
col oracle_process_id format a5 head OSPID
select  fcp.user_concurrent_program_name
,       fcr.request_id
,       round(24*60*( sysdate - actual_start_date )) elapsed
,       fu.user_name
,       fcr.oracle_process_id
from    apps.fnd_concurrent_requests fcr
,       apps.fnd_concurrent_programs_tl fcp
,       apps.fnd_user fu
where   fcp.concurrent_program_id = fcr.concurrent_program_id
and     fcp.application_id = fcr.program_application_id
and     fu.user_id = fcr.requested_by
and     fcr.phase_code = 'R';

PROGRAM                                   REQUEST ELAPSED USER_NAME    OSPID
---------------------------------------- -------- ------- ------------ -----
Virtuate GL OLAP Data Refresh             2960541       5 TCS-ADMIN    21681


2. Get current running sql from instance (RAC inst_id/name) where request is running

col inst_name format a10
col sql_text format a30
col module format a20

REM *********************
REM **** RAC VERSION ****
REM *********************
select sess.sid
,      sess.serial#
,      sess.module
,      sess.inst_id
,      inst.inst_name
,      sa.fetches
,      sa.runtime_mem
,      sa.sql_text
,      pro.spid
from   gv$sqlarea sa
,      gv$session sess
,      gv$process pro
,      sys.v_$active_instances inst
where  sa.address = sess.sql_address
and    sa.hash_value = sess.sql_hash_value
and    sess.paddr = pro.addr
and    sess.inst_id = pro.inst_id
and    sess.inst_id = inst.inst_number (+)

and    pro.spid = &OSPID_from_running_request;

If you're running something that has long SQL statements, get the full SQL Statement by selecting from v$sqltext_with_newlines as follows

select t.sql_text
from   gv$sqltext_with_newlines t, s.inst_id gv$session s
where  s.sid = &SID
and    s.sql_address = t.address

order by t.piece

3. Find currently spooling temp file from request

col outfile format a30
col logfile format a30
select cp.plsql_dir || '/' || cp.plsql_out outfile
,      cp.plsql_dir || '/' || cp.plsql_log logfile
from  apps.fnd_concurrent_requests cr
,     apps.fnd_concurrent_processes cp
where cp.concurrent_process_id = cr.controlling_manager
and cr.request_id = &request_id;

OUTFILE                        LOGFILE
------------------------------ ------------------------------

/usr/tmp/PROD/o0068190.tmp     /usr/tmp/PROD/l0068190.tmp

If locks are the potential problem, then drill into those:

Currently held locks per concurrent request

set lines 150
col object_name format a32
col mode_held format a15
select /*+ ordered */
       fcr.request_id
,      object_name
,      object_type
,      decode( l.block
             , 0, 'Not Blocking'
             , 1, 'Blocking'
             , 2, 'Global'
             ) status
,      decode( v.locked_mode
             , 0, 'None'
             , 1, 'Null'
             , 2, 'Row-S (SS)'
             , 3, 'Row-X (SX)'
             , 4, 'Share'
             , 5, 'S/Row-X (SSX)'
             , 6, 'Exclusive'
             , to_char(lmode)
             ) mode_held
from   apps.fnd_concurrent_requests fcr
,      gv$process pro
,      gv$session sess
,      gv$locked_object v
,      gv$lock l
,      dba_objects d
where  fcr.phase_code = 'R'
and    fcr.oracle_process_id = pro.spid (+)
and    pro.addr = sess.paddr (+)
and    sess.sid = v.session_id (+)
and    v.object_id = d.object_id (+)
and    v.object_id = l.id1 (+)
;

I also got these scripts from someone which can be useful and handy at moments / situations.

Cheers !!!