Saturday, August 30, 2014

How to find out if a product is installed in e-Business Suite?

You can find out what products are installed in e-Business Suite, and their patch levels, via Oracle Applications Manager.
You will also find a list of the installed products by running the script <adutconf.sql> ($AD_TOP/sql/adutconf.sql), or in the file <applprod.txt>. You can find the patch level of any product installed in e-Business Suite by running the following query via SQL*Plus:

$ sqlplus apps/<pwd>

SQL> SELECT substr(APPLICATION_SHORT_NAME,1,8) product,
            substr(PRODUCT_VERSION,1,14) version,
            substr(PATCH_LEVEL,1,11) patch_level
     FROM   fnd_application a, fnd_product_installations p
     WHERE  a.application_id = p.application_id
     ORDER  BY application_short_name;

If you know the product code, you can obtain output for individual products with the following SQL:

SQL> SELECT patch_level FROM fnd_product_installations WHERE patch_level LIKE '%<product_code>%';

for example:          WHERE patch_level LIKE '%BIS%'

The product code is the 2- or 3-letter code you see when you search for a patch in Metalink by product family, and in the name of the patchset, e.g. 11i.BIS.L.
As a double check that the database knows that the product is installed, you can run the following query. It should produce the same answer:
SQL> SELECT fa.application_id id, fa.application_short_name app,
            fa.application_name, fpi.status, fpi.patch_level
     FROM   fnd_application_all_view fa, fnd_product_installations fpi
     WHERE  fa.application_id = fpi.application_id
     AND    fpi.patch_level LIKE '%<product_code>%';

Friday, August 29, 2014

How to Speed Up and Troubleshoot MRP (Log Apply Rate of a Standby Database) Stuck Issues

To speed up MRP on a standby database, check the following (a quick check is sketched after the list):

1) parallel_execution_message_size - this is an OS-dependent parameter
2) recovery_parallelism - this is dictated by the number of CPUs and your ability to handle I/O
3) Consider increasing the sga_target parameter if it is set too low.
4) Check for disk I/O. Move your I/O-intensive files, including the online redo log and standby redo log files, to faster disks.
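
As a rough sketch only (the values below are illustrative, not recommendations), these parameters can be checked and adjusted on the standby; recovery_parallelism and parallel_execution_message_size are static, so changes take effect at the next restart:

SQL> show parameter parallel_execution_message_size
SQL> show parameter recovery_parallelism
SQL> show parameter sga_target

SQL> alter system set recovery_parallelism=4 sid='*' scope=spfile;
SQL> alter system set sga_target=8G sid='*' scope=spfile;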

I came across the links below, which I found very useful in this regard:

http://emrebaransel.blogspot.in/2010/06/mrp-speed-log-apply-rate-of-standby.html

How to resolve MRP stuck issues on a physical standby database? (Doc ID 1221163.1)

Steps to perform for Rolling forward a standby database using RMAN Incremental Backup. (Doc ID 836986.1)

Calculating the Required Network Bandwidth Transfer Of Redo In Data Guard Environments (Doc ID 736755.1)

Initially we had 1 RAC Data Guard in place, but after adding 3 more RAC Data Guards (1+3), all our standby databases started lagging behind the primary RAC databases, resulting in a huge gap between them.

The reason was that the network bandwidth wasn't sufficient to handle this redo transport through RFS over the existing network pipe, so I decided to use the formula below to calculate the required network bandwidth so that none of the 4 Data Guards would lag and all would stay in sync with the primary.

The formula used (assuming a conservative TCP/IP network overhead of 30%) for calculating the network bandwidth is :

Required bandwidth = ((Redo rate bytes per sec. / 0.7) * 8) / 1,000,000 = bandwidth in Mbps

Measuring the Peak Redo Rate

Use the Oracle Statspack utility for an accurate measurement of the redo rate.

Based on your business, you should have a good idea of what your peak periods of normal business activity are. For example, you may be running an online store that historically sees peak activity for 4 hours every Monday between 10:00 am - 2:00 pm. Or, you may be running a merchandising database that batch-loads a new catalog every Thursday for 2 hours between 1 am - 3 am. Note that we say "normal" business activity - on certain days of the year you may witness much heavier business volume than usual, e.g. the 2-3 days before Mother's Day or Valentine's Day for an online florist. Just for those days, you may allocate higher bandwidth than usual and not consider them "normal" business activity. However, if such periodic surges of traffic are regularly expected as part of your business operations, you must consider them in your redo rate calculation.

During the peak duration of your business, run a Statspack snapshot at periodic intervals. For example, you may run it three times during your peak hours, each time for a five-minute duration. The Statspack snapshot report will include a "Redo size" line under the "Load Profile" section near the beginning of the report. This line includes the "Per Second" and "Per Transaction" measurements for the redo size in bytes during the snapshot interval. Make a note of the "Per Second" value. Take the highest "Redo size" "Per Second" value of these three snapshots, and that is your peak redo generation rate.
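
For reference, a minimal Statspack sketch (assuming the PERFSTAT schema is already installed; the password is a placeholder): take one snapshot at the start and one at the end of each interval, then run the report between the two snap IDs.

SQL> connect perfstat/<pwd>
SQL> exec statspack.snap;
SQL> -- wait for the interval to elapse, then snap again
SQL> exec statspack.snap;
SQL> @?/rdbms/admin/spreport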

Note that if your primary database is a RAC database, you must run the Statspack snapshot on every RAC instance. Then, for each Statspack snapshot, sum the "Redo Size Per Second" value of each instance to obtain the net peak redo generation rate for the primary database. Remember that for a RAC primary database, each node generates its own redo and independently sends that redo to the standby database - hence the reason to sum up the redo rates for each RAC node to obtain the net peak redo rate for the database.

As an alternative, you can also get the 'Redo rate bytes per sec.' from V$SYSMETRIC_HISTORY, e.g.
SQL> select * from v$sysmetric_history where metric_name = 'Redo Generated Per Sec';
or in an RDA output:
Performance - AWR Report - Statistic: "redo size"
Example:
Let us assume the redo rate is 500 KB/sec.

Required bandwidth = ((Redo rate bytes per sec. / 0.7) * 8) / 1,000,000 = bandwidth in Mbps
Required bandwidth = ((512,000 / 0.7) * 8) / 1,000,000
Required bandwidth = 5.85 Mbps
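
As a hedged SQL sketch of the same calculation, the recent peak per-instance redo rate can be pulled from GV$SYSMETRIC_HISTORY (this assumes the 'Redo Generated Per Sec' metric is reported in bytes per second and only covers the short history kept in that view, so treat it as a sanity check rather than a replacement for Statspack/AWR):

SQL> SELECT ROUND(((SUM(peak_redo_bytes_sec) / 0.7) * 8) / 1000000, 2) required_mbps
     FROM  (SELECT inst_id, MAX(value) peak_redo_bytes_sec
            FROM   gv$sysmetric_history
            WHERE  metric_name = 'Redo Generated Per Sec'
            GROUP  BY inst_id);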


Thursday, August 28, 2014

RMAN Active Duplicate Database from RAC ASM to RAC ASM


Some Tips:

1) Always make sure to have a relative soft link to tnsnames.ora & listener.ora in $ORACLE_HOME/network/admin if they are in a non-default location (as in Oracle EBS, where tnsnames.ora & listener.ora exist in the $ORACLE_HOME/network/admin/SID_hostname directory), e.g. as sketched below.
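
A minimal shell sketch of such links (the SID_hostname directory name here just follows the EBS convention mentioned above; adjust for your environment):

$ cd $ORACLE_HOME/network/admin
$ ln -s <SID>_<hostname>/tnsnames.ora tnsnames.ora
$ ln -s <SID>_<hostname>/listener.ora listener.ora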

2) Make the appropriate TNS entry on all RAC (target / auxiliary) nodes, with the local listener host and port, so that the instance-specific service gets registered for the active duplicate (dynamic registration with SCAN doesn't work). For example:

PREPROD=
    (DESCRIPTION=
        (ADDRESS=(PROTOCOL=tcp)(HOST=inblrdrdbadm01.tajhotels.com)(PORT=1555))
        (CONNECT_DATA=
            (SERVICE_NAME=PREPROD1)
            (INSTANCE_NAME=PREPROD1)
            (UR=A)
        )
    )

3) Set DB_FILE_NAME_CONVERT & LOG_FILE_NAME_CONVERT properly if there are any OMF or non-OMF datafiles / online redo log files within the same disk group.

SQL> show parameter convert

NAME                                 TYPE        VALUE
------------------------------------ ----------- -----------------------------------------------------------------------------
db_file_name_convert                 string      +DATA_DR/EBSPRD, +DATA_DR/PREPROD,
log_file_name_convert                string      +DATA_DR/EBSPRD, +RECO_DR/PREPROD, +RECO_DR/EBSPRD, +RECO_DR/PREPROD
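
For context, a hedged sketch of the active duplicate itself (the connect strings, password placeholders and channel counts below are illustrative, not the exact command used here):

$ rman target sys/<pwd>@EBSPRD auxiliary sys/<pwd>@PREPROD

RMAN> run {
        allocate channel c1 device type disk;
        allocate auxiliary channel a1 device type disk;
        duplicate target database to PREPROD
          from active database
          nofilenamecheck;
      }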


References:

STEP BY STEP RMAN DUPLICATE Database From RAC ASM To RAC ASM (Doc ID 1913937.1)

RMAN DUPLICATE / RESTORE including Standby in ASM with OMF / non-OMF / Mixed Name for Datafile / Online Log / Controlfile (Doc ID 1910175.1)

Error while accessing ASM Instance using asmcmd -p => ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","kglsim object batch") (DBD ERROR: error possibly near <*> indicator at char 33 in '/* ASMCMD */ select version from <*>v$instance')

Today we faced an issue while accessing the asmcmd prompt.

Issue:

From the 1st node of the Exadata RAC:
[nodedbadm01.tajhotels.com -> oracle:/home/oracle] :asmcmd -p
ORA-04031: unable to allocate 3896 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","kglsim object batch") (DBD ERROR: error possibly near <*> indicator at char 33 in '/* ASMCMD */ select version from <*>v$instance')
ASMCMD [+] >

From the 2nd node of the Exadata RAC:
ASMCMD [+RECO_DC/SIEBPRD/ARCHIVELOG] > rm -rf 2014_08_27
ORA-15032: not all alterations performed
ORA-15177: cannot operate on system aliases (DBD ERROR: OCIStmtExecute)

On all our cell nodes, the ASM disks were showing status UNKNOWN:

[root@inmumdcceladm01 ~]# cellcli
CellCLI: Release 11.2.3.2.1 - Production on Mon Sep 01 18:10:11 GMT+05:30 2014

Copyright (c) 2007, 2012, Oracle.  All rights reserved.
Cell Efficiency Ratio: 1,842

CellCLI> LIST GRIDDISK ATTRIBUTES name, asmmodestatus
         DATA_DC_CD_00_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_01_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_02_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_03_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_04_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_05_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_06_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_07_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_08_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_09_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_10_inmumdcceladm01   UNKNOWN
         DATA_DC_CD_11_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_02_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_03_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_04_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_05_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_06_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_07_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_08_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_09_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_10_inmumdcceladm01   UNKNOWN
         DBFS_DG_CD_11_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_00_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_01_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_02_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_03_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_04_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_05_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_06_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_07_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_08_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_09_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_10_inmumdcceladm01   UNKNOWN
         RECO_DC_CD_11_inmumdcceladm01   UNKNOWN

CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=flashdisk
         FLASH_1_0       FL00444G        normal
         FLASH_1_1       FL003YY8        normal
         FLASH_1_2       FL0044EV        normal
         FLASH_1_3       FL0045NR        normal
         FLASH_2_0       FL0044MF        normal
         FLASH_2_1       FL0044ME        normal
         FLASH_2_2       FL0043YZ        normal
         FLASH_2_3       FL0043WL        normal
         FLASH_4_0       FL0044KD        normal
         FLASH_4_1       FL0044EZ        normal
         FLASH_4_2       FL0045F9        normal
         FLASH_4_3       FL0043RF        normal
         FLASH_5_0       FL0044LC        normal
         FLASH_5_1       FL0045K3        normal
         FLASH_5_2       FL0044F6        normal
         FLASH_5_3       FL0045MD        normal

CellCLI> LIST PHYSICALDISK WHERE DISKTYPE=harddisk
         20:0    K8L20N  normal
         20:1    K851NN  normal
         20:2    K8KPXN  normal
         20:3    K8565N  normal
         20:4    K5U4WN  normal
         20:5    K8L29N  normal
         20:6    K860YN  normal
         20:7    K8536N  normal
         20:8    K8KV4N  normal
         20:9    K8L3WN  normal
         20:10   K8L2GN  normal
         20:11   K8KWLN  normal



Reason:
More shared memory is needed than was allocated in the shared pool.

This is identified as Bug 13951456 - ASMCMD does not use bind variables in its SQL - affects shared pool (Doc ID 13951456.8)

Versions confirmed as being affected : 11.2.0.3, 11.2.0.4

The reason the ASM disks show UNKNOWN is that the ASM instance could not communicate with the cell nodes.

Workaround:

1. Flush the shared pool in each ASM instance: SQL> alter system flush shared_pool;

2. Increase sga_target and shared_pool_size at the ASM level from another instance:
SQL> alter system set sga_target=1536M sid='*' scope=spfile;
SQL> alter system set shared_pool_size=500M sid='*' scope=spfile;

and restart all the instances in rolling fashion to implement these changes.

3. Limiting ASMCMD usage for the time being will also help, so reduce the use of asmcmd-level commands until the fix is applied.

4. Bounce the ASM instance in rolling fashion.

Bug Fix List: the 11.2.0.4 Patch Bundles for Oracle Exadata Database Machine (Doc ID 1601749.1)

QUARTERLY DATABASE PATCH FOR EXADATA (JUL2014 - 11.2.0.4.9) - Patch 18840215

Included Content - The following patch bundles were included in 11.2.0.4 BP9 for Exadata

Patch 18522509 - DATABASE PATCH SET UPDATE 11.2.0.4.3 (INCLUDES CPUJUL2014)
Patch 18522515 - OCW PATCH SET UPDATE 11.2.0.4.3
Patch 18825509 - EXADATA DATABASE BUNDLE PATCH 11.2.0.4.9
Patch 18835772 - QUARTERLY DATABASE PATCH FOR EXADATA (JUL 2014 - 11.2.0.3.24)
Patch 18632316 - MONTHLY DATABASE PATCH FOR EXADATA (JUN2014 - 11.2.0.4.8)


We have an 11.2.0.3 Exadata environment, so we are definitely going to apply patch 18835772. :)

Update after 2 days:

We had to restart the ASM instances in rolling fashion after increasing the SGA_TARGET parameter, after which the issue was resolved and the ASM disk status came back to ONLINE.

1) Stop all RAC DB instances on the 1st node using srvctl / sqlplus
2) Stop the ASM instance
3) Create the spfile from pfile for the ASM instance (with *.SGA_TARGET=3G, up from the earlier 500M)
4) Start up the ASM instance
5) Start up all RAC DB instances
6) Perform the same steps on the remaining RAC nodes (a command-level sketch follows)
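
A hedged command-level sketch of one node's cycle (the database, instance and node names below are examples; on a cluster where the OCR/voting files live in ASM, stopping ASM also takes down its dependent resources, hence the -f flag):

$ srvctl stop instance -d PROD -i PROD1      # stop the DB instances running on this node
$ srvctl stop asm -n exadb01 -f              # stop ASM on this node
# adjust the ASM pfile/spfile (SGA_TARGET=3G) as described above, then:
$ srvctl start asm -n exadb01
$ srvctl start instance -d PROD -i PROD1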

The Oracle SR suggested the plan of action below (but our issue was already resolved before that):

FIX: 
=== 
I have verified that the one-off patch 13951456 available on the base 11.2.0.3 release does not conflict with your current version. Apply patch 13951456 to fix the issue. 

About the UNKNOWN value of asmmodestatus, I would recommend restarting cellsrv and checking whether the status is then reflected correctly. 

Wednesday, August 27, 2014

Creating Local Inventory / Central Inventory for EBS R12

I had to apply a patch with OPatch on the 10.1.2 Oracle Home of an EBS R12 (12.1.3) cloned instance. This cloned instance was recently created on a new AIX 7.1 server, hence the inventory wasn't present.

So, to be able to run OPatch, I decided to create the local inventory followed by the central inventory (the reverse is not possible) on the new AIX 7.1 box.
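
As a rough, generic OUI sketch only (the inventory location, group and home name below are assumptions; follow the referenced MOS notes for the exact EBS R12 procedure), the idea is to point to or create the central inventory and then attach the 10.1.2 home to it:

$ cat /etc/oraInst.loc
inventory_loc=/u01/app/oraInventory
inst_group=dba

$ cd $ORACLE_HOME/oui/bin
$ ./runInstaller -silent -attachHome ORACLE_HOME=$ORACLE_HOME \
      ORACLE_HOME_NAME="EBS_10_1_2_HOME" -invPtrLoc /etc/oraInst.loc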

Reference: R12.0 / R12.1 : How To Create, Update or Rebuild The Central Inventory For Oracle Applications E-Business Suite ? (Doc ID 742477.1)

R12.2 : How To Create, Update or Rebuild The Central Inventory For Oracle Applications E-Business Suite ? (Doc ID 1588609.1)

EBS 11i and R12 : Central Inventory & Local Inventory Facts for Install, Clone, Upgrade and Maintaining EBS Instance (Doc ID 1550950.1)

Steps To Recreate Central Inventory(oraInventory) In RDBMS Homes (Doc ID 556834.1)

Steps to Recreate Central Inventory in Real Applications Clusters (Doc ID 413939.1)

Saturday, August 23, 2014

ORA-16191: Primary log shipping client not logged on standby

After Data Guard (physical standby) configuration (RAC or non-RAC), many times I've faced the issue that, initially, archives are not transferred from the primary to the standby database through RFS even though everything is set properly. So I thought I should write about it this time.

One of the causes is: "ORA-16191: Primary log shipping client not logged on standby"

You can check this with the query below, as well as in the alert log:

SQL> select error from v$archive_dest_Status where dest_id=2;

ERROR

-----------------------------------------------------------------
ORA-16191: Primary log shipping client not logged on standby

================ Alert Log Portion (Primary) ================

Suppressing further error logging of LOG_ARCHIVE_DEST_2.

Sat Aug 23 17:43:09 2014
Error 1017 received logging on to the standby
------------------------------------------------------------
Check that the primary and standby are using a password file
and remote_login_passwordfile is set to SHARED or EXCLUSIVE,
and that the SYS password is same in the password files.
returning error ORA-16191

Reason: Check the password file and verify it using "sqlplus sys/xxxx@TNS as sysdba" from all database instances (in case of RAC) of both primary & standby; it should connect.

In my case, though, the password file existed in all locations, and sqlplus connections worked from primary to standby and vice versa, yet I was still getting "ORA-16191: Primary log shipping client not logged on standby".

Workaround/Solution:

1) Defer log_archive_dest_state_2, where log_archive_dest_2 points to the standby location.

SQL> show parameter log_archive_dest_2

NAME                                 TYPE        VALUE
----------------------------------- ----------- ------------------------------
log_archive_dest_2                   string      SERVICE=EDQPRDBLR LGWR ASYNC
                                                 valid_for=(all_logfiles,primary_role)
                                                 db_unique_name=EDQPRD

SQL> alter system set log_archive_dest_state_2=DEFER sid='*' scope=both;

2) Recreate the password file on the 1st RAC instance with the exact syntax below:

$ orapwd file=/u01/app/oracle/product/11.2.0.3/EDQPRD/dbs/orapwEDQPRD1 password=oracle entries=10

[oracle@inmumdcdbadm01 dbs]$ ls -l orapwEDQPRD1
-rw-r----- 1 oracle oinstall 2560 Aug 23 17:48 orapwEDQPRD1

The permission should be as it is shown.

3) Replicate (scp) or recreate the password file on the other RAC instances/servers with the appropriate syntax, and check the size and permissions, e.g. as sketched below.
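
A minimal sketch of the copy (the second hostname and the instance-specific file name below are examples for this environment):

$ scp $ORACLE_HOME/dbs/orapwEDQPRD1 oracle@inmumdcdbadm02:$ORACLE_HOME/dbs/orapwEDQPRD2
$ ssh oracle@inmumdcdbadm02 "ls -l $ORACLE_HOME/dbs/orapwEDQPRD2"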

4) Enable log_archive_dest_state_2 again,

SQL> alter system set log_archive_dest_state_2=ENABLE sid='*' scope=both;

5) Try to switch the logs:

SQL> alter system switch all logfile;

================Check for the Alert Log (Primary) ======================

You should see this message,

Sat Aug 23 17:51:37 2014

******************************************************************
LGWR: Setting 'active' archival for destination LOG_ARCHIVE_DEST_2
******************************************************************
Sat Aug 23 17:52:11 2014

The error also disappears from the query below:

SQL> select inst_id,error from gv$archive_Dest_status where dest_id=2;

   INST_ID ERROR
---------- -----------------------------------------------------------------
         1
         2

Now, the archives generated on the (RAC) primary servers will be transferred through RFS to the (RAC) standby servers, and the gap will be fetched too, provided FAL_CLIENT & FAL_SERVER have been set properly.
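
As a quick verification sketch on the standby (standard views, no environment-specific names), compare the last received and last applied sequence per thread and check for gaps:

SQL> select thread#, max(sequence#) last_received from v$archived_log group by thread#;
SQL> select thread#, max(sequence#) last_applied from v$archived_log where applied='YES' group by thread#;
SQL> select * from v$archive_gap;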

Thanks, Your Comments / Suggestions are welcome - Manish

Wednesday, August 20, 2014

Removing Fatal NI connect error 12170 from Database Alert Log

In one of our 11gR2 production databases, there were many entries related to "Fatal NI connect error 12170", TNS-12535: TNS:operation timed out, and TNS-00505: Operation timed out (with different ports) in the alert log files of both RAC instances.

Problem:


Fatal NI connect error 12170.


  VERSION INFORMATION:

        TNS for Linux: Version 11.2.0.3.0 - Production
        Oracle Bequeath NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
        TCP/IP NT Protocol Adapter for Linux: Version 11.2.0.3.0 - Production
  Time: 20-AUG-2014 14:17:36
  Tracing not turned on.
  Tns error struct:
    ns main err code: 12535

TNS-12535: TNS:operation timed out

    ns secondary err code: 12560
    nt main err code: 505

TNS-00505: Operation timed out

    nt secondary err code: 110
    nt OS err code: 0
  Client address: (ADDRESS=(PROTOCOL=tcp)(HOST=xxx.xxx.xxx.xxx)(PORT=44326))

Reason:


The same error message was repeated throughout the day for almost every application server.

I finally found what was causing the problem. Our database is behind a firewall. The firewall has an "idle session timeout" value; if a connection remains idle for longer than that value, the firewall drops it.

Application developers usually configure their connection pools to remain alive for a long time, longer than the "idle session timeout" value of the firewall. Not all the connections in the pool are used; some stay idle. After some time the firewall drops them and I get those operation-timed-out errors.


Solution / Workaround: 


1) Add the following line to the sqlnet.ora file on the server.


SQLNET.EXPIRE_TIME=10

With this setting, the database will probe the application servers every 10 minutes to see if they are up. In fact, this is a mechanism for the database to detect stale connections. Because packets are sent to the application servers every ten minutes, the connections are seen as active by firewalls and are not dropped.


**In an installation that includes Grid Infrastructure, this parameter should be set in the RDBMS_HOME/network/admin/sqlnet.ora file. This is the default location for sqlnet.ora parameters referenced by the instance.

2) One way to minimize the impact is to use the parameter SQLNET.INBOUND_CONNECT_TIMEOUT (defaults to 60 seconds in 10gR2 and 11g), but sometimes this value is not adequate.

Oracle also mentions that this error can occur if you use DB Console or Enterprise Manager to monitor your databases: the EM agent will try to connect to the target database repeatedly and, statistically, some attempts will fail (the frequency depends on how busy your system is).
Most of the time (certainly for DB Console and the Enterprise Manager agent) the application will try to connect again and it will succeed.
To fix the problem you could increase the value of SQLNET.INBOUND_CONNECT_TIMEOUT (in seconds) in sqlnet.ora / CONNECT_TIMEOUT_<DB_Name> (in minutes) in the listener.ora file located on the server side.
3) If you already have a value you consider adequate, you might want to add the following lines:
DIAG_ADR_ENABLED_<listener_name>=OFF  (in listener.ora)
DIAG_ADR_ENABLED=OFF  (in sqlnet.ora)
These lines will hide the error from the alert.log file; the error will instead be posted to the $ORACLE_HOME/network/log/sqlnet.log file.
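
For illustration only, a combined server-side sketch of the parameters discussed above (the values and the listener name LISTENER are examples, not recommendations; INBOUND_CONNECT_TIMEOUT_<listener_name> is the listener-side counterpart of the sqlnet.ora parameter):

# sqlnet.ora (RDBMS home, server side)
SQLNET.EXPIRE_TIME=10
SQLNET.INBOUND_CONNECT_TIMEOUT=120
DIAG_ADR_ENABLED=OFF

# listener.ora
INBOUND_CONNECT_TIMEOUT_LISTENER=110
DIAG_ADR_ENABLED_LISTENER=OFF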

Tuesday, August 19, 2014

Multiplexing Control File 11gR2 RAC with ASM

1. Connect to the database from any one of the RAC nodes and check the current control file status. Here, I have my control file in the +DATA_DC disk group and I want to add another disk group, +RECO_DC, for multiplexing/mirroring.

SQL> show parameter control_files

NAME                                 TYPE        VALUE
------------------------------------ ----------- ----------------------------------------------------
control_files                        string      +DATA_DC/PROD/controlfile/current.41119.852341131

SQL> select name from v$controlfile;

NAME
--------------------------------------------------------------------------------
+DATA_DC/PROD/controlfile/current.41119.852341131

2. Add the +RECO_DC disk group to the control_files parameter by issuing the command below.

SQL> alter system set control_files='+DATA_DC/PROD/controlfile/current.41119.852341131','+RECO_DC' scope=spfile;

System altered.

3. Shutdown the RAC database and start the database in nomount mode.

[oracle@dbrac1 ~]$ srvctl stop database -d PROD

[oracle@dbrac1 ~]$ srvctl status database -d PROD
Instance PROD1 is not running on node inmumdcdbadm01
Instance PROD2 is not running on node inmumdcdbadm02

4. Connect to the DB, start it in nomount, and you will see +RECO_DC as part of the control_files parameter.

[oracle@dbrac1 ~]$ sqlplus "/as sysdba"

SQL> startup nomount
ORACLE instance started.

Total System Global Area 1.7103E+10 bytes
Fixed Size                  2245512 bytes
Variable Size            3724545144 bytes
Database Buffers         1.3355E+10 bytes
Redo Buffers               21708800 bytes

SQL> show parameter control_files

NAME                                 TYPE        VALUE
------------------------------------ ----------- --------------------------------------------------------------
control_files                        string      +DATA_DC/PROD/controlfile/current.41119.852341131, +RECO_DC

5. Connect to RMAN and issue a restore command to create a copy of the control file in the +RECO_DC disk group.

[oracle@dbrac1 ~]$ rman target /

Recovery Manager: Release 11.2.0.3.0 - Production on Tue Aug 19 19:06:13 2014

Copyright (c) 1982, 2011, Oracle and/or its affiliates.  All rights reserved.

connected to target database: PROD (not mounted)

RMAN> restore controlfile from '+DATA_DC/PROD/controlfile/current.41119.852341131';

Starting restore at 19-AUG-14
using target database control file instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: SID=1415 instance=PROD1 device type=DISK

channel ORA_DISK_1: copied control file copy
output file name=+DATA_DC/PROD/controlfile/current.41119.852341131
output file name=+RECO_DC/PROD/controlfile/current.3123.856033587
Finished restore at 19-AUG-14

RMAN> exit

6. Connect back to the database and issue the alter command with the exact path of the file that was created by the RMAN restore.

[oracle@dbrac1 ~]$ sqlplus "/as sysdba"

SQL> alter system set control_files='+DATA_DC/PROD/controlfile/current.41119.852341131','+RECO_DC/PROD/controlfile/current.3123.856033587' scope=spfile;

System altered.

7. Now, shut down the instance and start the database with srvctl, using the commands below.

SQL> Shutdown immediate;

ORA-01507: database not mounted
ORACLE instance shut down.

[oracle@dbrac1 ~]$ srvctl start database -d PROD
Instance PROD1 is running on node inmumdcdbadm01
Instance PROD2 is running on node inmumdcdbadm02

8. Check by connecting to the RAC database; you will see the two copies of the control file in the two different disk groups, +DATA_DC and +RECO_DC.

[oracle@dbrac1 ~]$ sqlplus "/ as sysdba"

SQL> select name from v$controlfile;

NAME
--------------------------------------------------------------------------------
+DATA_DC/PROD/controlfile/current.41119.852341131
+RECO_DC/PROD/controlfile/current.3123.856033587

SQL> exit

Saturday, August 16, 2014

Bug 16562733 : NODE EVICTION DUE TO FAILED IO OF VOTING DISK FROM CELL SERVER

For the last couple of days in our Exadata environment, we were facing an issue of the Exadata database servers rebooting intermittently. It was because cssd was crashing after the voting disk went offline.


We could see the voting file I/O error reported and then the voting disk going offline, which triggered the random node reboots.



ocssd.log ( inblrdrdbadm01) 

~~~~~~~~~~~~~~~~~~~~~~~~~~~ 
2014-08-05 16:27:19.041: [ SKGFD][1107020096]ERROR: -10(OSS Operation ioerror failed with error 12 [Network error]
)
2014-08-05 16:27:19.041: [ CSSD][1107020096](:CSSNM00060:)clssnmvReadBlocks: read failed at offset 16 of o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02
2014-08-05 16:27:19.041: [ CSSD][1107020096]clssnmSetupReadLease: status 1
...............
2014-08-05 16:27:26.021: [ CSSD][1097800000]clssnmvStatusBlkInit: myinfo nodename inblrdrdbadm01, uniqueness 1407150297
2014-08-05 16:27:26.021: [ CSSD][1097800000]clssnmvDiskAvailabilityChange: voting file o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02 now online
2014-08-05 16:27:26.022: [ SKGFD][1107020096]ERROR: -10(OSS Operation oss_open failed with error 5 [Failed to connect to a cell]
)
2014-08-05 16:27:26.022: [ CSSD][1107020096]clssnmvGetDiskHandle: Unable to open disk o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02
2014-08-05 16:27:26.022: [ CSSD][1107020096]clssnmvWorkerThread:failed to open o/192.168.10.4/DBFS_DG_CD_02_inblrdrceladm02
2014-08-05 16:27:26.022: [ CSSD][1107020096]###################################
2014-08-05 16:27:26.022: [ CSSD][1107020096]clssscExit: CSSD signal 11 in thread clssnmvWorkerThread
2014-08-05 16:27:26.022: [ CSSD][1107020096]###################################
2014-08-05 16:27:26.022: [ CSSD][1107020096](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
2014-08-05 16:27:26.022: [ CSSD][1107020096] 


cssdOUT.log 
~~~~~~~~~~~~ 
08/04/14 16:27:26: CSSD starting
08/04/14 16:34:56: CSSD starting
08/05/14 16:27:26: CSSD handling signal 11
08/05/14 16:27:26: Dumping CSSD state and exiting 



Oracle Support diagnosed it further as the same issue.


----- Call Stack Trace -----
(argument values in hex omitted; calling location, call type and entry point as reported)

clssscExit()+740                  call    kgdsdst()
s0clsssc_sighandler()+611         call    clssscExit()
__sighandler()                    call    s0clsssc_sighandler()
clsfInitIO()+40                   signal  __sighandler()
clssnmvReadBlocks()+1136          call    clsfInitIO()
clssnmvVoteDiskValidation()+130   call    clssnmvReadBlocks()
clssnmvWorkerThread()+1183        call    clssnmvVoteDiskValidation()
clssscthrdmain()+253              call    clssnmvWorkerThread()
start_thread()+221                call    clssscthrdmain()


The above call stack was reported at every node reboot. 

This is due to Bug 16562733 - CSSD crash / node eviction due to failed IO against the voting disk.

Rediscovery information 
~~~~~~~~~~~~~~~~~~~~~~~~ 
cssd may crash when a voting disk open fails 

Rediscovery Notes:
cssd crashes with a stack like:
clssscExit()+740
s0clsssc_sighandler
__sighandler()
clsfInitIO()+40
clssnmvReadBlocks()
clssnmvVoteDiskValidation()+130
clssnmvWorkerThread
clssscthrdmain()+25
start_thread()+221 


Workaround: 
None 

As planned, we applied 11.2.0.3 Bundle Patch 24 (Jul-2014 PSU), which also fixed this issue.

Bug 16562733 has been fixed in 11.2.0.3 BP 24 as part of Grid PSU 11.2.0.3.9.

Monday, August 11, 2014

REP-0004: Warning: Unable to open user preference file

Symptoms:
All concurrent requests that run REPORTS are printing the following warning in the request log.
“REP-0004: Warning: Unable to open user preference file”
Your local (customized) Oracle Reports preference file could not be opened.
This is just a warning, so the product will continue to run even if this occurs.
The possible causes of this error include the following:
CAUSE 1: The file was not found under the specified name in the specified location
CAUSE 2: You lacked the necessary privileges to open the file.

SOLUTION [ID 1120529.1] :
To resolve the warning, copy the prefs.ora file from your Reports Builder $ORACLE_HOME/tools/admin/ directory into the Applications $HOME directory.

Example:
Copy the file prefs.ora from the directory /u01/prod/apps/tech_st/10.1.2/tools/admin to the directory /home/applprod/ (a one-line sketch follows).
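A minimal shell sketch of this copy (paths taken from the example above; applprod is assumed to be the applications OS user):

$ cp /u01/prod/apps/tech_st/10.1.2/tools/admin/prefs.ora /home/applprod/prefs.ora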
This should solve the Problem !!!

Wednesday, August 6, 2014

Exadata Commands to generate Logs for Troubleshooting

1. Specify the exact time of the issue to focus on. 

2. /var/log/messages file covering the time of reboot from both nodes. 

3. Opatch 

cd $ORACLE_HOME/OPatch 

./opatch lsinventory 

4) Please upload diagcollection.sh output files from all DB nodes. 

# to know the eviction information from cluster front. 

$GRID_HOME/bin/diagcollection.sh 
For diagcollection refer to CRS 10gR2/ 11gR1/ 11gR2 Diagnostic Collection Guide ( Doc ID 330358.1 ) 

# script /tmp/diag.log 
# id 
# env 
# cd <temp-directory-with-plenty-free-space> 
# $GRID_HOME/bin/diagcollection.sh 
# exit 

The following .gz files will be generated in the current directory and need to be uploaded along with /tmp/diag.log: 

crsData_<hostname>.tar.gz, 
ocrData_<hostname>.tar.gz, 
oraData_<hostname>.tar.gz, 
os_<hostname>.tar.gz 

5) Please upload alert.log from DB and ASM instances from all DB nodes 

6) Grid home logs from all DB nodes. 

% cd $grid_home/log/<hostname> 
% find . | xargs grep '2011-11-29' -sl 

or 

cd $GRID_HOME/log 
tar -cvzf node1.tar.gz * 

7) Cell trace & logs 

Please upload logs from the problem cell, covering the problem time. 

% cd $ADR_BASE/diag/asm/cell/<hostname>/trace/ 
% egrep -lr 2012-09-06 * | xargs tar cvfz `hostname -s`_cell_diag.tar.gz 

8) Please upload the Storage Cell alert.log and Storage Cell alert history from all cell nodes: 

a.) /opt/oracle/cell/log/diag/asm/cell/{node name}/trace/alert.log 
b.) # cellcli -l show alerthistory 

9) OS Watcher information: 

location of logs==> /opt/oracle.oswatcher/osw/archive 

cd /opt/oracle.oswatcher/osw/archive 
find . -name '*12.09.05*' -exec zip /tmp/osw_`hostname -a`.zip {} \; 

cd /opt/oracle.oswatcher/osw/archive - 
->Change date and time covering problem time. 
find . -name "*11.10.13.1[8-9]00*" -exec zip /tmp/osw_`hostname -a`_0503.zip {} \; 
-->please do not miss ';' at the last in the above command 

To get OS watcher data of specific date : 
cd /opt/oracle.oswatcher/osw/archive 
find . -name '*12.01.13*' -print -exec zip /tmp/osw_`hostname`.zip {} \; 
where 12- year 01- Month 13-day 

10) Linux OFA & kernel information 

a) dcli -l root -c <list of dbnodes> "rpm -qa | grep ofa" 
b). dcli -l root -c <list of dbnodes> "uname -a" 

11) Linux crash file information 

# to check the possibility of LINUX PROBLEM 

grep path /etc/kdump.conf will give the core file location 

Please check for a vmcore/crashcore that may have been generated during the reboot. 

The cells and database servers of Oracle Exadata Database Machine are configured to generate Linux kernel crash core files when there is a Linux crash. 
Common locations are /var/crash or /u01/crashfiles on database nodes, and /var/log/oracle/crashfiles on cell nodes. 

Start in this file: 
cat /etc/kdump.conf 
Look for the uncommented filesystem (ext3) and path. 
example: 
ext3 /dev/md11 
path /crashfiles 

The location is the /crashfiles folder from the mount point of /dev/md11. 
Then find where md11 is mounted: 
df 

example: 
/dev/md11 2395452 280896 1992872 13% /var/log/oracle 

Then change to that directory: 
cd /var/log/oracle/crashfiles 


12) SOS report 

sosreport requires root permissions to run. 

# /usr/sbin/sosreport 
The sosreport will run for several minutes; depending on the system, the running time may be longer. 

Once completed, sosreport will generate a compressed bz2 file under /tmp. 


13) Run the infinicheck command (from any of the DB nodes): 

# /opt/oracle.SupportTools/ibdiagtools/infinicheck -g /opt/oracle.SupportTools/onecommand/dbs_ib_group |tee /tmp/infinicheck_`hostname`.log 

where dbs_ib_group contains the hostnames/IPoIB of the compute nodes part of the cluster. It has to be using the IB subnet. 

Note: The location for dbs_ib_group is for Oracle Sun DB Machine. For other systems, specify the location of the file 


14) Run sundiag.sh

The script needs to be executed as root on the Exadata Storage server having disk problems and sometimes also on Db nodes or Storage Servers for some other hardware issues.

When gathering sundiag.sh output across a whole rack using dcli, the outputs may end up with the same tarball name and overwrite each other upon unzipping. To avoid this, use the following from DB01:

1. [root@exadb01 ~]# cd /opt/oracle.SupportTools/onecommand (or wherever the all_group file is with the list of the rack hostnames)

2. [root@exadb01 onecommand]# dcli -g all_group -l root /opt/oracle.SupportTools/sundiag.sh 2>&1
<this will take up to about 2 minutes>

3. Verify there is output in /tmp on each node:
[root@exadb01 onecommand]# dcli -g all_group -l root --serial 'ls -l /tmp/sundiag* '


For gathering OS Watcher data alongside sundiag: 

# /opt/oracle.SupportTools/sundiag.sh osw

Execution will create a date-stamped tar.bz2 file in /tmp/sundiag_/tar.bz2 including the OS Watcher archive logs. These logs may be very large.


# /opt/oracle.SupportTools/sundiag.sh -h

Oracle Exadata Database Machine - Diagnostics Collection Tool

Version: 1.5.1_20140521

Usage: ./sundiag.sh [osw] [ilom | snapshot]
   osw      - Copy ExaWatcher or OSWatcher log files (Can be several 100MB's)
   ilom     - User level ILOM data gathering option via ipmitool, in place of
              separately using root login to get ILOM snapshot over the network.
   snapshot - Collects node ILOM snapshot- requires host root password for ILOM
              to send snapshot data over the network.


Exadata Diagnostic Collection Guide (Doc ID 1353073.2)