Skip to main content

Oracle error 1013 during filling up usn avail cache | Error 3113


Oracle error 1013 during filling up usn avail cache:
One of the Oracle RAC database environment, I used support for Oracle 11gR2 was facing performance issue. User were not able to insert record in a timely manner. Update was also very slow across entire RAC environment. Customer experience was not satisfactory and system wide performance was impacted. Time to generate AWR, ASH for the Oracle RAC environment.
Fun was started, after I looked at the top timed events in AWR, I was able to see the wait class for concurrency, cluster and other, such as Concurrency - row cache lock, Cluster - gc buffer busy acquire, Cluster - gc buffer busy acquire, Other - enq: US - contention, Cluster - gc current block busy, Other - enq: TA - contention. Top timed Foreground Events was also reflecting same wait event in the same class. Top Timed Background Event was System I/O, control file sequential read.
SysStat and Global messaging – RAC was reflecting high GCS received on one of the node, near to double compare to other nodes.
Let us explore the SQL ordered by Elapsed Time. Highest elapsed time was for update, insert and select. 
Update statement:
update /*+ rule */ undo$ set name=:2, file#=:3, block#=:4, status$=:5, user#=:6, undosqn=:7, xactsqn=:8, scnbas=:9, scnwrp=:10, inst#=:11, ts#=:12, spare1=:13 where us#=:1 
was the statement with highest elapsed time.
What the hack is going on? What is happening? Update statement - update undo$ was consuming near to 50% cpu time. Update statement – update $undo was facing highest cluster wait time. Segment statistics in global AWR and Segment by Global Cache Buffer Busy for each node was reflecting UNDO$ with highest GC Buffer Busy, and % of capture was more than 90%.
When we execute DML (insert, update, delete) Oracle server process has to write undo, so in case of rollback, original data is available in undo. Nothing wrong with the update statement – update /*rule */ redo, but the higher cluster wait event is the pain. Because of Bug in the 11gR2, require no. of online rollback segments are not available, so there is a contention. 
By setting _rollback_segment_count to 10000 you can force system to keep 10000 rollback segment online, so DML activity against database can be perform efficiently. It's a feature of ACID compliant RDBMS. You will be able to find several bugs in Oracle support related to rollback segments also for different situation there is different parameter, so make sure to work with Oracle support while setting hidden parameter / underscore parameter for your Oracle database environment. So undo / rollback segments are still pain in the neck for the system with higher no. of DML activity / higher no. of concurrent transaction.
Each instance alert.log file was reflecting “Error 1013 during filling up usn avail cache”, and “Error 3113 during filling up usn avail cache”.
I had enough information to troubleshoot the performance of the database environment. Refer Oracle support Notes: 
Alert log file - Error 1013 during filling up usn avail cache" report in instance alert log (Doc ID 1630507.1)IF: Undo Wait Event - Enq: US - contention (Doc ID 1951680.1)
How to correct performance issues with enq: US - contention related to undo segments (Doc ID 1332738.1)
After spending some time, I was able to find out it’s Oracle Bug and couple of bugs are associated with same issues.
Bug 17560426 : ORA-1013 DURING FILLING UP USN AVAIL CACHE IN OTHER NODES AFTER INSTANCE CRASH.
Bug 17905112 : HIGH US, TA CONTENTION AND ERROR 1013 DURING FILLING UP USN AVAIL CACHE ON RAC.
Bug 21818185 : DB HUNG WITH ERROR 1013 DURING FILLING UP USN AVAIL CACHE IN ALERT.LOG.
Bug 19545828 : ERROR 1013 DURING FILLING UP USN AVAIL CACHE.
Bug 20010589 : DATABASE HANGS WITH ENQ: TA CONTENTION AS THE TOP WAIT EVENT.
Bug 20440571 : FIX THE ERROR HANDLING CODE IN KTUSMFILLAVAILCACHE IN 11.2.
Temporary work around might work by setting hidden / underscore parameter _rollback_segement_count to 10000. The no. of rollback segments depends on no. of transaction executed against database. Make sure you derive no. near to real time otherwise same issue, you might need to increase it again. 
Because of _rollback_segement_count your alert.log file will filled with the messages SMON: online AUM segments … and SMON: number of AUM segments … Another option is to apply the 11gR2 patchset.
Doc ID 1562142.1 – Document 11.2.0.4 Patch Set - List of Bug Fixes by Problem Type (Doc ID 1562142.1), has listed “13508485 – JDBC Statement.cancel can cause ORA-1013 from a Rollback”, under JDBC category. Same issue is fixed in 12.1.0.1 base release also.
Wish you wonderful time to get the control of your RAC environment, and troubleshoot Oracle RAC performance issue.

Comments

Popular posts from this blog

MySQL InnoDB cluster troubleshooting | commands

Cluster Validation: select * from performance_schema.replication_group_members; All members should be online. select instance_name, mysql_server_uuid, addresses from  mysql_innodb_cluster_metadata.instances; All instances should return same value for mysql_server_uuid SELECT @@GTID_EXECUTED; All nodes should return same value Frequently use commands: mysql> SET SQL_LOG_BIN = 0;  mysql> stop group_replication; mysql> set global super_read_only=0; mysql> drop database mysql_innodb_cluster_metadata; mysql> RESET MASTER; mysql> RESET SLAVE ALL; JS > var cluster = dba.getCluster() JS > var cluster = dba.getCluster("<Cluster_name>") JS > var cluster = dba.createCluster('name') JS > cluster.removeInstance('root@<IP_Address>:<Port_No>',{force: true}) JS > cluster.addInstance('root@<IP add>,:<port>') JS > cluster.addInstance('root@ <IP add>,:<port> ') JS > dba.getC...

MySQL slave Error_code: 1032 | MySQL slave drift | HA_ERR_KEY_NOT_FOUND

MySQL slave Error_code: 1032 | MySQL slave drift: With several MySQL, instance with master slave replication, I have one analytics MySQL, environment which is larger in terabytes, compared to other MySQL instances in the environment. Other MySQL instances with terabytes of data are running fine master, slave replication. But this analytics environment get started generating slave Error_code :1032. mysql> show slave status; Near relay log: Error_code: 1032; Can't find record in '<table_name>', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log <name>-bin.000047, end_log_pos 5255306 Near master section: Could not execute Update_rows event on table <db_name>.<table_name>; Can't find record in '<table_name>', Error_code: 1032; Can't find record in '<table_name>', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log <name>-bin.000047, end_l...

InnoDB cluster Remove Instance Force | Add InnoDB instance

InnoDB cluster environment UUID is different on node: To fix it stop group replication, remove instance (use force if require), add instance back Identify the node which is not in sync: Execute following SQL statement on each node and identify the node has different UUID on all nodes. mysql> select * from mysql_innodb_cluster_metadata.instances; Stop group replication: Stop group replication on the node which does not have same UUID on all nodes. mysql > stop GROUP_REPLICATION; Remove instances from cluster: Remove all secondary node from the cluster and add them back if require. $mysqlsh JS >\c root@<IP_Address>:<Port_No> JS > dba.getCluster().status() JS > dba.getCluster () <Cluster:cluster_name> JS > var cluster = dba.getCluster("cluster_name"); JS >  cluster.removeInstance('root@<IP_Address>:<Port_No>'); If you get "Cluster.removeInstance: Timeout reached waiting......" JS > cluster.removeInstance(...