ORA-00353: during startup of instance from archiver process

When starting one of the databases we received the ORA-00353 error. It indicates that archiver process couldn’t archive one of the redo logs due to corruption. Oracle tried to read both the logmembers from the same group but they both were corrupted and instance crashes when trying to open it.

ARC0: STARTING ARCH PROCESSES COMPLETE
ARCH: Log corruption near block 88727 change 12629319 time ?
CORRUPTION DETECTED: thread 1 sequence 221 log 2 at block 88727. Arch found corrupt blocks
Errors in file i:\db\oracle\testdb\diagnostic_dest\diag\rdbms\testdb_a\testdb\trace\testdb_ora_3796.trc (incident=25175):
ORA-00353: log corruption near block 88727 change 12629319 time 05/07/2012 21:13:47
ORA-00312: online log 2 thread 1: ‘I:\DB\ORACLE\TESTDB\ORADATA\REDO02B.LOG’
ORA-00312: online log 2 thread 1: ‘J:\DB\ORACLE\TESTDB\ORADATA\REDO02A.LOG’

USER (ospid: 3796): terminating the instance due to error 16038
Instance terminated by USER, pid = 3796

So to confirm the first step was after database was open in mount state to verify both online redo logfiles in the group had corruption.

SQL> ALTER SYSTEM DUMP LOGFILE ‘I:\DB\ORACLE\TESTDB\ORADATA\REDO02B.LOG’;
ALTER SYSTEM DUMP LOGFILE ‘I:\DB\ORACLE\TESTDB\ORADATA\REDO02B.LOG’
*
ERROR at line 1:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 88727 change 12629410 time 05/08/2012
09:01:28
ORA-00334: archived log: ‘I:\DB\ORACLE\TESTDB\ORADATA\REDO02B.LOG’

SQL> ALTER SYSTEM DUMP LOGFILE ‘J:\DB\ORACLE\TESTDB\ORADATA\REDO02A.LOG’;
ALTER SYSTEM DUMP LOGFILE ‘J:\DB\ORACLE\TESTDB\ORADATA\REDO02A.LOG’
*
ERROR at line 1:
ORA-00354: corrupt redo log block header
ORA-00353: log corruption near block 88727 change 12629410 time 05/08/2012
09:01:28
ORA-00334: archived log: ‘I:\DB\ORACLE\TESTDB\ORADATA\REDO02A.LOG’

In this case both files had corruption but note if only of them had corruption you one can copy the file that is no corruption to the corrupted one.

After it was confirmed both files had corruption, using the following steps one can startup the database it will require changing the database to noarchivelog mode so the archiver process doesn’t try to archive the online redo logs. Note: You should take a full backup before you start this step and after this step is done.

SQL> shutdown immediate;
ORA-01109: database not open
Database dismounted.
ORACLE instance shut down.

SQL> startup mount;
ORACLE instance started.

Total System Global Area 388354048 bytes
Fixed Size 2175968 bytes
Variable Size 331353120 bytes
Database Buffers 50331648 bytes
Redo Buffers 4493312 bytes
Database mounted.

SQL> ALTER DATABASE NOARCHIVELOG;
Database altered.

SQL> ALTER DATABASE OPEN;
Database altered.

SQL> shutdown immediate;
Database closed.
Database dismounted.
ORACLE instance shut down.

SQL> startup nomount;
ORACLE instance started.
Total System Global Area 388354048 bytes
Fixed Size 2175968 bytes
Variable Size 331353120 bytes
Database Buffers 50331648 bytes
Redo Buffers 4493312 bytes

SQL> alter database mount;
Database altered.

SQL> alter database archivelog;
Database altered.

SQL> alter database open;
Database altered.

How to find and fix block corruption using RMAN?

One of the scenario we ran into when one of the data file reported there was block corruption as RMAN reported the following message in its logs. Note: This scenario was done on 9.2.0.1 running Linux.

RMAN-03009: failure of backup command on ch1 channel at 07/02/2009 04:27:06
ORA-19566: exceeded limit of 0 corrupt blocks for file /u01/oradata/TESTDB/TEST_data_01.dbf

And the alert.log (alert_TESTDB.log) also had the following message by RMAN
Reread of blocknum=443343, file=/u01/oradata/TESTDB/TEST_data_01.dbf. found same corrupt data
***
Corrupt block relative dba: 0x0346c3cf (file 13, block 443343)
Bad check value found during backing up datafile
Data in bad block –
type: 6 format: 2 rdba: 0x0346c3cf
last change scn: 0x0000.32a8f165 seq: 0x1 flg: 0x04
consistency value in tail: 0xf1650601
check value in block header: 0xeb4f, computed block checksum: 0xec16
spare1: 0x0, spare2: 0x0, spare3: 0x0

So to double check one can ran dbverify (dbv) or RMAN to validate the datafile which confirmed data corruption and the datafile# and the block#.
$ dbv blocksize=8192 file=/u01/oradata/TESTDB/TEST_data_01.dbf > /tmp/TEST_data_01_file.log 2>&1

DBVERIFY – Verification starting : FILE = /u01/oradata/TESTDB/TEST_data_01.dbf
Page 443343 is marked corrupt
***
Corrupt block relative dba: 0x0346c3cf (file 13, block 443343)
Bad check value found during dbv:
Data in bad block –
type: 6 format: 2 rdba: 0x0346c3cf
last change scn: 0x0000.32a8f165 seq: 0x1 flg: 0x04
consistency value in tail: 0xf1650601
check value in block header: 0xeb4f, computed block checksum: 0xec16
spare1: 0x0, spare2: 0x0, spare3: 0x0
***

DBVERIFY – Verification complete

Total Pages Examined : 486400
Total Pages Processed (Data) : 473439
Total Pages Failing (Data) : 0
Total Pages Processed (Index): 0
Total Pages Failing (Index): 0
Total Pages Processed (Other): 10
Total Pages Processed (Seg) : 0
Total Pages Failing (Seg) : 0
Total Pages Empty : 12950
Total Pages Marked Corrupt : 1
Total Pages Influx : 0

And using RMAN one can validate the same thing by checking the view v$database_block_corruption after running the script below.

RMAN> connect target /

RMAN> backup validate check logical database;

It reported the same block as show below:

SQL> SELECT * FROM v$database_block_corruption;

FILE# BLOCK# BLOCKS CORRUPTION_CHANGE# CORRUPTIO
———- ———- ———- —————— ———
13 443343 1 0 FRACTURED

After it was confirmed there was only one datafile with one block corruption using block recover we were able to recover the datafile without shutting down the database. The channel allocated was the same that was specified when backing up the data file.

connect target /
connect catalog rmancatalog/rmancatalog@catalog
run {
allocate channel…;
blockrecover datafile 13 block 443343;
release channel …;
}

Log file:
Starting blockrecover at 02-JUL-09

channel ch1: restoring block(s)
channel ch1: specifying block(s) to restore from backup set
restoring blocks of datafile 00013
channel ch1: restored block(s) from backup piece 1
piece handle=g0kirlhe_1_1 tag=TAG20090630T040048 params=NULL
channel ch1: block restore complete

starting media recovery

archive log thread 1 sequence 30644 is already on disk as file /u01/oradata/TESTDB/arch/1_30644.dbf
archive log thread 1 sequence 30645 is already on disk as file /u01/oradata/TESTDB/arch/1_30645.dbf
channel ch1: starting archive log restore to default destination
channel ch1: restoring archive log
archive log thread=1 sequence=30643
channel ch1: restored backup piece 1
piece handle=hlkj27vr_1_1 tag=TAG20090701T160827 params=NULL
channel ch1: restore complete
media recovery complete
Finished blockrecover at 02-JUL-09
released channel: ch1

Recovery Manager complete.

And after the block was restored we validated the datafile through RMAN and v$database_block_corruption reported no records found.

RMAN> connect target /
RMAN> backup validate check logical datafile 13;

Starting backup at 02-JUL-09
using target database controlfile instead of recovery catalog
allocated channel: ORA_DISK_1
channel ORA_DISK_1: sid=22 devtype=DISK
channel ORA_DISK_1: starting full datafile backupset
channel ORA_DISK_1: specifying datafile(s) in backupset
input datafile fno=00013 name=/u01/oradata/TESTDB/TEST_data_01.dbf
channel ORA_DISK_1: backup set complete, elapsed time: 00:07:35
Finished backup at 02-JUL-09

sys@TESTDB> select * FROM v$database_block_corruption;

no rows selected

How to check physical and logical data corruption using RMAN?

If you are not using RMAN to backup data and use technologies like NetApp snapshot which doesn’t check for physical and logical corruption, one can still use RMAN to check for physical and logical corruption in the datafiles and not back the data up.

connect target /

run {
# set disk to be default device type
CONFIGURE DEFAULT DEVICE TYPE TO DISK;
# number of concurrent sessions to spawn
CONFIGURE DEVICE TYPE DISK PARALLELISM 10 BACKUP TYPE TO BACKUPSET;
# check datafiles for corruption 10 datafiles in each session
BACKUP VALIDATE CHECK LOGICAL DATABASE FILESPERSET=10;
}

Corruptions information are recorded in the V$DATABASE_BLOCK_CORRUPTION view.

How to fix Online redo log corruption?

Today, due to a network issue the host (Solaries) lost contact with the NetApp filer and it corrupted some datafiles and redo log files.
The solution was to recreate all indexes from the corrupted datafile in a new tablespace and drop the corrupted tablespace.
The archived log process hung because it could not archive the log 464704 which got corrupted, the solution was to clear the redo log and manually switch the logs to test the fix.
1- Archived log hung because the online redo log was corrupted and could not be archived.

SQL> select GROUP#,THREAD#,SEQUENCE#,BYTES,MEMBERS,ARC,STATUS from v$log;

GROUP# THREAD# SEQUENCE# BYTES MEMBERS ARC STATUS
———- ———- ———- ———- ———- — —————-
1 1 464713 104857600 2 NO CURRENT
2 1 464711 104857600 2 NO INACTIVE
3 1 464709 104857600 2 NO INACTIVE
4 1 464704 104857600 2 NO INACTIVE
5 1 464712 104857600 2 NO INACTIVE
6 1 464710 104857600 2 NO INACTIVE

6 rows selected.

2- When tried to clear the online redo log it wouldn’t let us do it because of the datafile corrupted as well.

SQL> ALTER DATABASE CLEAR UNARCHIVED LOGFILE GROUP 4;
ALTER DATABASE CLEAR UNARCHIVED LOGFILE GROUP 4
*
ERROR at line 1:
ORA-00393: log 4 of thread 1 is needed for recovery of offline datafiles
ORA-00312: online log 4 thread 1: ‘/oracle/EPB/WISPRD/redo/redo2/redo4b.log’
ORA-00312: online log 4 thread 1: ‘/oracle/EPB/WISPRD/redo/redo1/redo4a.log’
ORA-01110: data file 281: ‘/oracle/EPB/WISPRD/data/cust_index_jul2007.dbf’

3- Found an additional clause to force the CLEAR.

SQL> ALTER DATABASE CLEAR UNARCHIVED LOGFILE GROUP 4 UNRECOVERABLE DATAFILE ;

Database altered.

4- Online redo log group is now clear but keep in mind that we don’t have that archive log (464704) and recoverability is compromised at this point. We need to take a full snap backup.

SQL> select GROUP#,THREAD#,SEQUENCE#,BYTES,MEMBERS,ARC,STATUS from v$log;

GROUP# THREAD# SEQUENCE# BYTES MEMBERS ARC STATUS
———- ———- ———- ———- ———- — —————-
1 1 464713 104857600 2 NO INACTIVE
2 1 464711 104857600 2 YES INACTIVE
3 1 464715 104857600 2 NO CURRENT
4 1 0 104857600 2 YES UNUSED
5 1 464712 104857600 2 YES INACTIVE
6 1 464714 104857600 2 NO INACTIVE
6 rows selected.

SQL> alter system switch logfile;

System altered.

SQL> /

System altered.

SQL> /

System altered.

SQL> /

System altered.

SQL> /

System altered.

SQL> /

System altered.

SQL> select GROUP#,THREAD#,SEQUENCE#,BYTES,MEMBERS,ARC,STATUS from v$log;

GROUP# THREAD# SEQUENCE# BYTES MEMBERS ARC STATUS
———- ———- ———- ———- ———- — —————-
1 1 464719 104857600 2 YES INACTIVE
2 1 464717 104857600 2 YES INACTIVE
3 1 464715 104857600 2 YES INACTIVE
4 1 464720 104857600 2 YES ACTIVE
5 1 464718 104857600 2 YES INACTIVE
6 1 464721 104857600 2 NO CURRENT

6 rows selected.
SQL> DROP TABLESPACE CUST_INDEX_JUL2007 INCLUDING CONTENTS;

Tablespace dropped.
SQL> select distinct status from v$datafile;

STATUS
——-
ONLINE
SYSTEM

After all this a full backup was taken.

A hope that helps…