Data Guard Redo Apply and Media Recovery Best Practices 10g

This morning I was asked to check a 10gR2 physical standby instance that was slow to apply redo, and the lag was growing rapidly. My initial check of the alert.log showed that each 250 MB redo log file was taking 9 to 12 minutes to apply:


Mon May  5 06:03:17 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902209.arc
Mon May  5 06:13:20 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902210.arc
Mon May  5 06:23:07 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902211.arc
Mon May  5 06:32:51 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902212.arc
Mon May  5 06:42:40 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902213.arc
Mon May  5 06:52:01 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902214.arc
Mon May  5 07:00:52 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902215.arc
Mon May  5 07:09:58 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902216.arc
..
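
Besides the alert.log timestamps, the size of the gap can be quantified from SQL*Plus on the standby with a simple v$archived_log query; a minimal sketch (run as SYSDBA, 10gR2 column names):

-- last archived vs. last applied sequence per thread on the standby
SELECT arch.thread#,
       arch.max_seq AS last_archived,
       appl.max_seq AS last_applied,
       arch.max_seq - appl.max_seq AS gap
  FROM (SELECT thread#, MAX(sequence#) max_seq
          FROM v$archived_log
         GROUP BY thread#) arch,
       (SELECT thread#, MAX(sequence#) max_seq
          FROM v$archived_log
         WHERE applied = 'YES'
         GROUP BY thread#) appl
 WHERE arch.thread# = appl.thread#;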
    

After a quick look from SQL*Plus as SYSDBA on the unix server, the Top 5 waits were:


PX Deq: Test for msg   (4.5x amount)
free buffer waits                                                  
log file sequential read                                           
PX Deq Credit: send blkd                                           
checkpoint completed                               
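
These waits came from nothing fancier than the dictionary; a minimal sketch of the kind of top-waits query I mean (non-idle events from v$system_event, 10gR2 column names, time_waited is in centiseconds):

-- Top 5 non-idle wait events since instance startup
SELECT *
  FROM (SELECT event, total_waits, time_waited
          FROM v$system_event
         WHERE wait_class <> 'Idle'
         ORDER BY time_waited DESC)
 WHERE ROWNUM <= 5;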
     

After going through Note 387343.1 and the Data Guard Redo Apply and Media Recovery Best Practices 10gR2 paper, I recommended increasing the buffer cache from 512 MB to 6 GB, reducing the recovery parallelism from 20 to 8, and setting some additional parameters:


*.db_cache_size=6G               
*.log_buffer=78643200
*.large_pool_size=2G 
*.pga_aggregate_target=10G
*.workarea_size_policy='AUTO'
*.parallel_min_servers=360
*.parallel_max_servers=520
*.parallel_threads_per_cpu=2
*.optimizer_mode=ALL_ROWS 
*.optimizer_dynamic_sampling=4
*.cursor_sharing=EXACT

RECOVER MANAGED STANDBY DATABASE PARALLEL 8;  
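
The *. entries above go into the standby's parameter file, so the change needed an instance bounce before redo apply could be restarted with the lower degree of parallelism. Roughly the following sequence from SQL*Plus as SYSDBA, sketched for managed recovery (the alert.log below shows the manual RECOVER AUTOMATIC ... STANDBY DATABASE variant was actually in use here, so adapt to your own apply method):

-- stop redo apply and bounce the standby so the new memory parameters take effect
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
SHUTDOWN IMMEDIATE
STARTUP MOUNT

-- restart redo apply with the reduced degree of parallelism
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE PARALLEL 8 DISCONNECT FROM SESSION;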
     

After the restart I monitored the apply process from the alert.log again; now the database was able to apply a 250 MB redo log file within 30 seconds to 1 minute:


Mon May  5 09:01:07 2008
ALTER DATABASE RECOVER  automatic from '/assos_arcdir' standby database PARALLEL 8  
Mon May  5 09:01:07 2008
Media Recovery Start
Managed Standby Recovery not using Real Time Apply
 parallel recovery started with 8 processes
Mon May  5 09:01:52 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902217.arc
Mon May  5 09:02:29 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902218.arc
Mon May  5 09:03:41 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902219.arc
Mon May  5 09:04:39 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902220.arc
Mon May  5 09:05:33 2008
Media Recovery Log /assos_arcdir/arch_rtx_1_445653769_0000902221.arc
..
     

And the Top 5 waits were now:


PX Deq: Test for msg   (x amount)
log file sequential read 
checkpoint completed 
db file parallel write 
latch: checkpoint queue latch
    

As for my lessons learned, I really want to highlight two things; yes, these two again and again:

1- If you have enough physical memory, benefit from it to the fullest; do not run your production databases with a small SGA (critical for OLTP) or a small PGA (critical for OLAP), especially after 9iR2, where large SGAs no longer cause the latching problems of the old days. Come on, even on laptops we use SGAs of several GBs nowadays.

2- Parallelism does not mean faster. I benefit from reducing the degree of parallelism more often than from increasing it; a degree of 4 or 8 is nearly always enough for my needs. I hope Oracle invests more in the automatic parallel tuning feature in the near future, which is FALSE by default on 10gR2 (a quick check is sketched right after this list), so that like automatic undo or automatic PGA management, automatic parallel tuning also takes its place among my most liked automatic tuning Oracle features :)
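
A quick way to check where you stand on the parallelism side, sketched for 10gR2 from SQL*Plus:

-- automatic parallel tuning is FALSE by default on 10gR2 (and deprecated)
SHOW PARAMETER parallel_automatic_tuning

-- how many parallel execution servers are started / in use right now?
SELECT statistic, value
  FROM v$px_process_sysstat
 WHERE statistic LIKE 'Servers%';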

Here are some of the SQL scripts I used during the above troubleshooting from SQL*Plus; as in the standby case above, sometimes we only have SQL*Plus and some simple v$ queries to gather the required troubleshooting information.
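
For the standby side specifically, simple queries along these lines cover most of it; a minimal sketch against v$managed_standby, v$recovery_progress and v$dataguard_stats (all present on 10gR2, run on the standby as SYSDBA):

-- standby processes and what they are doing (MRP0 shows up when managed recovery is running)
SELECT process, status, thread#, sequence#, block#, blocks
  FROM v$managed_standby;

-- media recovery progress and apply rates
SELECT item, units, sofar, total
  FROM v$recovery_progress;

-- apply and transport lag as computed on the standby
SELECT name, value, unit, time_computed
  FROM v$dataguard_stats
 WHERE name IN ('apply lag', 'transport lag', 'apply finish time');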