一、log file sync等待的原因
1. commit等待太久
(1)用户commit:v$sesstat视图的统计数据;
(2)DDL操作:递归事务提交;
(3)数据字典的递归DML操作。
2. rollback等待太久
(1)用户回滚:用户提交ROLLBACK命令
(2)事务回滚:内部回滚、空间的分配(ASSM问题)、事务中断、kill session。

二、commit与log file sync的关系
时间轴 ——————————————————–>
① ⑥
前台进程 —->| ↑—->
| |② | |
| ↓ ⑤| |
LGWR | —->| ↑—-> |
| | |③ ④| | |
| | ↓ | | |
IO | | —————————–> | |
| | | |
| | | |
| |—– log file parallel write —–| |
| |
|————— log file sync —————-|
① 用户提交COMMIT命令;
② 前台进程触发LGWR;
③ LGWR发起物理写入的系统请求;
④ 系统物理写人完成;
⑤ LGWR返回信息给前台进程;
⑥ COMMIT命令完成。

三、log file sync与磁盘IO速度的关系
时间轴 ——————————————————–>

前台进程 —->| ↑—->
| |
↓ |
LGWR —->| ↑—->
| |
↓ |
IO —————————–>
物理写的IO(log file Parallel)消耗最长时间

四、log file sync与潜在的问题
时间轴 ——————————————————–>

前台进程 —->| ↑—->
↓ ② ⑥ |
—->|③ ↑–>|
↓ ⑤ |
LGWR —-> ↑—->
| |
| ④ |
| ↑——–>
↓ |
IO ———–>

① 用户提交COMMIT命令;
② LGWR等待CPU队列;
③ LGWR提交IO请求,进入sleep状态;
④ IO完成,OS将LGWR推入CPU队列;
⑤ LGWR在CPU上运行,返回前台进程;
⑥ 前台进程获取返回信息,进入CPU队列。

五、log file sync流
1. 前台进程触发LGWR,进入sleep状态。
(1)“log file sync”等待开始;
(2)触发动作通过UNIX/LINUX操作系列信号量完成。
2. LGWR醒来,进入CPU队列。
(1)提交IO请求;
(2)LGWR进入sleep状态,“log file parallel write”等待。
3. 硬件完成IO,OS唤醒LGWR。
(1)LGWR进入CPU;
(2)标记“log file parallel write”事件完成,返回前台进程。
4. 前台进程被LGWR唤醒。
(1)前台进程进入CPU,完成“log file sync”等待。

六、Redo, commit related latches and “tuning”
1. Redo related latches
(1)redo allocation latches
– Protect allocating space in log buffer / RBA ranges in redolog stream
(2)redo copy latches
– Used only for keeping track of whether anyone’s copying data into redo log buffer
– …so that LGWR would know to wait for these memory copies to complete before it tries to write buffers to disk
– LGWR will wait for LGWR wait for redo copy wait event in such cases
– Used to be tuned by _log_simultaneous_copies
(3)Should we “tune” any of these?
– No, we should fix only problems which exist
– In other words, if wait interface doesn’t show anyone waiting for them, then don’t bother “tuning” them!

七、Instrumentation
(1)Wait Events:
– log file sync
– log file parallel write
– log file single write
(2)Performance Counters (V$SESSTAT, V$SYSSTAT)
– redo size
– redo writing time
– user commits
– user rollbacks
– transaction rollbacks

八、Wait event: log buffer space
1. (Not a commit problem)
2. LGWR is too slow flushing redo log buffer contents to disk
(1)Either because too slow IO subsystem
(2)Or LGWR not getting enough (quality) CPU time
(3)Sometimes pops up due large (unplanned) transactions
3. Of course, it can also be because of a too small log buffer
(1)Which is not the case anymore in modern days
(2)Log buffer is usually multiple MB due how it is allocated from SGA
(3)You shouldn’t even set the log_buffer parameter in 10g+

九、Wait event: log file single write
1. Single block redo IO is used mostly for logfile header block reading/writing
(1)Log switch is the main cause
(2)Archiving as well as it updates log header
(3)Who wait: LGWR & ARCH
2. Example of what LGWR does during a log switch:

WAIT #0: nam='log file sequential read' ela= 12607 log#=0 block#=1
WAIT #0: nam='log file sequential read' ela= 21225 log#=1 block#=1
WAIT #0: nam='control file sequential read' ela= 358 file#=0
WAIT #0: nam='log file single write' ela= 470 log#=0 block#=1
WAIT #0: nam='log file single write' ela= 227 log#=1 block#=1

十、Better breakdown of wait times
1. V$EVENT_HISTOGRAM
(1)Instead of a single wait time average, breaks wait times into buckets

SQL> select event, wait_time_milli,wait_count
2 from v$event_histogram
3 where event = 'log file parallel write';
EVENT WAIT_TIME_MILLI WAIT_COUNT
------------------------- --------------- ----------
log file parallel write 1 22677
log file parallel write 2 424
log file parallel write 4 141
log file parallel write 8 340
log file parallel write 16 1401
log file parallel write 32 812
log file parallel write 64 391
log file parallel write 128 21
log file parallel write 256 6

十一、Bugs, problems
1. No instrumentation
(1)On some version/platform/IO configuration the wait interface doesn’t record log file parallel write waits at all
– The same goes for db file parallel writes
– (I’ve noticed it on 9.2-10.2.0.x on Solaris for example)
(2)For LGWR you can use V$SESSTAT redo write time statistic instead
– It’s in centiseconds
2. 1-second log file sync bug
(1)Most log file syncs took ~1 second to complete
(2)The posts sent back by LGWR were missed by foreground process
(3)Thus the FG always waited until the 1 second log file sync wait timeout happened

十二、Tuning
1. No need for tuning!
(1)Log buffer is quite large by default
– All memory remaining in a granule after the allocation for fixed SGA is given to log buffer
(2)Oracle used to have a single redo log buffer until v9.0
– Redo allocation latch could become the ultimate contention point
(3)Since 9.2, Oracle can have the log buffer split into multiple buffers
– Each protected by a separate redo allocation latch
(4)From 10g, Oracle can keep lots of small private redo strands in shared pool
– Each protected by a separate redo allocation latch
– @rs.sql – Show redo strands available

十三、Evil tuning
1. If you don’t care about the D in ACID (and want to occasionally lose data for fun), then:
(1)10gR1:
– commit_logging transaction commit log write behavior
(2)10gR2:
– commit_write transaction commit log write behavior
– commit_wait transaction commit log wait behavior
(3)Old undocumented stuff
– _wait_for_sync wait for sync on commit MUST BE ALWAYS TRUE
(4)Old: Put redologs to /tmp (on Solaris) or in-memory disks (/dev/shm) for duration of a migration/upgrade
– If your OS / server crashes, you’ll need to restore from a backup!

十四、Optimizations for working around bad applications
1. Commit optimization
(1)In PL/SQL since Oracle 9i
(2)The log file sync is deferred until the end of the PL/SQL call!

SQL> exec while true loop update t set a=a+1 ; commit ; end loop;

(3)No log file sync waits
(4)log buffer space / log file switch completion waits more likely!

十五、LGWR configuration – CPU
1. Prevent priority decay
(1)Put LGWR into fixed priority scheduling class (FX60 on Solaris)
(2)LGWR should get onto CPU faster when waking up
(3)LGWR isn’t thrown off CPU as likely
2. If LGWR is still experiencing significant scheduling latency
(1)You can put LGWR into a higher priority class
3. You should not put LGWR into the highest real-time class
(1)Real time is tricky – your process can monopolize a CPU for itself
(2)You don’t want to make LGWR pre-empt the OS kernel!
(3)Note that Oracle sets some processes into higher priority by default:
_high_priority_processes LMS*|VKTM

十六、LGWR configuration – IO
1. Reduce the amount of work and waiting a “log file parallel write” has to do
(1)Unbuffered concurrent IO
– Verify with truss/strace whether proper flags are used
– (O_DIRECT, O_DIO, O_CIO etc)
(2)Or use raw devices
– ASM is essentially a raw device
(3)Or ODM for some cases
2. And optimize the whole IO hardware stack, of course!
(1)Note shat mid-large size storage arrays do have write cache built in
(2)So, moving redo log files to SSD may not give any advantage!
(3)Verify what’s your current “log file parallel write” latency using v$event_histogram

Trackback

no comment untill now

Add your comment now

切换到手机版