Working with Latches
What Are Latches?
Latches are serialization mechanisms that protect areas of Oracle’s shared memory (the
SGA). In simple terms, latches prevent two processes from simultaneously updating — and
possibly corrupting — the same area of the SGA.
Oracle sessions need to update or read from the SGA for almost all database operations. For
instance:
• When a session reads a block from disk, it must modify a free block in the buffer
cache and adjust the buffer cache LRU chain¹.
• When a session reads a block from the SGA, it will modify the LRU chain.
• When a new SQL statement is parsed, it will be added to the library cache within the
SGA.
• As modifications are made to blocks, entries are placed in the redo buffer.
• The database writer periodically writes buffers from the cache to disk (and must
update their status from “dirty” to “clean”).
• The redo log writer writes entries from the redo buffer to the redo logs.
Latches prevent any of these operations from colliding and possibly corrupting the SGA.
How Latches Work
Because the duration of operations against memory is very small (typically on the order of
nanoseconds) and the frequency of latch requests very high, the latching mechanism needs to
be very lightweight. On most systems, a single machine instruction known as “test and set” is
used to check whether the latch is taken (by examining a specific memory address) and, if it is
not, to acquire it (by changing the value at that address).
If the latch is already in use, Oracle assumes that it will not be in use for long. So, rather
than entering a passive wait (that is, relinquishing the CPU and going to sleep), Oracle will
retry the operation a number of times before giving up and entering a passive wait. This
algorithm is called acquiring a spinlock, and the number of “spins” before sleeping is
controlled by the Oracle initialization parameter “_spin_count”.
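If you want to check the current spin count, note that hidden parameters such as
_spin_count do not appear in v$parameter. A common (though undocumented and
unsupported) sketch, run as SYS, reads it from the x$ksppi and x$ksppcv fixed tables:

-- Hidden parameters live in the x$ fixed tables; run as SYS.
-- x$ layouts are undocumented and may change between releases.
SELECT i.ksppinm  parameter_name,
       v.ksppstvl current_value
  FROM x$ksppi i, x$ksppcv v
 WHERE i.indx = v.indx
   AND i.ksppinm = '_spin_count';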
The first time a session fails to acquire the latch by spinning, it goes to sleep, attempting to
acquire the latch again after 10 milliseconds. Subsequent sleeps increase in duration and in
extreme circumstances may exceed one second. In a system suffering from intense latch
contention, these sleeps will have a severe impact on response time and throughput.
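Sessions currently sleeping on a latch can be observed through the wait interface. A sketch
(for the Oracle8i/9i “latch free” wait event, where P2 holds the latch number, which can be
decoded via v$latchname):

-- Show sessions currently waiting on a latch and which latch it is
SELECT w.sid, n.name latch_name, w.seconds_in_wait
  FROM v$session_wait w, v$latchname n
 WHERE w.event = 'latch free'
   AND w.p2 = n.latch#;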
¹ The LRU (Least Recently Used) chain records how recently individual blocks have been accessed. If a
block is not accessed, it moves towards the LRU end of the list and is eventually flushed from the cache.
Causes of Contention for Specific Latches
The latches that most frequently affect performance are those protecting the buffer cache,
areas of the shared pool and the redo buffer.
• Library cache latches: These latches protect the library cache in which sharable
SQL is stored. In a well-designed application there should be little or no contention for
these latches, but in an application that uses literals instead of bind variables (for
instance “WHERE surname=’HARRISON’” rather than “WHERE
surname=:surname”), library cache latch contention is common.
• Redo copy/redo allocation latches: These latches protect the redo log buffer,
which buffers entries made to the redo log. Recent improvements (from Oracle 7.3
onwards) have reduced the frequency and severity of contention for these latches.
• Shared pool latches: These latches are held when allocations or de-allocations of
memory occur in the shared pool. Prior to Oracle 8.1.7, the most common cause of
shared pool latch contention was an overly large shared pool and/or failure to make
use of the reserved area of the shared pool².
• Cache buffers chains latches: These latches are held when sessions read or write
buffers in the buffer cache. In Oracle8i there are typically a very large number of
these latches, each of which protects only a handful of blocks. Contention on these
latches is typically caused by concurrent access to a very “hot” block, and the most
common type of hot block is an index root or branch block (since any index-based
query must access the root block); see the sketch after this list for locating hot child latches.
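Because these latches have many child latches, the aggregate figures in v$latch can hide a
single hot child. A sketch for locating the busiest cache buffers chains children (each ADDR
can then be matched to the buffers it protects via the undocumented x$bh fixed table):

-- Rank cache buffers chains child latches by sleeps
SELECT addr, child#, gets, sleeps
  FROM v$latch_children
 WHERE name = 'cache buffers chains'
 ORDER BY sleeps DESC;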
Measuring Latch Contention
Ratio-based Techniques
Conventional wisdom in the mid-90s was to focus on the latch “miss” rate to determine the
degree of latch contention.
For each latch, the v$latch view records the number of gets (requests for the latch), misses
(times the first request failed) and sleeps (times a session failed to obtain the latch by
spinning). In the past, queries such as the following were often used to determine
latch health:
SQL> select name, gets, misses, misses*100/gets misspct from v$latch where gets>0;
NAME                                            GETS      MISSES MISSPCT
---------------------------------------- ----------- ----------- -------
latch wait list                                   32           1    3.13
process allocation                                28           0     .00
session allocation                         1,223,068          84     .01
session switching                              8,009           0     .00
process group creation                            40           0     .00
session idle bit                           2,426,940           1     .00
shared java pool                               1,188           0     .00
event group latch                                 28           0     .00
messages                                   2,128,851         461     .02
enqueues                                   3,168,279           7     .00
enqueue hash chains                        1,747,312           2     .00
channel handle pool latch                         40           0     .00
channel operations parent latch                  158           3    1.90
message pool operations parent latch               3           0     .00
file number translation table                    946           0     .00
mostly latch-free SCN                              1           0     .00
cache buffers lru chain                      343,827           0     .00
active checkpoint queue latch                174,293           0     .00
checkpoint queue latch                     1,459,929          26     .00
cache buffers chains                      13,851,179      16,254     .12
cache buffer handles                              61           0     .00
This approach was flawed on a number of levels:
• It is actually the number of sleeps that most accurately reflects the impact of latch
contention on response time.
• A high miss rate is expected for certain latches.
• A latch with a high miss rate (or sleep rate) that is not frequently accessed is probably
not impacting performance.
• Even if a latch is experiencing a high sleep rate, we can’t determine the impact on
performance without taking into account waits for other resources. If sessions spend
90% of their time waiting for IO, 8% on CPU and 2% waiting for latches, then halving
the latch sleep time yields only a 1% improvement in response time, which is probably
not noticeable.
In the above example the “latch wait list” latch has the highest miss rate. However, this is
largely irrelevant since that latch was requested only 32 times, while the “cache buffers
chains” latch appears to have only a moderate miss rate, but has been requested almost
14 million times and, as we shall see, is the latch most affecting performance.
Wait Interface-based Techniques
A better approach to estimating the impact of latch contention is to consider the relative
amount of time being spent waiting for latches. The following query gives us some indication
of this:
SELECT event, time_waited,
       round(time_waited*100/SUM(time_waited) OVER (), 2) wait_pct
  FROM (SELECT event, time_waited
          FROM v$system_event
         WHERE event NOT IN
               ('Null event',
                'client message',
                'rdbms ipc reply',
                'smon timer',
                'rdbms ipc message',
                'PX Idle Wait',
                'PL/SQL lock timer',
                'file open',
                'pmon timer',
                'WMON goes to sleep',
                'virtual circuit status',
                'dispatcher timer',
                'SQL*Net message from client',
                'parallel query dequeue wait',
                'pipe get')
        UNION
        SELECT NAME, VALUE
          FROM v$sysstat
         WHERE NAME LIKE 'CPU used when call started')
 ORDER BY 2 DESC;
EVENT                          TIME_WAITED   WAIT_PCT
------------------------------ ----------- ----------
latch free                           40144      31.67
CPU used when call started           30341      23.94
control file sequential read         12341       9.74
direct path read                     11933       9.41
control file parallel write           6487       5.12
file identify                         5666       4.47
log file sync                         3492       2.75
log file parallel write              3213       2.53
instance state change                3064       2.42
log file switch completion           3049       2.41
db file sequential read              2290       1.81
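If needed, the same wait time can also be attributed to individual sessions via
v$session_event, the per-session analog of v$system_event; a sketch:

-- Which sessions are accumulating "latch free" wait time?
SELECT s.sid, s.username, e.time_waited
  FROM v$session_event e, v$session s
 WHERE s.sid = e.sid
   AND e.event = 'latch free'
 ORDER BY e.time_waited DESC;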
Now we can look at the sleeps in v$latch to determine which latches are likely to be
contributing most to this problem:
SQL> select name, gets, sleeps,
            sleeps*100/sum(sleeps) over() sleep_pct, sleeps*100/gets sleep_rate
       from v$latch where gets>0
      order by sleeps desc;
NAME                                        GETS       SLEEPS SLEEP_PCT SLEEP_RATE
------------------------------------ ----------- ------------ --------- ----------
cache buffers chains                  13,863,552       38,071     99.48      .2746
session allocation                     1,223,982          110       .29      .0090
checkpoint queue latch                 1,461,039           39       .10      .0027
library cache                          9,239,751           22       .06      .0002
shared pool                              869,652           16       .04      .0018
messages                               2,130,515            6       .02      .0003
redo writing                           1,330,987            6       .02      .0005
latch wait list                               33            0       .00      .0000
session switching                          8,014            0       .00      .0000
session idle bit                       2,428,851            0       .00      .0000
enqueues                               3,171,018            0       .00      .0000
channel handle pool latch                     40            0       .00      .0000
message pool operations parent latch           3            0       .00      .0000
mostly latch-free SCN                          1            0       .00      .0000
Now we are in a position to draw some reasonable conclusions:
• Latch sleeps contribute about 30% of database response time, which is clearly
excessive, and
• it is the cache buffers chains latches that contribute the vast majority of these
waits.
Note that if we had used the conventional “ratio based” analysis outlined in the previous
section, we would have discounted the cache buffers chains latches as a problem because the
miss rate was “only” 0.12%.
Tuning the Application to Avoid Latch Contention
There are several things we can do in our application design to reduce contention for
latches.
Using Bind Variables
As noted earlier, failure to use bind variables within an application is the major cause of
library cache latch contention. All Oracle applications should make use of bind variables
whenever possible.
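For example, in SQL*Plus (the table and column names here are purely illustrative):

-- Literal version: each distinct surname produces a new statement that
-- must be parsed and stored separately in the library cache
SELECT * FROM customers WHERE surname = 'HARRISON';

-- Bind-variable version: all surnames share a single cached cursor
VARIABLE surname VARCHAR2(30)
EXEC :surname := 'HARRISON'

SELECT * FROM customers WHERE surname = :surname;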
However, all is not lost if you are unable to modify your application code. From 8.1.6
onwards you can use the “CURSOR_SHARING” parameter to make Oracle rewrite SQL
on the fly to use bind variables. A setting of FORCE causes all literals to be converted to bind
variables. A setting of SIMILAR causes a statement to be rewritten only if doing so would not
change the statement’s execution plan (which can happen if there are histogram statistics
defined on a column referenced in the WHERE clause).
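A sketch of setting the parameter (when Oracle rewrites a statement, system-generated bind
names such as :"SYS_B_0" appear in v$sql):

-- Rewrite literals for the current session only
ALTER SESSION SET cursor_sharing = FORCE;

-- Or instance-wide (SIMILAR requires Oracle9i or later)
ALTER SYSTEM SET cursor_sharing = SIMILAR;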
Avoiding Hot Blocks
Cache buffers chains latch contention is one of the most intractable forms of latch contention.
There are a couple of things you can do at the application level to reduce its severity.
Firstly, identify the blocks that are “hot.” Metalink note 163424.1, “How to Identify a Hot
Block Within The Database,” describes how to do this. Having identified the hot block, you
will most likely find that it is an index root or branch block. If so, there are two application
design changes that may help (sketched in code after the list below).
1) Consider partitioning the table and using local indexes. This might allow you to
spread the heat amongst multiple indexes (you will probably want to use a hash
partition to ensure an even spread of load amongst the partitions).
2) Consider converting the table to a hash cluster keyed on the columns of the index.
This allows the index to be bypassed completely and may also result in some other
performance improvements.
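The following sketch illustrates both options; the table, column and object names are
hypothetical:

-- Option 1: hash-partition the table so that each LOCAL index
-- partition has its own root block, spreading the hot-block load
CREATE TABLE customers (
   customer_id NUMBER,
   surname     VARCHAR2(30))
 PARTITION BY HASH (customer_id) PARTITIONS 8;

CREATE INDEX customers_id_ix ON customers (customer_id) LOCAL;

-- Option 2: a hash cluster keyed on the lookup column, so rows are
-- located by hashing the key rather than by traversing an index
CREATE CLUSTER customers_cluster (customer_id NUMBER)
   SIZE 512 HASHKEYS 1000;

CREATE TABLE customers_hashed (
   customer_id NUMBER,
   surname     VARCHAR2(30))
 CLUSTER customers_cluster (customer_id);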