Dealing with Latch and Mutex Contention
Contention is the proverbial bottleneck: when multiple database sessions compete for limited or serialized resources, the amount of work that the database can do is constrained. Some forms of contention are the result of programming practices: in particular, contention for locks is usually a consequence of application design. By comparison, latches and mutexes are internal Oracle mechanisms, and contention for them can be harder to diagnose and resolve.
In this chapter, we see how latches and mutexes work and why they are a necessary part of the Oracle architecture. We then discuss how to diagnose the root causes of latch and mutex contention and explore remedies to common contention scenarios.
Overview of Latch and Mutex Architecture
Anyone who has ever worked with a relational database, and particularly with Oracle, is probably comfortable with the principle of database locks. Locks are an essential mechanism in any transactional multiuser database system: the Atomic, Consistent, Isolated, Durable (ACID) properties of a transaction can be implemented only by restricting simultaneous changes to table data. This restriction is achieved by placing locks on modified data.
Latches and mutexes are similar to locks, but instead of restricting simultaneous access to data in Oracle tables, they restrict simultaneous access to data in Oracle shared memory. A somewhat simplistic way of thinking about this is that whereas locks prevent corruption of data on disk, latches and mutexes prevent corruption of data in shared memory.
Oracle sessions share information in the buffer cache, shared pool, and other sections of the shared memory known as the system global area (SGA). It’s essential that the integrity of SGA memory is maintained, so Oracle needs a way to prevent two sessions from trying to change the same piece of shared memory at the same time. Latches and mutexes serve this purpose.
The very nature of latches and mutexes creates the potential for contention. If one session is holding a latch that is required by another session, then the sessions concerned are necessarily contending for the latch. Latch contention is therefore one of the most prevalent forms of Oracle contention.
Let’s spend a little time going over the latch and mutex implementation in Oracle before looking at specific contention scenarios.
What Are Latches?
Latches are serialization mechanisms that protect areas of Oracle’s shared memory (the SGA). In simple terms, latches prevent two processes from simultaneously updating—and possibly corrupting—the same area of the SGA.
Latches protect shared memory structures from the following situations:
- Concurrent modification by multiple sessions leading to corruption
- Data being read by one session while being modified by another session
- Data being aged out of memory while being accessed
Oracle sessions need to update or read from the SGA for almost all database operations. For example:
- When a session reads from a database file, it often stores the block into the buffer cache in the SGA. A latch is required to add the new block.
- If a block of data exists in the buffer cache, a session reads it directly from there rather than from disk. Latches are used to “lock” the buffer for a very short time while it is being accessed.
- When a new SQL statement is parsed, it is added to the library cache within the SGA. Latches or mutexes prevent two sessions from adding or changing the same SQL.
- As modifications are made to data blocks, entries are placed in a redo buffer before being written to the redo log. Access to the redo buffers is protected by redo allocation latches.
- Oracle maintains arrays of pointers to lists of blocks in the buffer cache. Modifications to these lists are themselves protected by latches.
Latches and mutexes prevent these operations—and many others—from interfering with each other and possibly corrupting the SGA.
Latches typically protect small groups of memory objects. For instance, each cache buffers chains latch protects a group of blocks in the buffer cache, perhaps a few dozen. Unlike locks, which can protect even a single row, latches and mutexes almost always cover many objects at once: a single latch might protect hundreds or thousands of table rows, and a single mutex might protect dozens of SQL statements.
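The grouping described above can be pictured with a minimal Python sketch. The hash function, the latch count, and the names latch_for and group_blocks are illustrative assumptions, not Oracle's actual algorithm; the point is only that unrelated blocks can end up behind the same latch.

```python
from collections import defaultdict

N_LATCHES = 8  # hypothetical value chosen to keep the output small


def latch_for(file_no: int, block_no: int, n_latches: int = N_LATCHES) -> int:
    """Map a block address to the latch guarding it (illustrative hash only)."""
    return (file_no * 1_000_003 + block_no) % n_latches


def group_blocks(blocks, n_latches: int = N_LATCHES):
    """Return {latch_id: [block addresses]} to show which blocks share a latch."""
    groups = defaultdict(list)
    for file_no, block_no in blocks:
        groups[latch_for(file_no, block_no, n_latches)].append((file_no, block_no))
    return dict(groups)


# 32 blocks from one file spread across 8 latches: 4 blocks per latch.
blocks = [(1, b) for b in range(32)]
groups = group_blocks(blocks)
for latch_id in sorted(groups):
    print(latch_id, groups[latch_id])
```

Two sessions working on different blocks that hash to the same latch still have to take turns, which is why a "hot" block can slow access to its neighbours in the same group.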
Because operations against memory are very short (typically on the order of nanoseconds) and memory requests are potentially very frequent, the latching mechanism needs to be very lightweight. On most systems, a single machine instruction called test-and-set is used to check whether the latch has already been taken (by looking at a specific memory address) and, if it has not, to acquire it (by changing the value at that address). However, there may be hundreds of lines of Oracle code surrounding this “single machine instruction.”
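As a rough illustration of the test-and-set idea, the following Python sketch models a latch as a single word that is 0 when free and 1 when taken. Python has no atomic test-and-set instruction, so an internal threading.Lock stands in for the hardware guarantee; the class name LatchWord and its methods are hypothetical.

```python
import threading


class LatchWord:
    """A one-word latch: 0 means free, 1 means taken."""

    def __init__(self):
        self.value = 0
        # Stand-in for hardware atomicity; a real latch relies on an
        # atomic CPU instruction rather than another lock.
        self._atomic = threading.Lock()

    def test_and_set(self) -> int:
        """Atomically set the word to 1 and return the previous value."""
        with self._atomic:
            old = self.value
            self.value = 1
            return old

    def try_get(self) -> bool:
        """One get attempt: succeeds only if the word was previously 0."""
        return self.test_and_set() == 0

    def release(self) -> None:
        """Mark the latch as free again."""
        with self._atomic:
            self.value = 0


latch = LatchWord()
first = latch.try_get()   # the latch was free, so this attempt succeeds
second = latch.try_get()  # the latch is already held, so this attempt fails
latch.release()
third = latch.try_get()   # free again after the release
```

Note that a failed attempt leaves the word at 1, which is harmless: the word was already 1, so the holder's state is unchanged.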
If a latch is already in use, Oracle assumes that it will not be in use for long, so rather than go into a passive wait (relinquish the CPU and go to sleep), Oracle might retry the operation a number of times before giving up and sleeping. This algorithm is called acquiring a spin lock. Each attempt to obtain the latch is referred to as a latch get, each failure is a latch miss, and sleeping after spinning on the latch is a latch sleep.
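The spin-then-sleep behaviour can be sketched in Python as follows. This is a simplified model under stated assumptions, not Oracle's implementation: a non-blocking threading.Lock acquire stands in for test-and-set, the sleep is a plain timer sleep, and the accounting assumes one get per request, one miss per request whose first attempt fails, and one sleep each time the spin budget is exhausted. The class name SpinLatch and its default values are hypothetical.

```python
import threading
import time


class SpinLatch:
    """Spin-then-sleep latch with get/miss/sleep counters (illustrative only)."""

    def __init__(self, spin_count: int = 2000, sleep_seconds: float = 0.001):
        self._lock = threading.Lock()  # acquire(blocking=False) stands in for test-and-set
        self.spin_count = spin_count   # must be at least 1
        self.sleep_seconds = sleep_seconds
        self.gets = 0
        self.misses = 0
        self.sleeps = 0

    def get(self) -> None:
        missed = False
        sleeps = 0
        while True:
            # Spin: retry without giving up the CPU voluntarily.
            for _ in range(self.spin_count):
                if self._lock.acquire(blocking=False):
                    # We now hold the latch, so updating the counters is safe.
                    self.gets += 1
                    self.misses += int(missed)
                    self.sleeps += sleeps
                    return
                missed = True
            # Spin budget exhausted: relinquish the CPU (a latch sleep).
            sleeps += 1
            time.sleep(self.sleep_seconds)

    def release(self) -> None:
        self._lock.release()


latch = SpinLatch(spin_count=100)
latch.get()  # uncontended: a get with no miss


def worker_session():
    latch.get()      # contended: spins, then sleeps until the holder releases
    latch.release()


worker = threading.Thread(target=worker_session)
worker.start()
time.sleep(0.1)  # keep holding the latch so the worker has to spin and sleep
latch.release()
worker.join()
print(latch.gets, latch.misses, latch.sleeps)
```

A high ratio of sleeps to misses in a model like this indicates that spinning rarely pays off, which is the same signal a ratio of latch sleeps to latch misses gives on a real system.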
A session can awaken from a sleep in one of two ways. Either the session awakens automatically after a period of time (a timer sleep), or it is awoken when the latch becomes available. In modern releases of Oracle, sleeping sessions are generally woken by a signal rather than after waiting for a fixed amount of time. The waiting session places itself on the latch wait list. When another session relinquishes the latch in question, it looks at the latch wait list and sends a signal to the sleeping session indicating that the latch is now available. The sleeping session immediately wakes up and tries to obtain the latch.
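The wait-list mechanism can be sketched in Python as follows. This is a minimal model under assumptions: each sleeping session is represented by a threading.Event on a queue, a small guard lock stands in for the atomic operations a real latch would use, and the spin phase is omitted to keep the wake-up path visible. The name PostedLatch is hypothetical.

```python
import threading
from collections import deque


class PostedLatch:
    """Latch whose waiters sleep on a wait list and are woken by a signal."""

    def __init__(self):
        self._held = False
        self._waiters = deque()         # the latch wait list
        self._guard = threading.Lock()  # protects _held and _waiters

    def get(self) -> None:
        while True:
            with self._guard:
                if not self._held:
                    self._held = True
                    return
                # Latch is busy: join the wait list before going to sleep.
                wake = threading.Event()
                self._waiters.append(wake)
            wake.wait()  # sleep until the releasing session signals us
            # Woken up: loop and try again, since another session may win.

    def release(self) -> None:
        with self._guard:
            self._held = False
            waiter = self._waiters.popleft() if self._waiters else None
        if waiter is not None:
            waiter.set()  # signal the first sleeping session


latch = PostedLatch()
order = []


def session(n: int) -> None:
    latch.get()
    order.append(n)  # safe: only the current latch holder appends
    latch.release()


latch.get()  # the main session holds the latch first
workers = [threading.Thread(target=session, args=(n,)) for n in range(3)]
for w in workers:
    w.start()
latch.release()  # signals the first waiter, if one has queued yet
for w in workers:
    w.join()
print(sorted(order))
```

Because a waiter joins the list while the guard is held, a release cannot slip in between the "latch is busy" check and the enqueue, so no wake-up signal is lost.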
Historically, every latch request would repeatedly attempt to acquire the latch before relinquishing the CPU. Because latches are held for extremely short periods of time, it can make more sense to stay on the CPU and keep trying rather than to surrender the CPU and force a relatively expensive context switch. The process of repeatedly attempting to acquire the latch is known as spinning.
Some latches must be acquired exclusively, while others may be acquired in shared read mode. The shareable latches may still be acquired in exclusive mode should the Oracle code determine that shared access is not appropriate.
In modern Oracle (11g and 12c), attempts to acquire a latch in exclusive mode normally result in 20,000 spin attempts before going onto the latch wait list. In other circumstances (such as an exclusive mode get on a shareable latch), the process may spin only 2,000 times or (for shared mode requests, for example) spin only a couple of times or not at all.
Most of the high-volume latch requests are made in exclusive mode, so most of the time a latch miss results in 20,000 spin gets before a latch sleep occurs.
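The spin counts just described can be summarized as a small lookup, sketched here in Python. The 20,000 and 2,000 figures come from the text above; the value used for shared-mode requests is an assumed placeholder for "a couple of times or not at all", and the enum and function names are hypothetical.

```python
from enum import Enum


class LatchKind(Enum):
    EXCLUSIVE_ONLY = "exclusive-only latch"
    SHAREABLE = "shareable latch"


class Mode(Enum):
    EXCLUSIVE = "exclusive"
    SHARED = "shared"


SHARED_MODE_SPINS = 2  # assumption: stands in for "a couple of times or not at all"


def spin_budget(kind: LatchKind, mode: Mode) -> int:
    """Return how many spin attempts precede a latch sleep in this model."""
    if kind is LatchKind.EXCLUSIVE_ONLY:
        if mode is Mode.SHARED:
            raise ValueError("an exclusive-only latch cannot be requested in shared mode")
        return 20_000
    if mode is Mode.EXCLUSIVE:
        return 2_000  # exclusive-mode get on a shareable latch
    return SHARED_MODE_SPINS


for kind in LatchKind:
    for mode in Mode:
        try:
            print(kind.value, mode.value, spin_budget(kind, mode))
        except ValueError as exc:
            print(kind.value, mode.value, "n/a:", exc)
```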
What Are Mutexes?
Originally, all Oracle shared memory serialization mechanisms were referred to as latches. Beginning in Oracle Database 10g, some of the mechanisms were described as mutexes—so what’s the difference, and does it matter?
In computer science, a mutex (MUTual EXclusion) is defined as a mechanism that prevents two processes from simultaneously accessing a critical section of code or memory.
Oracle latches in fact represent an implementation of the mutex pattern, and nobody would have argued had Oracle originally referred to them as mutexes. Regardless of why Oracle originally decided to describe the mechanisms as latches, over time other database vendors have followed suit, and today a latch could arguably be defined as “a mutex mechanism implemented within a database server.”
Although there’s no definitive difference between latches and mutexes, in practice what Oracle calls mutexes are implemented by more fundamental operating system calls that have an even lower memory and CPU overhead than a latch. The primary advantage of mutexes is that there can be more of them, which allows each mutex to protect a smaller number of objects as compared to a latch.
Latch and Mutex Internals
Originally, only the developers of the Oracle software truly understood latching mechanisms, but over the years many smart people have studied and experimented on latches and mutexes. Through their work, we have come to understand at least some of these mechanisms.
Way back in 1999, Steve Adams pioneered much of the research into latch algorithms and published his findings in a small but classic book, Oracle8i Internal Services (O’Reilly, 1999). This book reflected our best understanding of how latches worked in the Oracle 8i release. However, the mechanisms have changed substantially in every release of Oracle, and today the writings of Andrey Nikolaev at http://andreynikolaev.wordpress.com probably represent our most modern understanding of latch internals.
There was a time when it was possible to have a fairly complete understanding of latch internals without being a member of Mensa. However, today the various mechanisms have become so complex and changeable that probably only a handful of people outside of Oracle Corporation (and maybe inside) have a complete grasp of the mechanisms. The rest of us are just hurting our brains trying to keep up with it all!
Luckily, it’s not necessary to understand the details of latch/mutex algorithms. The root causes of latch contention typically remain constant, even while the internal algorithms are continuously being tweaked, and the solutions almost always involve alleviating these root causes rather than tweaking the internal algorithms. Those root causes generally relate to multiple Oracle sessions competing for access to memory structures in the SGA.