The DBA role can sometimes feel like a mix of detective, crime writer and DBA all thrown together. What do I mean by that? When you hit some problems you have to play detective, trying to find clues to what is going on. Once you’ve started to gather clues, you have to play crime writer and put them together to form a coherent story that explains what is going on. Once you have the outline of your crime story you can start looking at the facts again and see if they fit with your story. If they do, your story may just be correct. If they don’t, you probably need to check the accuracy of the facts and do some rewriting of the story until the two things fit together. Once things seem to fit, you can then get busy trying to arrest the villain, or fix the problem. 🙂 This process is often necessary as part of root cause analysis.
I’ll use a recent case to highlight what I mean. I’m going to simplify things a little or the post will be too long. 🙂
We recently had some intermittent, but very severe performance problems on a system, resulting in the application effectively hanging for some time. As usual, it’s the DBA job to prove it’s not the fault of the database. 🙂 A bit of detective work came up with the following facts:
- Moderate CPU usage on the application and DB servers.
- Gradual reduction in free memory on both servers.
- Over time the systems started to swap.
- Gradual increase in number of application processes connecting to the database.
- Lots of I/O waits on the database server.
So let’s switch to crime writer and think of some stories that could explain this.
High load on the application causes lots of database connections, eventually swamping the servers.
Does it fit the facts? The reduction in free memory, eventually leading to swap could be caused by an increase in numbers of processes on the servers. Lots of processes all doing I/O on the server will probably increase the amount of I/O waits as they fight with each other for the disks. That sounds promising.
How do we confirm it? Use the AWR reports to compare the load profiles (amongst other things) of the system compared to other times in the day, days of the week, same time and day last week etc.
Conclusion: The load on the system was not substantially different from other times when the system was running fine. The story doesn’t fit together.
Slow I/O is reducing response time. As new requests come in, the backlog of work is getting bigger…
Does it fit the facts? In checking out the AWR reports looking for confirmation of the previous story we noticed the average times of certain physical I/O operations were about 30ms. Normally the average is <1ms. Checking the event histogram section of the AWR report, we could see a very wide spread of timings for physical I/O operations, including some in excess of 1 second. After a few minutes we started to see Warning “aiowait timed out x times” in alert.log [ID 222989.1] messages in the alert log, suggesting some of those I/O events were taking longer than 10 minutes. 🙁
Now we have a new fact to add to the previous list:
- Some of the individual I/O waits are extremely long!
We know the application layer of this application is not too clever. If a request comes and there is an idle application process available, it will be reused. If no idle application process is available, a new one is spawned to process the request. The only limit on this is the max number of processes for that user at the OS level.
If the disk I/O is slow, application requests will take a long time to complete. If application requests are taking a long time, the likelihood of new application requests finding an idle application process to reuse is reduced, so the incidence of new processes being spawned increases. As new processes are spawned, we see a gradual increase in memory usage, leading eventually to swap. At the same time, we see an increase in database connections, which require processes, which use memory… You can see where this is going…
Conclusion: This story does seem to fit the facts. Essentially, memory, swap and the number of processes are the symptom, not the cause of the problem. The root cause could be the I/O performance issue.
Catching the Criminal
Now I said this was an intermittent problem. The next time it occurred I was prepared and watched events unfold using Luca Canali’s nifty Latency Heat Map Visualization, which prompted this quick post. The result of this was we had pretty solid evidence to suggest the problems we were experiencing were nothing to do with the application or database, but were as a result of poor I/O performance. The baton was passed to the system administrators, network and storage team to try and bottom out the problem. That process is still ongoing, but smoking gun seems to be the storage network layer.
Over time you build up a level of experience that allows you to spot patterns and quickly get to the root of problems. To observers it can seem almost mystical, which is why us grunt DBAs look at those database performance gurus with awe, but it all comes down to root cause analysis. The more you do this stuff, the better you get at it. Stop doing it for a while and you get rusty.
Detective? Crime Writer? DBA? Which are you? You should be all three. 🙂