The only way is automation! (update)

I was a little surprised by the reaction I got to my previous post on this subject. A number of people commented about the problems with automation and many pointed to this very appropriate comic on the subject.

I can draw one of two conclusions from this.

  1. My definition of automation of tasks is very different from other people’s.
  2. It is common for DBAs and middle tier administrators to do everything by hand all the time.

I’m really hoping the answer is option 1, because I think it would be really sad if being a DBA has degenerated to the point where people spend their whole lives doing tasks that could easily be scripted.

So what do I mean when I speak about automation? Most of the time I’m talking about basic scripting. Let’s take an example I went through recently, which involved cloning a database to refresh a test system from production. What did this process entail?

  • Export a couple of tables that contain environment-specific data.
  • Generate a list of ALTER USER commands to reset passwords to their original values in the test system.
  • Shut down the test database.
  • Remove all the existing database files.
  • Create a new password file.
  • Remove the current spfile.
  • Start up the auxiliary DB using a minimal init.ora file.
  • Do an RMAN duplicate. In this case I used an active duplicate as the DB was relatively small. If this were a backup-based duplicate, it would have required an extra step of copying the backups using SCP to somewhere that could be seen on the test server.
  • Replace some environment-specific directory objects.
  • Unregister the old test database from recovery catalog.
  • Register the new test database with the recovery catalog.
  • Remove the old physical backups.
  • Drop all database links and recreate them to point to the correct locations for the test system.
  • Reset the passwords to their original values from the old test system.
  • Lock down all users, except those I’ve been asked to leave open.
  • Truncate and import the tables I exported at the start.

None of those tasks is difficult. It takes only a basic knowledge of shell scripting to let me kick off a single script and come back later to a newly refreshed test environment, along the lines of the sketch below.
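To make that concrete, here’s a very cut-down sketch of the sort of refresh script I mean. It is not the actual script; the SID, paths, schema and table names, passwords and connection strings are all placeholders, and error checking has been stripped out to keep it short.

    #!/bin/bash
    # Cut-down sketch of a test database refresh. All names, paths and
    # passwords are placeholders, and error handling is omitted for brevity.
    export ORACLE_SID=TESTDB

    # Preserve the environment-specific tables (Data Pump directory assumed).
    expdp system/password directory=DATA_PUMP_DIR dumpfile=pre_refresh.dmp \
          tables=APP.APP_CONFIG,APP.APP_ENDPOINTS

    # Capture the existing test passwords so they can be reapplied afterwards.
    sqlplus -s / as sysdba <<'EOF'
    set pages 0 lines 200 feedback off
    spool /tmp/reset_passwords.sql
    select 'alter user '||name||' identified by values '''||password||''';'
    from   sys.user$
    where  password is not null;
    spool off
    EOF

    # Shut down the old test database and clear out its files.
    sqlplus -s / as sysdba <<'EOF'
    shutdown immediate
    EOF
    rm -f /u02/oradata/TESTDB/*
    rm -f $ORACLE_HOME/dbs/spfileTESTDB.ora
    orapwd file=$ORACLE_HOME/dbs/orapwTESTDB password=password force=y

    # Start the auxiliary instance with a minimal init.ora and duplicate.
    sqlplus -s / as sysdba <<'EOF'
    startup nomount pfile='/home/oracle/scripts/initTESTDB_min.ora'
    EOF
    rman target sys/password@PROD auxiliary / <<'EOF'
    duplicate target database to TESTDB from active database nofilenamecheck;
    EOF

    # Post-duplicate fix-ups: directory objects, db links, recovery catalog,
    # password resets, account lockdown, then reload the preserved tables.
    sqlplus -s / as sysdba @/home/oracle/scripts/post_refresh_fixups.sql
    sqlplus -s / as sysdba @/tmp/reset_passwords.sql
    impdp system/password directory=DATA_PUMP_DIR dumpfile=pre_refresh.dmp \
          table_exists_action=truncate

The specifics will vary from site to site. The point is that every one of those fiddly steps happens the same way, in the same order, every time.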

What’s the alternative? I perform all the same tasks individually, but have to sit there waiting for each step to finish before I can move on to the next. No doubt, during this time I will be distracted by phone calls or colleagues asking me questions, which drastically increases the risk of human error.

When I talk about automation, I’m not talking about some earth-shattering AI system. I’m talking about scripting basic tasks to make myself more efficient.

At times you have to draw a line. There is no point making your automation too clever, because it just becomes a rod for your own back. I’m a DBA, not a software house. This is what people are really warning about, which I did not make clear enough in my first post. If something is liable to change each time you do it, you are better off having a written procedure to work from, reminding you of the necessary steps and how to determine what needs to be done. You can’t become a slave to automation.

Cheers

Tim…

Detective? Crime Writer? DBA? Which are you?

The DBA role can sometimes feel like a mix of detective, crime writer and DBA all thrown together. What do I mean by that? When you hit some problems you have to play detective, trying to find clues to what is going on. Once you’ve started to gather clues, you have to play crime writer and put them together to form a coherent story that explains what is going on. Once you have the outline of your crime story you can start looking at the facts again and see if they fit with your story. If they do, your story may just be correct. If they don’t, you probably need to check the accuracy of the facts and do some rewriting of the story until the two things fit together. Once things seem to fit, you can then get busy trying to arrest the villain, or fix the problem. :) This process is often necessary as part of root cause analysis.

I’ll use a recent case to highlight what I mean. I’m going to simplify things a little or the post will be too long. :)

Detective

We recently had some intermittent, but very severe, performance problems on a system, resulting in the application effectively hanging for some time. As usual, it’s the DBA’s job to prove it’s not the fault of the database. :) A bit of detective work came up with the following facts:

  • Moderate CPU usage on the application and DB servers.
  • Gradual reduction in free memory on both servers.
  • Over time the systems started to swap.
  • Gradual increase in the number of application processes connecting to the database.
  • Lots of I/O waits on the database server.

Crime Writer

So let’s switch to crime writer and think of some stories that could explain this.

Story 1

High load on the application causes lots of database connections, eventually swamping the servers.

Does it fit the facts? The reduction in free memory, eventually leading to swapping, could be caused by an increase in the number of processes on the servers. Lots of processes all doing I/O on the server will probably increase the amount of time spent waiting on I/O as they fight with each other for the disks. That sounds promising.

How do we confirm it? Use AWR reports to compare the load profile (amongst other things) of the system against other times of the day, other days of the week, the same time and day last week, etc.
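If you prefer the command line to clicking through Enterprise Manager, the reports can be pulled with something along these lines. This is just a sketch: the DBID, instance number and snapshot IDs are placeholders you would look up first.

    sqlplus -s / as sysdba <<'EOF'
    -- Find the snapshots covering the period of interest.
    select snap_id, begin_interval_time
    from   dba_hist_snapshot
    order  by snap_id;

    -- Generate a text AWR report for a chosen range (dbid, inst, begin_snap, end_snap).
    set pages 0 lines 1000 long 1000000 trimspool on
    spool /tmp/awr_slow_period.txt
    select output
    from   table(dbms_workload_repository.awr_report_text(1234567890, 1, 101, 102));
    spool off
    EOF

Run it once for the problem window and once for a known-good window, then compare the load profiles side by side.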

Conclusion: The load on the system was not substantially different from other times when the system was running fine. The story doesn’t fit together.

Story 2

Slow I/O is pushing up response times. As new requests come in, the backlog of work gets bigger…

Does it fit the facts? While checking the AWR reports for confirmation of the previous story, we noticed the average times of certain physical I/O operations were about 30ms. Normally the average is <1ms. Checking the event histogram section of the AWR report, we could see a very wide spread of timings for physical I/O operations, including some in excess of 1 second. After a few minutes we started to see “Warning: aiowait timed out x times” messages in the alert log (MOS note [ID 222989.1]), suggesting some of those I/O operations were taking longer than 10 minutes. :(
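You don’t have to wait for the next snapshot to see the spread either. A quick sketch like this (the event names are just examples) shows the instance-wide histogram of I/O wait times since startup, and DBA_HIST_EVENT_HISTOGRAM gives the same picture historically.

    sqlplus -s / as sysdba <<'EOF'
    -- Buckets of wait time (ms) for single-block and multi-block reads.
    select event, wait_time_milli, wait_count
    from   v$event_histogram
    where  event in ('db file sequential read', 'db file scattered read')
    order  by event, wait_time_milli;
    EOF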

Now we have a new fact to add to the previous list:

  • Some of the individual I/O waits are extremely long!

We know the application layer of this application is not too clever. If a request comes in and there is an idle application process available, it will be reused. If no idle application process is available, a new one is spawned to process the request. The only limit on this is the maximum number of processes for that user at the OS level.

If the disk I/O is slow, application requests will take a long time to complete. If application requests are taking a long time, the likelihood of new application requests finding an idle application process to reuse is reduced, so the incidence of new processes being spawned increases. As new processes are spawned, we see a gradual increase in memory usage, leading eventually to swap. At the same time, we see an increase in database connections, which require processes, which use memory… You can see where this is going…

Conclusion: This story does seem to fit the facts. Essentially, memory, swap and the number of processes are the symptom, not the cause of the problem. The root cause could be the I/O performance issue.

Catching the Criminal

Now, I said this was an intermittent problem. The next time it occurred I was prepared and watched events unfold using Luca Canali’s nifty Latency Heat Map Visualization, which prompted this quick post. The result was that we had pretty solid evidence to suggest the problems we were experiencing were nothing to do with the application or the database, but were the result of poor I/O performance. The baton was passed to the system administrators and the network and storage teams to try and bottom out the problem. That process is still ongoing, but the smoking gun seems to be the storage network layer.

Conclusion

Over time you build up a level of experience that allows you to spot patterns and quickly get to the root of problems. To observers it can seem almost mystical, which is why we grunt DBAs look at those database performance gurus with awe, but it all comes down to root cause analysis. The more you do this stuff, the better you get at it. Stop doing it for a while and you get rusty.

Detective? Crime Writer? DBA? Which are you? You should be all three. :)

Cheers

Tim…

Real DBAs use Grid Control…

Hopefully the title got your attention. Of course it could have read, “Real Linux Sysadmins use Cobbler and Puppet…”, or any number of comparable statements and products. The point being, there is a gradual evolution in the way we approach tasks, and if we don’t move with it we marginalize ourselves to the point where we are so unproductive we cease to be of use.

A few years ago I was doing a lot of Linux installations and I got sick of running around with CDs, so started doing network installations to save time. I’ve been doing loads of installs on VMs at home recently, so I started doing PXE Network Installations, which saved me even more time. As a result of the article I wrote about that, Frits Hoogland pointed me in the direction of Cobbler, which makes PXE installations real easy (once you get to grips with it). I’m not a sysadmin, so why do I care? Even when I’m installing and running a handful of VMs at home I can see productivity gains by using some of these tools. Imagine the impact in a data-center!
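For the curious, once the installation media is staged, the Cobbler side of a PXE setup boils down to a handful of commands, something like the sketch below. The names and paths are placeholders, and newer Cobbler releases rename --kickstart to --autoinstall.

    # Register the installation media, attach a kickstart file to a profile,
    # then rebuild the PXE/TFTP configuration.
    cobbler distro add --name=ol6-x86_64 \
            --kernel=/var/distro/ol6/images/pxeboot/vmlinuz \
            --initrd=/var/distro/ol6/images/pxeboot/initrd.img
    cobbler profile add --name=ol6-db-server --distro=ol6-x86_64 \
            --kickstart=/var/lib/cobbler/kickstarts/ol6-db-server.ks
    cobbler sync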

So back to Grid Control. Does anyone remember the days when you kept a “tail -f” on your alert log? At one site I used to have a CDE workspace on an X station just running tails. Then the number of instances got too big, so I used to scan through the alert logs each day to look for issues. The next step was to use shell scripts to check for errors and mail me. This was a pain at one site where I was using Solaris, HP-UX and Windows, which meant I needed three solutions. Then the Oracle 9i Enterprise Manager with the Management Server came into my life. All of a sudden it could manage my alert logs and I could assume everything was fine ( :) ) unless I got a notification email. This feature alone sold me on the 9i management server.
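For reference, those alert log checks were nothing more sophisticated than the skeleton below, stuck in cron. The paths, SID and mail address are placeholders.

    #!/bin/bash
    # Skeleton alert log check: mail any new ORA- errors since the last run.
    ALERT_LOG=/u01/app/oracle/admin/PROD/bdump/alert_PROD.log
    MARKER=/home/oracle/scripts/alert_PROD.lines

    # Remember how far through the file we got last time.
    OLD_COUNT=$(cat "$MARKER" 2>/dev/null || echo 0)
    NEW_COUNT=$(wc -l < "$ALERT_LOG")
    echo "$NEW_COUNT" > "$MARKER"

    # Only examine lines added since the previous run.
    tail -n +"$((OLD_COUNT + 1))" "$ALERT_LOG" | grep "ORA-" > /tmp/alert_errors.$$
    if [ -s /tmp/alert_errors.$$ ]; then
      mailx -s "PROD alert log errors" dba@example.com < /tmp/alert_errors.$$
    fi
    rm -f /tmp/alert_errors.$$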

Back then, being a DBA and admitting to using Enterprise Manager was a little like announcing to the world you were into cross-dressing. :) Time has moved on, the product name has changed and so has its functionality, but essentially it’s still doing the same thing, which is reducing the effort needed to manage databases (and other things). The difference is that rather than managing 40 instances, teams are now managing thousands of instances.

Of course, none of this is new. I guess it’s just been brought into focus by a few things that have happened to me recently, like the PXE/Cobbler thing, the recent demise of my Grid Control VM at home and the constant talk of cloud computing and SaaS etc.

Specialists and performance consultants have the time to obsess over minute detail. The day-to-day DBAs and sysadmins have to churn through work at a pace, with reliable and reproducible results. Failing to embrace tools, whatever they are, to aid this is career suicide.

Cheers

Tim…

The Oracle DBA… A dying breed?

I’ve updated the site notes for my OpenWorld unconference session. If you fancy having a look, it’s on the Oracle Wiki.

It’s quite hard to summarize the conversation, but I think I got the gist of it down. Remember, the notes are trying to convey the opinions of the people present. I’m not saying this was a representative sample of people, but the opinions and attitudes were certainly quite interesting to me.

Cheers

Tim…

It’s not simple, so don’t claim it is!

Rant Alert. The following is an unreasoned attack on the IT community in order to vent my frustration. I’m not claiming it makes any sense or it’s factually correct. It’s just how I feel today. Maybe I’ll feel different tomorrow…

I can’t help feeling that companies like Oracle are doing the IT world a major disservice by trying to make out that their products are easy to use. I have a quick newsflash… They are not!

This post is really a response to two things:

  1. My current work situation.
  2. Some of the questions I field on my forum.

From a work perspective, the mass exodus of people from my current company has left me having to deal with bits of technology that aren’t really my bag. It gets doubly annoying when I’m having to use bad support services to help me do really basic tasks. If software and hardware vendors were honest and made customers aware that they would need trained professionals to deal with this crap, perhaps people like me wouldn’t be left fumbling in the dark, trying to pick up the pieces.

From the Oracle forum side of things, I’ve really noticed a shift over the last few years and I’ve written about it before. The same types of questions are being asked as they always were. The difference is that in the past these questions were being asked by people trying to learn the technology. Now they seem to come from people who are employed as DBAs and developers by companies. I don’t believe the intellectual capacity of people has dropped over the years. I just think companies are employing under-skilled people to save money, or expecting people to cover roles they are not qualified to do. You wouldn’t let an electrician fix your plumbing, so why would you let this happen?

I don’t claim to know the answers, but I can see that the constant barrage of “point-and-click”, “intuitive” and “self-tuning” marketing messages is leading people to believe they don’t need qualified staff, and the result is a whole bunch of people asking how to recover their production databases from incomplete backups.

IT is getting more complicated and the range of skills needed in a company is getting bigger by the year. Companies need to be made to understand this or they will constantly be finding themselves in the shit!

Cheers

Tim…