Hey DBA, fix it, but don’t touch it!

Martin Berger posted an interesting tweet the other day.

“what’s the expectation a DBA should do when “something is slow” – but HW and SW is ok and DBA is not responsible (and must not manipulate) schema or statement?”

The thread includes some suggestions, but I want to come at it from a different angle…

I have a lot of sympathy for this situation. In my current role I look after a lot of databases that sit behind 3rd party applications. In many cases, that means we can’t change the code. We can’t play around with the contents of the schema, for fear of losing support. Our hands are, to a greater or lesser extent, tied. This presents an interesting problem because you’re damned if you do, and damned if you don’t.

Thought 1

My attitude to databases behind 3rd party applications might surprise you. I try to stick closely to what the application vendor recommends, even if I know it to be stupid. Why?

  • Support (1) : As soon as you do something that isn’t in the vendor’s recommendations they are going to blame that for the problem. They’ll ask you to switch it all back and test it, which is just a delaying tactic most of the time and really annoying.
  • Support (2) : You want the system to be as familiar as possible to the support staff and any consultants they send on site.
  • Support (3) : I would say go for the vendor’s preferred platform, even if it’s not your preferred platform. You want to be on the same platform as the majority of their customers. If most of their customers are using Oracle on Windows, I’m going to consider it. If most of their customers are on SQL Server, that’s what I want. Unless you’re a massive customer, you are going to be at the bottom of the food chain if you marginalise yourself. Having lived through the death of Oracle on Tru64 and HP-UX, I want to be on the “main platform” for the application thank you very much!
  • Sometimes their crap application expects the DB to be crap. I had one case where the vendor’s approach to gathering stats was extremely poor. I revised it to bring it into this century and the application died. Their bad code needed bad stats to work.

At this point I expect someone to say, “But Tim, you’re supposed to be good at this stuff. I would have expected more from you!” What’s my response? Walk a mile in my shoes. If you have hours to obsess over one system to make it perfect, great for you, just don’t come and work here! 🙂

Thought 2

When you buy a product it should be fit for purpose. Part of that is the vendor being able to give adequate guidance on getting the most out of their system. Also, when you purchase a system, you should be doing your due diligence. Having written this, I understand it is a complete fiction. Why?

  • Everything works great on PowerPoint. Procurement seems to be swayed heavily by the quality of the presentations, not the quality of the products.
  • I don’t think many people really know what they are letting themselves in for until they are so far down the line, backing out would be too embarrassing an option.
  • Sometimes, the best product on the market is absolutely terrible. I have one in mind, which my colleagues will be able to guess, where we had a choice of two products, both of which were absolutely terrible, so we picked the least terrible. In a meeting with our IT director I said, you’ve heard the expression, “You can’t polish a turd, but you can roll it in glitter!” This product is the turd inside that glitter!

Thought 3

Even when it’s not a 3rd party app, you can’t always fix it easily. As an application grows it gets significantly harder to refactor pieces of it. Sure, you can change that little bit, but what are the knock-on effects? If you have good regression tests, great. If not, it can be a risky endeavour. I mentioned in another post it took about six months to find some of the performance issues brought about by an upgrade from 11.2.0.4 to 12.1.0.2. It was stuff that had been missed during the testing and only ran once a year. The bigger the application, the easier it is for something to hide.

Even if you can refactor, do you have the time and resources to do it? Applications aren’t static. Sure, there is pressure to fix performance problems, but there is also pressure to add new functionality. What’s best for one group of people may not be good for others.

Thought 4

Martin’s tweet exemplifies the misunderstanding most managers (not him) have about databases. There’s rarely a magic button you can press that fixes performance problems. The vast majority of the time it comes down to bad database design and/or bad SQL. When you can’t change either, you don’t have much choice other than to wait for the next application patch/upgrade from the vendor that might fix it, or throw hardware at it on the DB layer or the App layer, depending where the problem is. {Insert “Run it on Exadata” comment here!} I really don’t think a lot of people outside the database world understand this!

Oracle has a lot of goodies that can be used to mitigate terrible applications without having to touch them directly. Update: I’m not trying to make out all these features don’t come with their own set of issues too. 🙂

But the real solution is to do proper database design, write good SQL, and write good applications on top of that. Performance is a development issue, and we are all developers now. Yes, even you DBAs. 🙂

Anyway, just some thoughts on the situation Martin found himself in, and I find myself in all the time. 🙂

Cheers

Tim… (Chief Presser of Magic Buttons)

No DBA Required?

Kellyn Pot’Vin-Gorman put out a nice post a few days ago which you can read here. It talks about the future of the DBA, especially in the light of Oracle’s new “fully managed” Database Cloud Service that will be announced soon. I pushed out some links to the post on social media with a “Just Read” message, as I sometimes do, then was hit by a wave of questions and comments about it. I think I’m on the same page as Kellyn where this is concerned and I’ve been saying similar things for quite some time.

I’ve also talked about how the world has changed for PL/SQL and SQL developers.

The reactions I’ve received following all these posts, as well as the comments about Kellyn’s post, can be broken down into the following categories.

  • Denial : The [Apps] DBA role will never die!
  • Panic : Quick, tell me what I should learn today before my family is out on the street.
  • Pragmatic: My role as a DBA has evolved so much over the years, and will continue to do so. I have to continue to adapt or die.

I think from my previous posts you will know I’m in the Pragmatic category. The type of work I did 20 years ago, whilst calling myself a DBA, is drastically different to what I do today. In 10 years’ time my role will be totally different, but I will probably still call myself a DBA (Do Bloody Anything).

At this point someone will chip in with, “We will never move our databases to the cloud so this doesn’t affect me!” This is naive for a couple of reasons. First, you will move *some* things to the cloud. It will happen! Second, the changes to the DBA role will happen regardless of the cloud. Automation is the thing that is altering the lives of DBAs and SysAdmins. Cloud is just another form of automation. If they haven’t already, your company will have to get on board with automation or die. In addition, the products you use will evolve over time, as they have been for years.

You can look at all this from a couple of angles.

  • OMG! I’m going to have to learn something new. What a bloody nightmare! I was hoping to do the exact same thing every day until I die!
  • OMG! This is brilliant! There’s loads of new stuff to learn! When I know this new stuff I’m going to be even more valuable!

Take your pick… 🙂

Cheers

Tim…

PS. It will be interesting to see what Oracle actually come up with at the end of all this… 🙂

Update: Loïc Lefèvre just sent me a link to this article, which is pretty cool!

Update 2: You might want to read this from Thomas LaRock from the SQL Server camp. 🙂

Database Administration : Dead or Alive?

I get this type of question a lot at the moment. It’s not surprising as I’ve done a few things of late that seem to have got people a bit riled up.

  • During my cloud database talks I’ve been saying things like, if you think a DBA’s job is just to install, backup and patch the database, the cloud has taken your job.
  • I happened to mention the Oracle Cloud Apps DBA role does not exist. I thought I made it clear what I was saying, but a number of readers thought I was saying they shouldn’t go to work next week as they’ve been fired.
  • I’ve recently been doing some sessions with a title beginning with “Making the RDBMS relevant again…”, which suggests maybe it isn’t currently relevant.

I’ve been doing Oracle database development and DBA work for nearly 22 years. In that time the job of an Oracle DBA has changed a lot. Despite this, having people who understand what is going on below the surface has remained in demand. If you keep trying to be an old-school DBA you are going to find yourself in a very dark place very quickly. If you keep your ear to the ground and try to move with the times there will always be a role for you. Good people always land on their feet.

The way you move depends on your interests and the demands of your company. Some will move closer to an architecture role related to the infrastructure, which is pretty important when dealing with cloud services, Docker, DevOps, continuous deployment etc. Some will align themselves more closely to development, which is of greater importance in the new world. Others will completely move away from the RDBMS into other technologies related to data or elsewhere.

The next question is typically, “When?” I’m not saying we should all run around screaming and pulling out our hair, but we should also not turn a blind eye to the way the world is changing. I can pretty much guarantee there will be comments by people telling me I’m wrong and the DBA role will exist forever, to which I will reply, “Denial is not just a river in Africa!” 🙂

Some companies, especially those that are more development led, will transform rapidly. DevOps, continuous deployment and technologies like Docker have the power to transform a company, whether on-prem or in the cloud. In all cases, someone needs to help build and maintain the layers that contain databases and app servers, and that could be you, but I don’t see the same volume of work we currently have, because if done properly it should be a build once, deploy many approach. For some companies that are into automation, this is already a reality. Very soon it will be true for many more of us.

Some companies will be slow in moving forward and their staff will wonder what all the fuss is about, until they apply for their next job and realise there isn’t one for them!

Having said all that, I did an “unconference” session at OOW in 2007 called “The Oracle DBA… A dying breed?” and we’re still here now. The important point is you need to take responsibility and shape your own destiny. Don’t sit idly by and watch the world take your job. It’s easier than ever to learn new things and prepare for the future, so do it! 🙂

Cheers

Tim…

The Cloud : They took our jobs!

The title is of course inspired by “They took our jobs!” from South Park.

I’ve been doing some cloud-related talks recently and a pretty regular question is, “How is this going to affect my job as a [DBA | Sysadmin]?”

My answers usually include some of the following points.

  • Back in the old days, we used to spend hours obsessing about redo and rollback/undo and sizing of the individual parts that make up the SGA and PGA. Keeping on top of some of this stuff was a full time job, even for a small number of databases. Over time Oracle have added loads of automated features that mean we don’t have to worry about this stuff for “most” of our databases. So that means fewer DBAs, right? Not really. We are just expected to cope with a lot more stuff now. Rather than looking after 3 databases, we look after hundreds or thousands.
  • For Infrastructure as a Service (IaaS), the cloud is just a basic hosting company. You are still responsible for all system administration and database administration. A move to IaaS doesn’t affect jobs at all. If anything, it probably adds to the demand.
  • For Platform as a Service (PaaS) offerings, like Database as a Service (DBaaS), things may be different. Your level of interaction with the OS and database varies depending on the vendor, but in some cases, you will have zero access to the OS, so there is no system administration, and the level of control over the database is limited. Surely that affects jobs? Well, once again, this has just made life easier, so your company can do more stuff and you will probably be expected to do more.
  • As far as Software as a Service (SaaS) is concerned, as a customer there is no access to the infrastructure, so there is no DBA or sysadmin work. If you want to look after the guts of Fusion Apps, go and get a job with Oracle. 🙂 Even if you don’t have access to the guts of the SaaS system, you are still going to spend a lot of time designing systems to interact with it!
  • The cloud means I no longer have to install operating systems and databases! Well, sometimes I really enjoy doing donkey work, but if you’ve not automated most of this stuff, you are really living in the dark ages. If you have automated it already, then the cloud isn’t really any different to what you are doing now.
  • What the cloud will not do is understand your custom applications and provide the skills needed to diagnose problems and advise on solutions. All the interactions with your developers and support folks will still be necessary. I can’t see a cloud service helping with this sort of stuff ever. The role of a development DBA and the crossover between functional and technical knowledge is actually far more valuable than being able to install a bit of software.

There is no doubt the cloud will affect what we as DBAs and system administrators do, but our jobs have been constantly evolving over the last couple of decades I’ve been involved in IT. As Francisco said recently, “These days, DBA stands for Database Architect”, which I think is kind-of true. A decade ago I just did Oracle databases. Now I do Oracle, SQL Server and MySQL databases. I look after WebLogic, Tomcat, IIS and Apache App/Web servers. I’m helping to set up load balancers. I get involved in infrastructure projects for applications and middleware. It’s not that I’m awesome at any of this stuff, but as a DBA and/or system administrator you get exposed to so much, which makes you an ideal resource to help with this architectural stuff.

If you think a DBA just installs Oracle, creates databases and checks backups, your job will be gone soon. If you are a system administrator that just installs operating systems and does patches, your job will be gone soon. These are trivial tasks that anyone can learn in a few weeks, so you should hardly be surprised they can be automated out of existence. If instead you concentrate on the skills where you add true value to your company, you will be in demand for a long time!

I know it’s a bit of a random post, but I hope you can see where I’m coming from! 🙂

Cheers

Tim…

The only way is automation! (update)

I was a little surprised by the reaction I got to my previous post on this subject. A number of people commented about the problems with automation and many pointed to this very appropriate comic on the subject.

I can draw one of two conclusions from this.

  1. My definition of automation of tasks is very much different to other people’s.
  2. It is common for DBAs and middle tier administrators to do everything by hand all the time.

I’m really hoping the answer is option 1, because I think it would be really sad if being a DBA has degenerated to the point where people spend their whole life doing tasks that could be easily scripted.

So what do I mean when I speak about automation? Most of the time I’m talking about basic scripting. Let’s take an example I went through recently, which involved cloning a database to refresh a test system from production. What did this process entail?

  • Export a couple of tables that contain environment-specific data.
  • Generate a list of ALTER USER commands to reset passwords to their original values in the test system.
  • Shut down the test database.
  • Remove all the existing database files.
  • Create a new password file.
  • Remove the current spfile.
  • Start up the auxiliary DB using a minimal init.ora file.
  • Do an RMAN duplicate. In this case I used an active duplicate as the DB was relatively small. If this were a backup-based duplicate, it would have required an extra step of copying the backups using SCP to somewhere that could be seen on the test server.
  • Replace some environment-specific directory objects.
  • Unregister the old test database from recovery catalog.
  • Register the new test database with the recovery catalog.
  • Remove the old physical backups.
  • Drop all database links and recreate database links to point to the correct location for the test system.
  • Reset the passwords to their original values from the old test system.
  • Lock down all users, except those I’ve been asked to leave open.
  • Truncate and import the tables I exported at the start.

None of those tasks are difficult. With only a basic knowledge of shell scripting, I can kick off a single script and come back later to see my newly refreshed test environment.
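As a rough illustration, the whole refresh can hang off a single driver script. This is only a sketch, not my actual script: every name in it (TESTDB, the paths, the helper script and parameter file names) is a hypothetical placeholder, and it defaults to a dry run that prints the steps rather than executing them.

```shell
#!/bin/bash
# Sketch of a test refresh driver. All names below (TESTDB, paths,
# script and parameter file names) are hypothetical placeholders.
set -euo pipefail            # abort the refresh at the first failed step

DRY_RUN="${DRY_RUN:-1}"      # default to printing the steps, not running them

run () {
  # Run a step, or just report it in dry-run mode.
  if [ "$DRY_RUN" = "1" ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

run expdp system parfile=env_tables.par                # environment-specific tables
run sqlplus -S / as sysdba @gen_password_resets.sql    # capture ALTER USER ... BY VALUES
run srvctl stop database -d TESTDB                     # shut down the test database
run rm -rf /u02/oradata/TESTDB                         # remove the old database files
run orapwd file=/u01/app/oracle/dbs/orapwTESTDB force=y  # new password file
run rman cmdfile=duplicate_testdb.rman                 # the active duplicate itself
run sqlplus -S / as sysdba @post_refresh_fixups.sql    # dir objects, db links, lockdown
run impdp system parfile=env_tables.par                # re-import the tables from step 1
```

Running it with `DRY_RUN=1` first lets you sanity-check the step list before letting it loose on a real system.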

What’s the alternative? I perform all the same tasks individually, but have to sit there waiting for each step to finish before I can move on to the next. No doubt, during this time I will be distracted by phone calls or colleagues asking me questions, which drastically increases the risk of human error.

When I talk about automation, I’m not talking about some Earth shattering AI system. I’m talking about scripting basic tasks to make myself more efficient.

At times you have to draw a line. There is no point making your automation too clever because it just becomes a rod for your own back. I’m a DBA, not a software house. This is what people are really warning about, which I did not really make clear in my first post. If something is liable to change each time you do it, you are better having a written procedure to work from, reminding you of the necessary steps and how to determine what needs to be done. You can’t become a slave to automation.

Cheers

Tim…

Detective? Crime Writer? DBA? Which are you?

The DBA role can sometimes feel like a mix of detective, crime writer and DBA all thrown together. What do I mean by that? When you hit some problems you have to play detective, trying to find clues to what is going on. Once you’ve started to gather clues, you have to play crime writer and put them together to form a coherent story that explains what is going on. Once you have the outline of your crime story you can start looking at the facts again and see if they fit with your story. If they do, your story may just be correct. If they don’t, you probably need to check the accuracy of the facts and do some rewriting of the story until the two things fit together. Once things seem to fit, you can then get busy trying to arrest the villain, or fix the problem. 🙂 This process is often necessary as part of root cause analysis.

I’ll use a recent case to highlight what I mean. I’m going to simplify things a little or the post will be too long. 🙂

Detective

We recently had some intermittent, but very severe performance problems on a system, resulting in the application effectively hanging for some time. As usual, it’s the DBA’s job to prove it’s not the fault of the database. 🙂 A bit of detective work came up with the following facts:

  • Moderate CPU usage on the application and DB servers.
  • Gradual reduction in free memory on both servers.
  • Over time the systems started to swap.
  • Gradual increase in number of application processes connecting to the database.
  • Lots of I/O waits on the database server.

Crime Writer

So let’s switch to crime writer and think of some stories that could explain this.

Story 1

High load on the application causes lots of database connections, eventually swamping the servers.

Does it fit the facts? The reduction in free memory, eventually leading to swap could be caused by an increase in numbers of processes on the servers. Lots of processes all doing I/O on the server will probably increase the amount of I/O waits as they fight with each other for the disks. That sounds promising.

How do we confirm it? Use the AWR reports to compare the load profile (amongst other things) during the problem period with other times of day, days of the week, the same time and day last week etc.
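For the record, the comparison itself doesn’t need anything exotic. Oracle ships a “compare periods” report for exactly this job (a pointer only; it prompts for the baseline and comparison snapshot IDs interactively, and like all of AWR it needs the Diagnostics Pack licence):

```shell
# The AWR diff report compares two snapshot ranges side by side.
# Run it in SQL*Plus on the database server as a suitably privileged user.
AWR_DIFF_SCRIPT='@?/rdbms/admin/awrddrpt.sql'
echo "In sqlplus, run: $AWR_DIFF_SCRIPT"
```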

Conclusion: The load on the system was not substantially different from other times when the system was running fine. The story doesn’t fit together.

Story 2

Slow I/O is reducing response time. As new requests come in, the backlog of work is getting bigger…

Does it fit the facts? In checking out the AWR reports looking for confirmation of the previous story, we noticed the average times of certain physical I/O operations were about 30ms. Normally the average is <1ms. Checking the event histogram section of the AWR report, we could see a very wide spread of timings for physical I/O operations, including some in excess of 1 second. After a few minutes we started to see “aiowait timed out” warnings (MOS Doc ID 222989.1) in the alert log, suggesting some of those I/O operations were taking longer than 10 minutes. 🙁
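Outside of the AWR report, you can eyeball the same spread directly. A sketch of the sort of query I mean, run as a privileged user on the instance (the event name here is just an example):

```shell
# Spread of single-block read times since instance startup.
# v$event_histogram buckets waits into power-of-two millisecond bands.
HISTO_SQL="
select event, wait_time_milli, wait_count
from   v\$event_histogram
where  event = 'db file sequential read'
order  by wait_time_milli;
"

if command -v sqlplus >/dev/null 2>&1; then
  echo "$HISTO_SQL" | sqlplus -S / as sysdba
else
  echo "No sqlplus here; run the query on the database server."
fi
```

A healthy system has almost everything in the lowest buckets; rows with big wait counts in the 512ms+ buckets tell the same story the AWR histogram section did.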

Now we have a new fact to add to the previous list:

  • Some of the individual I/O waits are extremely long!

We know the application layer of this application is not too clever. If a request comes in and there is an idle application process available, it will be reused. If no idle application process is available, a new one is spawned to process the request. The only limit on this is the max number of processes for that user at the OS level.

If the disk I/O is slow, application requests will take a long time to complete. If application requests are taking a long time, the likelihood of new application requests finding an idle application process to reuse is reduced, so the incidence of new processes being spawned increases. As new processes are spawned, we see a gradual increase in memory usage, leading eventually to swap. At the same time, we see an increase in database connections, which require processes, which use memory… You can see where this is going…

Conclusion: This story does seem to fit the facts. Essentially, memory, swap and the number of processes are the symptom, not the cause of the problem. The root cause could be the I/O performance issue.

Catching the Criminal

Now I said this was an intermittent problem. The next time it occurred I was prepared and watched events unfold using Luca Canali’s nifty Latency Heat Map Visualization, which prompted this quick post. The result of this was we had pretty solid evidence to suggest the problems we were experiencing were nothing to do with the application or database, but were a result of poor I/O performance. The baton was passed to the system administrators, network and storage team to try and bottom out the problem. That process is still ongoing, but the smoking gun seems to be the storage network layer.

Conclusion

Over time you build up a level of experience that allows you to spot patterns and quickly get to the root of problems. To observers it can seem almost mystical, which is why we grunt DBAs look at those database performance gurus with awe, but it all comes down to root cause analysis. The more you do this stuff, the better you get at it. Stop doing it for a while and you get rusty.

Detective? Crime Writer? DBA? Which are you? You should be all three. 🙂

Cheers

Tim…

Real DBAs use Grid Control…

Hopefully the title got your attention. Of course it could have read, “Real Linux Sysadmins use Cobbler and Puppet…”, or any number of comparable statements and products. The point being, there is a gradual evolution in the way we approach tasks, and if we don’t move with it we marginalise ourselves to the point where we are so unproductive we cease to be of use.

A few years ago I was doing a lot of Linux installations and I got sick of running around with CDs, so started doing network installations to save time. I’ve been doing loads of installs on VMs at home recently, so I started doing PXE Network Installations, which saved me even more time. As a result of the article I wrote about that, Frits Hoogland pointed me in the direction of Cobbler, which makes PXE installations real easy (once you get to grips with it). I’m not a sysadmin, so why do I care? Even when I’m installing and running a handful of VMs at home I can see productivity gains by using some of these tools. Imagine the impact in a data-center!

So back to Grid Control. Does anyone remember the days when you kept a “tail -f” on your alert log? At one site I used to have a CDE workspace on an X station just running tails. Then the number of instances got too big, so I used to scan through the alert logs each day to look for issues. The next step was to use shell scripts to check for errors and mail me. This was a pain at one site where I was using Solaris, HP-UX and Windows, which meant I needed three solutions. Then the Oracle 9i Enterprise Manager with the Management Server came into my life. All of a sudden it could manage my alert logs and I could assume everything was fine ( 🙂 ) unless I got a notification email. This feature alone sold me on the 9i management server.
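The scripted version of that alert log check doesn’t need to be clever. Something along these lines did the job (the mail recipient is a made-up placeholder, and the demo at the bottom runs against a throwaway file rather than a real alert log):

```shell
#!/bin/bash
# Incremental alert log check: only scan lines added since the last run.
# The mail recipient is a hypothetical placeholder.
MAILTO="dba-team@example.com"

check_alert_log () {
  local log="$1" state="$2" last=0 total
  [ -s "$state" ] && last="$(cat "$state")"
  total="$(wc -l < "$log")"
  # Report only new ORA- errors; remember how far we got.
  tail -n "+$((last + 1))" "$log" | grep "ORA-" || true
  echo "$total" > "$state"
}

# Demo against a throwaway file instead of a real alert log.
demo_log="$(mktemp)" demo_state="$(mktemp)"
printf 'Completed: ALTER DATABASE OPEN\nORA-00600: internal error\n' > "$demo_log"

errors="$(check_alert_log "$demo_log" "$demo_state")"
if [ -n "$errors" ]; then
  # The real script pipes this to mail instead of echoing it:
  #   echo "$errors" | mail -s "Alert log errors on $(hostname)" "$MAILTO"
  echo "Would mail: $errors"
fi
```

Stick it in cron per database and you only hear about errors you haven’t already seen. The pain, as I said, was maintaining three flavours of this across Solaris, HP-UX and Windows.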

Back then, being a DBA and admitting using Enterprise Manager was a little like announcing to the world you were into cross dressing. 🙂 Time has moved on, the product name has changed and so has its functionality, but essentially it’s still doing the same thing, which is reducing the effort needed to manage databases (and other things). The difference is that rather than managing 40 instances, teams are now managing thousands of instances.

Of course, none of this is new. I guess it’s just been brought into focus by a few things that have happened to me recently, like the PXE/Cobbler thing, the recent demise of my Grid Control VM at home and the constant talk of cloud computing and SaaS etc.

Specialists and performance consultants have the time to obsess over minute detail. The day-to-day DBAs and sysadmins have to churn through work at a pace, with reliable and reproducible results. Failing to embrace tools, whatever they are, to aid this is career suicide.

Cheers

Tim…

The Oracle DBA… A dying breed?

I’ve updated the site notes for my OpenWorld unconference session. If you fancy having a look, it’s on the Oracle Wiki.

It’s quite hard to summarize the conversation, but I think I got the gist of it down. Remember, the notes are trying to convey the opinions of the people present. I’m not saying this was a representative sample of people, but the opinions and attitudes were certainly quite interesting to me.

Cheers

Tim…

Update: The Oracle Wiki page was removed.

It’s not simple, so don’t claim it is!

Rant Alert. The following is an unreasoned attack on the IT community in order to vent my frustration. I’m not claiming it makes any sense or it’s factually correct. It’s just how I feel today. Maybe I’ll feel different tomorrow…

I can’t help feeling that companies like Oracle are doing the IT world a major disservice by trying to make out that their products are easy to use. I have a quick newsflash… They are not!

This post is really a response to two things:

  1. My current work situation.
  2. Some of the questions I field on my forum.

From a work perspective, the mass exodus of people from my current company has left me having to deal with bits of technology that aren’t really my bag. It gets doubly annoying when I’m having to use bad support services to help me do really basic tasks. If software and hardware vendors were honest and made customers aware that they would need trained professionals to deal with this crap, perhaps people like me wouldn’t be left fumbling in the dark, trying to pick up the pieces.

From the Oracle forum side of things, I’ve really noticed a shift over the last few years and I’ve written about it before. The same type of questions are being asked as they always were. The difference is that in the past these questions were being asked by people trying to learn the technology. Now they seem to come from people who are employed as DBAs and developers by companies. I don’t believe the intellectual capacities of people have dropped over the years. I just think companies are employing under-skilled people to save money, or expecting people to cover roles they are not qualified to do. You wouldn’t let an electrician fix your plumbing, so why would you let this happen?

I don’t claim to know the answers, but I can see that the constant barrage of “point-and-click”, “intuitive” and “self-tuning” marketing messages are leading people to believe they don’t need qualified staff, and the result is a whole bunch of people asking how to recover their production databases from incomplete backups.

IT is getting more complicated and the range of skills needed in a company is getting bigger by the year. Companies need to be made to understand this or they will constantly be finding themselves in the shit!

Cheers

Tim…