Why Automation Sucks

I’ve written a number of posts about how important automation is (here), but thought I would mention something that happened on Friday…

If you’ve followed the blog you will know I recently released a hands-off installation of 18c RAC. On Friday I received a pull request on GitHub suggesting a change to the GI software installation, switching the public and private network device names from “enp0s8” & “enp0s9” to “eth1” & “eth2”. I hadn’t seen ethernet device names like this since RHEL5/OL5, so I was convinced it was a mistake. I was in the middle of writing a comment on the pull request, but I thought I’d better do my homework, rather than making an assumption.

When I got home I fired up the existing RAC on my Linux server and the device names were “enp0s8” & “enp0s9” as I expected. I fired up a new DNS server on my laptop and the public network device name was “enp0s8”. At this point I was convinced I was correct, but I thought I’d better just make sure. I was using the latest VirtualBox and Vagrant releases, but I noticed the output from the build said there was a newer version of the “bento/oracle-7.5” Vagrant box. I did a “vagrant box update”, “vagrant destroy -f” and “vagrant up” and the new DNS server was built with a public network device name of “eth1”.

I went back to my Linux server, did the same and the device names were eth1 and eth2 on there too. So the “bento/oracle-7.5” box update had caused the ethernet device naming to change. As a result of this I accepted the pull request, made a similar change to another config file, then destroyed and started a rebuild of my RAC. I went out to visit some mates and watched the first two episodes of Ozark season 2. By the time I came back I had a new RAC up and running.

Why does this mean automation sucks? Well it doesn’t really, but it does show you how your automation can be broken by things outside of your control. That could be a change to a Vagrant box, a Docker image, the Kickstart file from your system admins, or even a cloud service. You have to keep on top of this stuff. You can’t just define it and forget about it. Having said that, I would rather find a problem like this in an automated build than in a manual process… 🙂
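One way to catch this sort of change early is a small pre-flight check at the start of a build. The following is a minimal sketch, assuming a Linux guest where network devices appear under /sys/class/net. The function name check_ifaces and the NET_DIR override are made up for illustration and are not part of the RAC build.

```shell
# Hypothetical pre-flight check: fail early if an expected network device is missing.
# NET_DIR can be overridden for testing; on Linux the devices are listed in /sys/class/net.
check_ifaces() {
  net_dir="${NET_DIR:-/sys/class/net}"
  missing=0
  for name in "$@"; do
    if [ ! -e "$net_dir/$name" ]; then
      echo "Missing network device: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling something like "check_ifaces eth1 eth2 || exit 1" at the top of a provisioning script would stop the build with a clear message, rather than letting it fail part way through the software installation.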

Cheers

Tim…

 

Why Automation Matters : You’re only a tweak away!

Once you start on the automation path it becomes progressively easier to automate new things because you will build up a collection of stuff you can tweak to create the new stuff. Here’s an example…

A little while ago I did a hands-off build of 18c RAC using VirtualBox and Vagrant (here). I had to solve a few little problems, but for the most part it was piecing together a bunch of stuff I already had, like silent installations and database creations, so no big drama. Probably the most complicated thing was deciding how I wanted to organise things, which I’m sure will change over time…

Fast forward to a few days ago when I wanted to play around with 18c Data Guard. I actually took the RAC build and used that as the basis of this little project. Obviously some things were chopped out and some things were added, but a lot of it was just reused, which saved a bunch of thinking and hassle.

Once I had the 18c build working, a couple of changes in the config files and I had a 12cR2 build up and running (here). Some config file changes and a couple of minor scripting changes and I had a 12cR1 build up and running. You get the picture.

Of course you will occasionally have to do something that constitutes a step change, or you will decide to take a completely new approach* and have to go back to basics, but a lot of the time you are only a tweak away from the next automation.

Cheers

Tim…

* I’m playing around with Ansible at the moment, so maybe I’ll end up redoing these using Ansible. Maybe not. We’ll see. 🙂

Why Automation Matters : ITIL

ITIL is quite a divisive subject in the geek world. Once the subject is raised most of us geeks start channelling our inner cowboy/cowgirl thinking we don’t need the shackles of a formal process, because we know what we are doing and don’t make mistakes. Once something goes wrong everyone looks around saying, “I didn’t do anything!”

Despite how annoying it can seem at times, you need something like ITIL for a couple of reasons:

  • It’s easy to be blinkered. I see so many people who can’t see beyond their own goals, even if that means riding roughshod over other projects and the needs of the business. You need something in place to control this.
  • You need a paper trail. As soon as something goes wrong you need to know what’s changed. If you ask people you will hear a resounding chorus of “I’ve not changed anything!”, sometimes followed by, “… except…”. It’s a lot easier to get to the bottom of issues if you know exactly what has happened and in what order.

So what’s this got to do with automation? The vast majority of ITIL related tasks I’m forced to do should be invisible to me. Imagine the build and deployment of a new version of an application to a development server. The process might look like this.

  • Someone requests a new deployment manually, or it is done automatically on a schedule or triggered by a commit.
  • A new deployment request is raised.
  • The code is pulled from source control.
  • The build is completed and the result of the build is recorded in the deployment request.
  • Automated testing is used to test the new build. Let’s assume it’s all successful for the rest of the list. The results of the testing are recorded in the deployment request.
  • Artefacts from the build are stored in some form of artefact store.
  • The newly built application is deployed to the application server.
  • The result of the deployment is recorded in the deployment request.
  • Any necessary changes to the CMDB are recorded.
  • The deployment request is closed as successful.

None of those tasks require a human. For a development server the changes are all pre-approved, and all the ITIL “work” is automated, so you have the full paper trail, even for your development servers.
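The process above can be sketched as a single script that records every outcome as it goes. This is a minimal sketch under some loud assumptions: the “deployment request” is just a log file, and every step function is a placeholder with a made-up name. A real implementation would call your service desk tool and build tooling instead.

```shell
# Assumption: the "deployment request" is a plain log file for this sketch.
REQUEST_LOG="${REQUEST_LOG:-deployment_request.log}"

# Append a timestamped entry to the deployment request.
record() {
  printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$REQUEST_LOG"
}

# Run one step and record whether it worked.
run_step() {
  step_name="$1"
  shift
  if "$@"; then
    record "$step_name: SUCCESS"
  else
    record "$step_name: FAILED"
    return 1
  fi
}

# Placeholder steps (hypothetical names). Replace the bodies with the real commands.
pull_code()       { echo "pull code from source control"; }
build_app()       { echo "build the application"; }
run_tests()       { echo "run the automated tests"; }
store_artefacts() { echo "copy artefacts to the artefact store"; }
deploy_app()      { echo "deploy to the application server"; }
update_cmdb()     { echo "record changes in the CMDB"; }

# Raise the request, run the steps in order, stop at the first failure,
# and close the request with the overall outcome.
deploy() {
  record "request raised"
  if run_step "checkout" pull_code &&
     run_step "build" build_app &&
     run_step "test" run_tests &&
     run_step "store artefacts" store_artefacts &&
     run_step "deploy" deploy_app &&
     run_step "cmdb" update_cmdb
  then
    record "request closed: SUCCESS"
  else
    record "request closed: FAILED"
    return 1
  fi
}
```

Running "deploy" leaves a timestamped entry for each stage in the log, which is the paper trail, without anyone having to fill in a form.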

It’s hard to be annoyed by ITIL if most of it is invisible to you! 🙂

IMHO the biggest problem with ITIL is bad implementation: overcomplication, emphasis on manual operations and lack of continuous improvement. If ITIL is hindering your progress you are doing it wrong. The same could be said about lots of things. 🙂 One way of solving this is to automate the problem out of existence.

Cheers

Tim…

Why Automation Matters : Patching and Upgrading

As I said in a recent post, you know you are meant to, but you don’t. Why not?

The reasons will vary a little depending on the tech you are using, but I’ll divide this answer into two specific parts. The patch/upgrade process itself and testing.

The Patch/Upgrade Process

I’ve lived through the bad old days of Oracle patching and upgrades and it was pretty horrific. In comparison things are a lot better these days, but they are still not what they should be in my opinion. I can script patches and upgrades, but I shouldn’t have to. I’m sure this will get some negative feedback, but I think people need to stop navel gazing and see how simple some other products are to deal with. I’ll stop there…

That said, I don’t think patches and upgrades are actually the problem. Of course you have to be careful about limiting down time, but much of this is predictable and can be mitigated.

One of the big problems is the lack of standardisation within a company. When every system is unique, automating a patch or upgrade procedure can become problematic. You have to include too much logic in the automation, which can make the automation a burden. What the cloud has taught us is you should try to standardise as much as possible. When most things are the same, scripting and automation gets a lot easier. How do you guarantee things conform to a standard? You automate the initial build process. 🙂

So if you automate your build process, you actually make automating your patch/upgrade process easier too. 🙂

The app layer is a lot simpler than the database layer, because it’s far easier to throw away and replace an application layer, which is what people aim to do nowadays.

Testing

Testing is usually the killer part of the patch/upgrade process. I can patch/upgrade anything without too much drama, but getting someone to test it and agree to moving it forward is a nightmare. Spending time to test a patch is always going to lose out in the war for attention if there is a new spangly widget or screen needed in the application.

This is where automation can come to the rescue. If you have automated testing not only can you move applications through the development pipeline quicker, but you can also progress infrastructure changes, such as patches and upgrades, much quicker too, as there will be a greater confidence in the outcome of the process.

Conclusion

Patching and upgrades can’t be considered in isolation where automation is concerned. It doesn’t matter how quickly and reliably you can patch a database or app server if nobody is ever going to validate it is safe to progress to the next level.

I’m not saying don’t automate patching and upgrades, you definitely should. What I’m saying is it might not deliver on the promise of improved roll-out speed as a chain is only as strong as the weakest link. If testing is the limiting factor in your organisation, all you are doing by speeding up your link in the chain is adding to the testing burden down the line.

Having said all that, at least you will know your stuff is going to work and you can spend your time focusing on other stuff, like maybe helping people sort out their automated testing… 🙂

Cheers

Tim…

Why Automation Matters : Reliability and Confidence

In my previous post on this subject I mentioned the potential for human error in manual processes. This leads nicely into the subject of this post about reliability and confidence…

I’ve been presenting at conferences for over a decade. Right from the start I included live demos in those talks. For a couple of years I avoided them to make my life simpler, but I’ve moved back to them again as I feel in some cases showing something has a bigger impact than just saying it…

The Problem

One of the stressful things about live demos is they require something to run the demo on, and what happens if that’s not in the state you expect it to be?

I had an example of this a few years ago. I was in Bulgaria doing a talk about CloneDB and someone asked me a question at the end of the session, so I trashed my demo to allow me to show the answer to their question. I forgot to correct the situation, so when I came to do the same demo at UKOUG it went horribly wrong, which led someone on Twitter to say “session clone db is a mess”, and they were correct. It was. The problem here was I wasn’t starting from a known state…

This is no different for us developers and DBAs out in the real world. When we are given some kit, we want to know it’s in a consistent state, but it might not be for a few reasons.

Human Error

The system was created using a manual build process and someone made a mistake. I think almost every system coming out of a manual process has something screwed up on it. I make mistakes like this too. The phone rings, you get distracted and you come back to the original task and you forget a step. You can minimise this with recipes and checklists, but we are human. We will goof up, regardless of the measures we put in place.

Sometimes it’s easy to find and fix the issue. Sometimes you have to step through the whole process again to identify the issue. For complex builds this can take a long time, and that’s all wasted time.

Changes During the Lifespan

The delivered system was perfect, but then it was changed during its lifespan. Here are a couple of examples.

App Server: Someone is diagnosing an issue and they change some app server parameters and forget to set them back. Those don’t fix the current issue, but they do affect the outcome of the next test. Having completed the testing successfully, the application gets moved to production and fails, because UAT and Live no longer have the same environment, so the outcomes are not comparable or predictable.

Database: Several developers are using a shared development database. Each person is trying to shape the data to fit their scenario, and in the process trashing someone else’s work. The shared database is only refreshed a handful of times a year, so these inconsistencies linger for a long time. If the setup of test data is not done carefully you can add logical corruptions to the data, making it no longer representative of a real situation. Once again the outcomes are not comparable or predictable.
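Drift like the app server example can at least be detected with a simple comparison. The following is a minimal sketch, assuming the settings for each environment can be dumped to a plain key=value file. The function name config_drift and the file names in the usage note are hypothetical.

```shell
# Compare two key=value config files, ignoring line order.
# Returns 0 when they match. Prints the differences and returns non-zero when they don't.
config_drift() {
  a_sorted="$(mktemp)"
  b_sorted="$(mktemp)"
  sort "$1" > "$a_sorted"
  sort "$2" > "$b_sorted"
  status=0
  diff "$a_sorted" "$b_sorted" || status=$?
  rm -f "$a_sorted" "$b_sorted"
  return "$status"
}
```

Something like "config_drift uat.properties live.properties" before a test cycle would flag a forgotten parameter change while it is still cheap to fix.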

The Solution?

I guess from the title you already know this. Automation.

Going back to my demo problem again, I almost had a repeat of this scenario at Oracle Code: Bangalore a few months ago. I woke up the day of the conference and did a quick run through my demos and something wasn’t working. How did I solve it? I rebuilt everything. 🙂

I do most of my demos using Docker these days, even for non-Docker stuff. I use Oracle Linux 7 and UEK4 as my base OS and kernel, so I run Docker inside a VirtualBox VM. The added bonus is I get a consistent experience regardless of underlying host OS (Windows, macOS or Linux). So what did the rebuild involve? From my laptop I just ran these commands.

vagrant destroy -f
vagrant up

I subsequently connected to the resulting VM and ran this command to build and run the specific containers for my demo.

docker-compose up

What I was left with was a clean build in exactly the condition I needed it to be to do my demos. Now I’m not saying I wasn’t nervous, because not having working demos on the morning of the conference is a nerve wracking thing, but I knew I could get back to a steady state, so this whole issue resulted in one line in the blog post for that day. 🙂 Without automation I would be trying to find and fix the problem, or manually rebuilding everything under time pressure, which is a sure fire way to make mistakes.
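The rebuild commands above could be wrapped into a single function, so there is only one thing to remember under pressure. This is a sketch, not the script used for the demos. It assumes the compose file sits in /vagrant (the default Vagrant synced folder) and adds "-d" so the containers run in the background. The function name and the VAGRANT override are made up for illustration.

```shell
# Destroy the VM, rebuild it, then start the demo containers inside it.
# VAGRANT can be overridden (e.g. VAGRANT=echo) to see what would run without running it.
rebuild_demo() {
  vagrant_cmd="${VAGRANT:-vagrant}"
  "$vagrant_cmd" destroy -f &&
  "$vagrant_cmd" up &&
  "$vagrant_cmd" ssh -c "cd /vagrant && docker-compose up -d"
}
```

Each command only runs if the previous one succeeded, so a failed "vagrant up" doesn’t leave you trying to start containers on a VM that isn’t there.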

I do some demos on Oracle Database Cloud Service too. When I recently switched between trial accounts my demo VM was lost, so I provisioned a new 18c DBaaS, uploaded a script and ran it. Setup complete.

Confidence

Automation is quicker. I think we all get that. Having a reliable build process means you have the confidence to throw stuff away and build clean at any point. Think about it.

  • Developers replacing their whole infrastructure whenever they want. At a minimum once per sprint.
  • Deployments to environments not just deploying code, but replacing the infrastructure with it.
  • Environments fired up for a single purpose, maybe some automated QA or staff training, then destroyed.
  • When something goes wrong in production, just replace it. You know it’s going to work because it did in all your other environments.

Having reliable automation brings with it a greater level of confidence in what you are delivering, so you can spend less time on unplanned work fixing stuff and focus more on delivering value to the business.

Tooling

The tooling you choose will depend a lot on what you are doing and what your preferences are. For example, if you are focusing on the RDBMS layer, it is unlikely you will choose Docker for anything other than little demos. For some 3rd party software it’s almost impossible to automate a build process, so you might use gold images as your starting point or partially automate the process. In some cases you might use the cloud to provide the automation for you. The tooling is less important than the mindset in my opinion.

Cheers

Tim…

Why Automation Matters : Lost Time

Sorry for stating what I hope is the obvious, but automation matters. It’s mattered for a long time, but the constant mention of Cloud and DevOps over the last few years has thrown even more emphasis on automation.

If you are not sure why automation matters, I would just like to give you an example of the bad old days, which might be the current time for some who are still doing everything manually, with separate teams responsible for each stage of the process.

Lost Time : Handover/Handoff Lag

In the diagram below we can see all the stages you might go through to deploy a new application server. Every time the colour of the box changes, it means a handover to a different team.

So there are a few things to consider here.

  • Each team is likely to have different priorities, so a handover between teams is not necessarily instantaneous. The next stage may be waiting on a queue for a long time. Potentially days. Don’t even get me started on things waiting for people to return from holiday…
  • Even if an individual team has created build scripts and has done their best to automate their tasks, if it is relying on them to pick something off a queue to initiate it, there will still be a handover delay.
  • When things are done manually people make mistakes. It doesn’t matter how good the people are, they will mess up occasionally. That is why the diagram includes a testing failure, and the process being redirected back through several teams to diagnose and fix the issue. This results in even more work. Specifically, unplanned work.
  • Manual processes are just slower. Running an installer and clicking the “Next” button a few times takes longer than just running a script. If you have to type responses and make choices it’s going to take even more time, and don’t forget that bit about human error…

Let’s contrast this to the “perfect” automated setup, where the request triggers an automated process to deliver the new service.

In this example, the request initiates an automated workflow that completes the action and delivers the finished product without any human intervention along the path. The automation takes as long as it takes, and ultimately has to do most of the same work, but there is no added handover lag in this process.

I think it’s fair to say you would be expecting a modern version of this process to complete in a matter of minutes, but I’ve seen the manual process take weeks or even months, not because of “work time”, but because of the idle handover time and human processes involved…
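The arithmetic behind this is simple enough to put in a function. This is an illustrative sketch only: the model (elapsed time is hands-on work plus the number of handovers multiplied by the average time spent in each queue) is a simplification, and the inputs are whatever you measure in your own organisation.

```shell
# Elapsed hours = hands-on work + (number of handovers x average hours waiting in each queue).
# All three inputs are parameters supplied by the caller, not measurements.
elapsed_hours() {
  work_hours="$1"
  handovers="$2"
  queue_hours="$3"
  echo $(( $work_hours + $handovers * $queue_hours ))
}
```

With made-up inputs of 8 hours of work, 5 handovers and 24 hours in each queue, "elapsed_hours 8 5 24" prints 128. Take the queues away with "elapsed_hours 8 5 0" and it prints 8. The work is the same, but the idle time has gone.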

They Took Our Jobs!

At first glance it might seem like this is a problem if you are employed in any of the teams responsible for doing the manual tasks. Surely the automation is going to mean job cuts right? That depends really. In order to fully automate the delivery of any service you are going to have to design and build the blocks that will be threaded together to produce the final solution. This is not always simple. Depending on your current setup this might mean having fewer, more highly skilled people, or it might require more people in total. It’s impossible to know without knowing the requirements and the current staffing levels. Also, cloud provides a lot of the building blocks for you, so if you go that route there may be less work to do in total.

Even if the number of people doesn’t change as part of the automation process, you are getting work through the door quicker, so you are adding value to the business at a higher rate. From a DevOps perspective you have not added value to the business until you’ve delivered something to them. All the hours spent getting part of the build done equate to zero value to the business…

But we are doing OK without automation!

No you’re not! You’re drowning! You just don’t know it yet!

I never hear people saying they haven’t got enough projects waiting. I always hear people saying they have to shelve things because they don’t have the staff/resources/time to do them.

As your processes get more efficient you should be able to reallocate staff to projects that add value to the business, rather than wasting their lives on clicking the “Next” button.

If your process stays inefficient you will always be saying you are short of staff and every new project will require yet another round of internal recruitment or outsourcing.

Is this DevOps?

I’m hesitant to use the term DevOps as it can be a bit of a divisive term. I struggle to see how anyone who understands DevOps can’t see the benefits, but I think many people don’t know what it means, and without the understanding the word is useless…

Certainly automation is one piece of the DevOps puzzle, but equally if you have company resistance to the term DevOps, feel free to ignore it and focus on trying to sell the individual benefits of DevOps, one of which is improved automation…

Cheers

Tim…

ODC Appreciation Day : Silent Installation and Configuration (Automation) : #ThanksODC

Here is my entry for the Oracle Developer Community ODC Appreciation Day (#ThanksODC).

I’ve been mentioning automation a lot recently, both in relation to the cloud and on-prem. The OpenWorld announcements about the Autonomous Database service are not the first thing Oracle has done to ease automation of repetitive tasks. In fact, Oracle has quite a long history of making automation of installation and configuration easy.

I’m not sure what version introduced silent installations of the database, but I first wrote about them when using Oracle 9i (here), with the article changing a lot over the years. In addition to making installations faster, more repeatable and less error prone, they are also important these days if you are using a cloud provider for Infrastructure as a Service (IaaS), since using X emulation to perform tasks can be super-slow. Over the years I’ve also written about silent installations of WebLogic, Oracle Forms, ODI and OBIEE to name but a few.

In addition to installations, Oracle has made silent configuration possible too. Running the Database Configuration Assistant (DBCA) in silent mode is pretty simple (here). WebLogic Scripting Tool (WLST) is not always easy, but it is a really powerful way to script build processes for WebLogic servers (here). If you are using Enterprise Manager Cloud Control, you will find an API for pretty much everything, allowing you to script using EMCLI (here).
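As a rough illustration of DBCA in silent mode, here is a minimal sketch wrapped in a function. The function name create_db, the SYS_PASSWORD and SYSTEM_PASSWORD variables, and the DBCA override are assumptions made for this example. The option list is deliberately short, so check it against "dbca -help" for your version before relying on it.

```shell
# Create a database with DBCA in silent mode, using the general purpose template.
# Passwords come from the environment so they are not stored in the script.
# DBCA can be overridden (e.g. DBCA=echo) to print the arguments instead of running them.
create_db() {
  db_name="$1"
  "${DBCA:-dbca}" -silent -createDatabase \
    -templateName General_Purpose.dbc \
    -gdbname "$db_name" \
    -sid "$db_name" \
    -responseFile NO_VALUE \
    -characterSet AL32UTF8 \
    -sysPassword "$SYS_PASSWORD" \
    -systemPassword "$SYSTEM_PASSWORD"
}
```

Once it is in a function like this, creating the same database on a second server is one call, with no GUI screens and no X emulation.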

You can find a number of articles I’ve written related to silent installation and configuration using the links above, or grouped under this section of my website.

A good knowledge of this subject is important if you want to start checking out Docker, because you will be doing silent builds and configuration for everything.

When you are learning something new it is nice to use GUI screens, as they often feel a little simpler at first and sometimes give you a little more context about what you are doing. Once you’ve covered the basics you should really switch to scripting, as it will make you more efficient. When I first started to manage WebLogic servers I resisted the switch to using WLST for quite some time. It seemed a little complicated and I was in denial until Lonneke Dikmans persuaded me to try it. Once I got into it I never looked back! 🙂

To summarise the advantages of scripting your installations and configuration, they are:

  • Faster.
  • More reliable.
  • More repeatable.
  • Work fine on the cloud and in Docker.
  • Easily maintainable and can be version controlled.

If you’re not using this stuff already, do yourself a favour and give it a go. You will thank yourself!

Cheers

Tim…

The only way is automation! (update)

I was a little surprised by the reaction I got to my previous post on this subject. A number of people commented about the problems with automation and many pointed to this very appropriate comic on the subject.

I can draw one of two conclusions from this.

  1. My definition of automation of tasks is very much different to other people’s.
  2. It is common for DBAs and middle tier administrators to do everything by hand all the time.

I’m really hoping the answer is option 1, because I think it would be really sad if being a DBA has degenerated to the point where people spend their whole life doing tasks that could be easily scripted.

So what do I mean when I speak about automation? Most of the time I’m talking about basic scripting. Let’s take an example I went through recently, which involved cloning a database to refresh a test system from production. What did this process entail?

  • Export a couple of tables, that contain environment specific data.
  • Generate a list of ALTER USER commands to reset passwords to their original value in the test system.
  • Shutdown the test database.
  • Remove all the existing database files.
  • Create a new password file.
  • Remove the current spfile.
  • Startup the auxiliary DB using a minimal init.ora file.
  • Do an RMAN duplicate. In this case I used an active duplicate as the DB was relatively small. If this were a backup-based duplicate, it would have required an extra step of copying the backups using SCP to somewhere that could be seen on the test server.
  • Replace some environment-specific directory objects.
  • Unregister the old test database from the recovery catalog.
  • Register the new test database with the recovery catalog.
  • Remove the old physical backups.
  • Drop all database links and recreate database links to point to the correct location for the test system.
  • Reset the passwords to their original values from the old test system.
  • Lock down all users, except those I’ve been asked to leave open.
  • Truncate and import the tables I exported at the start.

None of those tasks are difficult. It requires only a basic knowledge of shell scripting to allow me to start a single shell script and come back later to see my newly refreshed test environment.
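The shape of that single shell script can be sketched as follows. This is a skeleton only. The step names are made up, several of the steps listed above are grouped together, and each body is a placeholder where the real Data Pump, SQL*Plus and RMAN calls would go.

```shell
# Placeholder steps (hypothetical names). Each body would hold the real commands.
export_env_tables() { echo "export environment-specific tables"; }
save_passwords()    { echo "generate ALTER USER commands for the current passwords"; }
prepare_auxiliary() { echo "shut down, remove old files, start the auxiliary instance"; }
rman_duplicate()    { echo "RMAN duplicate from production"; }
fix_catalog()       { echo "unregister and re-register with the recovery catalog"; }
post_clone_fixups() { echo "directory objects, db links, passwords, locks, table import"; }

# The order matters, so the steps are listed once and run in sequence.
STEPS="export_env_tables save_passwords prepare_auxiliary rman_duplicate fix_catalog post_clone_fixups"

# Run every step in order and stop at the first failure, saying which step it was.
refresh_test_db() {
  for step in $STEPS; do
    echo "=== $(date '+%H:%M:%S') $step ==="
    if ! "$step"; then
      echo "Step failed: $step" >&2
      return 1
    fi
  done
  echo "Refresh complete"
}
```

Stopping at the first failure is a deliberate choice. You don’t want the script dropping database links or importing tables into a database that didn’t duplicate properly.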

What’s the alternative? I perform all the same tasks individually, but have to sit there waiting for each step to finish before I can move on to the next. No doubt, during this time I will be distracted by phone calls or colleagues asking me questions, which drastically increases the risk of human error.

When I talk about automation, I’m not talking about some Earth shattering AI system. I’m talking about scripting basic tasks to make myself more efficient.

At times you have to draw a line. There is no point making your automation too clever because it just becomes a rod for your own back. I’m a DBA, not a software house. This is what people are really warning about, which I did not really make clear in my first post. If something is liable to change each time you do it, you are better having a written procedure to work from, reminding you of the necessary steps and how to determine what needs to be done. You can’t become a slave to automation.

Cheers

Tim…

The only way is automation!

Basically I’m a lazy person with a short attention span, so when I’ve done something once, I get kind-of bored of doing it again and again. As a result, automation is a perfect solution to me. Figure out how to do something, script it so I can repeat it easily, then move on! *

In my current role I’ve been a little sloppy about automating things. In part, this is because I’ve been doing such a random variety of things it’s been quite hard to see the patterns of repetitive tasks and it’s been difficult to find the time to actually automate the things I have spotted. The fact I’m currently the only DBA in an organisation that believes DBA stands for Do Bloody Anything, means I’ve been slowly sinking into the weeds of late. Over the last week or two I’ve drawn a mental line in the sand and decided I won’t do anything without scripting it or automating it in some fashion. It requires a lot of discipline to do that when it means potentially missing deadlines, but it really is the only way to work.

Most of what I’ve been doing is standard stuff. Scripting cloning procedures, so I can refresh an environment by starting a script and leaving it to do its thing. Making sure developers can get their own application server logs without having to ask me. Making sure applications can be deployed by nominated people without my intervention. Hopefully over the coming weeks I will be able to get things to a state where I can actually see some light at the end of the tunnel.

Cheers

Tim…

* It seems people take everything so literally. 🙂 This does not mean I only ever do something once. The, “figure out how to do something”, part is a process of investigation and testing, not just some random crap that gets thrown together in a script. This is true of any process, manual or scripted. 🙂

Real DBAs use Grid Control…

Hopefully the title got your attention. Of course it could have read, “Real Linux Sysadmins use Cobbler and Puppet…”, or any number of comparable statements and products. The point being, there is a gradual evolution in the way we approach tasks and if we don’t move with it we marginalize ourselves to the point where we are so unproductive we cease to be of use.

A few years ago I was doing a lot of Linux installations and I got sick of running around with CDs, so started doing network installations to save time. I’ve been doing loads of installs on VMs at home recently, so I started doing PXE Network Installations, which saved me even more time. As a result of the article I wrote about that, Frits Hoogland pointed me in the direction of Cobbler, which makes PXE installations real easy (once you get to grips with it). I’m not a sysadmin, so why do I care? Even when I’m installing and running a handful of VMs at home I can see productivity gains by using some of these tools. Imagine the impact in a data-center!

So back to Grid Control. Does anyone remember the days when you kept a “tail -f” on your alert log? At one site I used to have a CDE workspace on an X station just running tails. Then the number of instances got too big, so I used to scan through the alert logs each day to look for issues. The next step was to use shell scripts to check for errors and mail me. This was a pain at one site where I was using Solaris, HP-UX and Windows, which meant I needed three solutions. Then the Oracle 9i Enterprise Manager with the Management Server came into my life. All of a sudden it could manage my alert logs and I could assume everything was fine ( 🙂 ) unless I got a notification email. This feature alone sold me on the 9i management server.
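For anyone who never had to do it, the “check for errors and mail me” stage looked roughly like this. It is a reconstruction for illustration, not one of the original scripts. The function names and the MAILER override are made up, and it assumes a mailx-style “mail” command is available.

```shell
# Print any ORA- errors found in an alert log. Returns non-zero when none are found.
scan_alert_log() {
  grep 'ORA-[0-9]' "$1"
}

# Mail the errors if there are any. Stay silent if the log is clean.
# MAILER can be overridden for testing; it defaults to a mailx-style "mail" command.
check_and_notify() {
  log_file="$1"
  recipient="$2"
  errors="$(scan_alert_log "$log_file")" || return 0
  printf '%s\n' "$errors" | "${MAILER:-mail}" -s "Errors in $log_file" "$recipient"
}
```

Run from cron against each alert log, this gets you the “no news is good news” behaviour. It rescans the whole file every time, so a real version would need to remember where it got to, which is exactly the sort of thing the management server took off your hands.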

Back then, being a DBA and admitting using Enterprise Manager was a little like announcing to the world you were into cross dressing. 🙂 Time has moved on, the product name has changed and so has its functionality, but essentially it’s still doing the same thing, which is reducing the effort needed to manage databases (and other things). The difference is that rather than managing 40 instances, teams are now managing thousands of instances.

Of course, none of this is new. I guess it’s just been brought into focus by a few things that have happened to me recently, like the PXE/Cobbler thing, the recent demise of my Grid Control VM at home and the constant talk of cloud computing and SaaS etc.

Specialists and performance consultants have the time to obsess over minute detail. The day-to-day DBAs and sysadmins have to churn through work at a pace, with reliable and reproducible results. Failing to embrace tools, whatever they are, to aid this is career suicide.

Cheers

Tim…