Why Automation Matters : Keep Your Auditors Happy

We were having some of our systems audited recently. I’ve been part of this sort of thing a few times over the years, but I was pleasantly surprised by a number of the questions being asked during this most recent session. I’ll paraphrase some of their questions and my answers.

  • How do you document your build processes? We have silent build scripts (where possible). The same build scripts are used for each build, with the differences just being environment variables (see the sketch after this list). If a silent build is not possible, we do a semi-silent build, and use screen grabs for the manual bits.
  • How do you keep control of your builds and configuration? Everything goes into a cloud-based Git repository, and we have a local git server as a backup of the cloud service.
  • How do you manage change through your systems? Requests, Incidents, Enhancements and Tasks are raised and placed on a Task Board, which is kind of like a Kanban board, in ServiceNow. Progression of changes to production requires a Change Request (CR), which may need to be agreed by the Change Advisory Board (CAB), depending on the nature of the change.
  • Are changes applied manually, or using automation? This was followed by a long discussion about what we can and can’t automate because of our internal company structure and politics. It also covered the differences between automation of changes to infrastructure and in the development process. 🙂
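As an aside, here’s a minimal sketch of the kind of environment-driven silent build script I’m talking about. The product, file names and variables are illustrative, not our actual build; the point is that the script never changes, only the environment file fed into it does.

#!/bin/bash
# silent_install.sh : illustrative only. The same script runs for every build;
# only the per-environment variables (e.g. dev.env, test.env, prod.env) change.
set -euo pipefail

# Load and export the environment-specific values (hypothetical file passed as the first argument).
set -a
source "${1:?Usage: silent_install.sh <env_file>}"
set +a

# Substitute those values into a response file template for the silent install.
envsubst < install.rsp.template > /tmp/install.rsp

# Run the installer silently using the generated response file.
./runInstaller -silent -responseFile /tmp/install.rsp

# Keep the evidence: the exact inputs used for this build, timestamped for the auditors.
cp /tmp/install.rsp "logs/install_$(date +%Y%m%d%H%M%S).rsp"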

There was a lot more than this, but this is enough to make my point.

The reactions to the answers can be summarised as follows.

  • When we had a repeatable automated process we got a thumbs up.
  • When we had a semi-automated process, where full automation was impractical because of additional constraints, we got a thumbs up.
  • When we had a manual process, we got a thumbs down, because maintaining consistency and preventing human error are really hard when using manual processes.

In a sentence I guess I could say, if you are using DevOps you pass. If you are not using DevOps you fail. 🙂

Now I am coming to this with a certain level of bias in favour of DevOps, and that bias may be skewing my interpretation of the situation somewhat, but that is how it felt to me.

As I said earlier, I was pleasantly surprised by this angle. It’s nice to see the auditors giving me some extra leverage, and it certainly feels like automation is a good way to keep the auditors happy! 🙂

Check out the rest of the series here.

Cheers

Tim…

PS. This is just one part of the whole auditing process.

Autonomous Database : “Hand-tuning doesn’t scale”

I was at a talk by Chris Thalinger at Oracle Code One called “Performance tuning Twitter services with Graal and machine learning”. One of the things he said was, “Hand-tuning doesn’t scale”, and it brought into focus some of the things that have been going on in the Autonomous Database, which is closer to my world. 🙂

In my post called It’s not all about you! I discussed the reaction to a new feature mentioned in the ACE Director briefing. It has been spoken about publicly now, so I guess I’m allowed to mention it by name. The feature in question was Automatic Index Tuning that (insert Safe Harbour slide) might be in Oracle 19c, or in an autonomous database cloud service in the future. Once this feature was mentioned, the list of questions started to pile up, before we even knew what it was or how it was implemented. I mentioned my own reaction to this specific feature, but let’s look at this in the broader sense of autonomous services generally.

As I mentioned, watching Chris’ session brought all this into focus for me. Sorry if I’m stating the obvious, but here goes.

  • Even if I were capable of doing a better job than an automatic performance tuning feature, and I’m not sure I am, that is just me. Is everyone else I work with at my level of understanding or better? Is everyone else who works with the database across the world at my level of understanding or better? If the answer to that is no, then there is a need for feature X, whatever it is.
  • Let’s say I have a group of really skilled people who can do better than automatic feature X. Are they constantly looking at the system, trying to get the best performance possible, or are they working on hundreds or thousands of different targets, and actually spending very little time on each? As their workload grows, which it invariably will, will they be able to spend more or less time looking at each specific system?

I know there are some consultants that get to go in and solve specific problems on specific systems, and maybe those folks will look down on automatic performance tuning features, but I have to look after loads of disparate systems and I get 30 seconds to get something done before I have to move on. I like to think I’m pretty good at Oracle database stuff, but I need all the help I can get if I want to keep things running smoothly.

When a new automatic feature is announced we always get super intense about it, which usually results in a lot of wailing and gnashing of teeth. Sometimes this is for very good reason, as the early incarnations of some features have been problematic, but over time they often become the norm. Think about the following, and what life would be like without them…

Some people reading this may never have experienced life without these features. Believe me, it wasn’t pretty! 🙂

Whether it’s a specific automatic feature, like Automatic Index Tuning, or a grander vision, like the Autonomous Database family of cloud services, this is part of the natural evolution of the database. At *some point* in the future I can see all my databases running on the cloud and all of them being some form of autonomous service, regardless of which cloud provider is running them.

Check out the rest of the series here.

Cheers

Tim…

PS. I hope people understand the spirit of what I’m saying, but I feel the need to include a few statements, as some people on Twitter seemed to get the wrong end of the stick.

  • I’m not saying you can do a rubbish job and leave it up to an automatic tuning feature to fix your crap application. Bad software always runs badly, no matter what you do with it. You might be able to mask some of the problems, but you don’t fix them.
  • I’m not suggesting the development process shouldn’t include proper testing, including unit, integration, UAT and performance testing. See previous point.
  • The more you know about your platform, the better job you can do, even if you have automatic features to help you.

ORDS, SQLcl and SQL Developer 18.3 Updates (VirtualBox, Vagrant, Docker)

A few days ago we got version 18.3 of a bunch of Oracle tools.

Over the weekend I updated some of my VirtualBox and Vagrant builds to include these versions. If you want to play around with them you can see them on GitHub here.
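If you haven’t tried them before, the builds are typically just a clone and a “vagrant up”. The repository URL and directory below are only meant as an illustration; check the README of whichever build you pick for the exact layout and the software you need to download first.

# Illustrative usage; the repo URL and directory names may differ from what you need.
git clone https://github.com/oraclebase/vagrant.git
cd vagrant/ords/ol7_183          # hypothetical path : pick the build you are interested in

# Put the required Oracle software where the README says, then bring the VM up.
vagrant up

# When you're done, throw it all away.
vagrant destroy -f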

I also updated my ORDS Docker container build, which uses both ORDS and SQLcl. You can find this on GitHub here.

I use this container for live demos of ORDS, as well as a demo for my “DBA Does Docker” talk, which I am doing at Oracle OpenWorld this year.

I put the latest versions of SQL Developer and SQLcl on my laptop. I’m doing an analytic functions talk at Oracle Code One this year. The demos use SQLcl on my laptop connecting to Autonomous Transaction Processing (ATP) on Oracle Cloud. I had a little bit of drama with SQLcl on Saturday, which turned out to be PEBCAK. I thought “SET ECHO ON” wasn’t working, but it turned out I had a “login.sql” file in the path that contained “SET TERMOUT OFF”. Once I removed that setting the demos ran fine. 🙂
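If you hit something similar, it’s worth checking for a stray “login.sql” in the current directory or on your SQLPATH before assuming the tool is broken. A rough check like this (not a polished script) would have saved me some head-scratching.

# Look for a login.sql that silently overrides session settings.
# SQL*Plus and SQLcl pick it up from the current directory or SQLPATH.
grep -in "termout" ./login.sql "$SQLPATH"/login.sql 2>/dev/null
# In my case the offending line was:
#   SET TERMOUT OFF
# Removing it made "SET ECHO ON" behave as expected again.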

I’m going to put a freeze on changing my stuff until after OpenWorld and Code One. Honest. 🙂

Cheers

Tim…

VirtualBox and Vagrant : New RAC Stuff and Changes

There were a lot of changes in my Vagrant repository on GitHub last week and over the weekend.

First, I got asked a question about 12.2 RAC and I couldn’t be bothered to run through a manual build, so I took my 18c RAC hands-off build and amended it to create a 12.2 RAC hands-off build. Along the way I spotted a couple of hard-coded bits in the 18c build I hadn’t noticed previously, which of course I fixed. I also had to move the 18c build to a version-specific sub-directory. I think I’ve altered all references to the location.

I went through some of my individual server builds and updated them to use the latest versions of Tomcat 9, Java 11 and APEX 18.2. All that was pretty straightforward.

On Sunday I was running some tests of the builds on my laptop while I was at my brother’s house, and I noticed I was not pulling packages from the yum repositories properly. I ended up adding “nameserver 8.8.8.8” to pretty much all the “/etc/resolv.conf” files inside the VMs. I’m not sure what has changed, as that hasn’t happened before; perhaps it’s something to do with the networking… Anyway, it fixed everything, so happy days.
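In case it helps anyone, the workaround amounts to something like this. It could just as easily go into the Vagrant shell provisioner rather than being run by hand, and 8.8.8.8 is just Google’s public DNS, not a requirement.

# Run inside the guest VM (or from the Vagrant shell provisioner).
# Add a public DNS server as a fallback resolver if it's not already there.
if ! grep -q "nameserver 8.8.8.8" /etc/resolv.conf; then
  echo "nameserver 8.8.8.8" | sudo tee -a /etc/resolv.conf > /dev/null
fi

# Quick sanity check that yum can see its repositories again.
sudo yum -q repolist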

While I was doing these builds I learned something new. I forgot to amend the path to my ASM disks from a UNIX-style path “/u05/VirtualBox/shared/ol7_183_rac/…” to a Windows-style path. Vagrant didn’t care and just created the location under the C drive as “C:\u05\VirtualBox\shared\ol7_183_rac\”. I’ll have to add a note about that to my “README.md” files.

I’ve still got to update some Docker builds with the latest software. I’ll probably do that over this week…

Cheers

Tim…


Why Automation Sucks

I’ve written a number of posts about how important automation is (here), but thought I would mention something that happened on Friday…

If you’ve followed the blog you will know I recently released a hands-off installation of 18c RAC. On Friday I received a pull request on GitHub suggesting a change to the GI software installation, switching the public and private network device names from “enp0s8” & “enp0s9” to “eth1” & “eth2”. I hadn’t seen Ethernet device names like this since RHEL5/OL5, so I was convinced it was a mistake. I was in the middle of writing a comment on the pull request, but I thought I had better do my homework rather than making an assumption.

When I got home I fired up the existing RAC on my Linux server and the device names were “enp0s8” & “enp0s9”, as I expected. I fired up a new DNS server on my laptop and the public network device name was “enp0s8”. At this point I was convinced I was correct, but I thought I had better just make sure. I was using the latest VirtualBox and Vagrant releases, but I noticed the output from the build said there was a newer version of the “bento/oracle-7.5” Vagrant box. I did a “vagrant box update”, “vagrant destroy -f” and “vagrant up”, and the new DNS server was built with a public network device name of “eth1”.

I went back to my Linux server, did the same and the device names were eth1 and eth2 on there too. So the “bento/oracle-7.5” box update had caused the ethernet device naming to change. As a result of this I accepted the pull request, made a similar change to another config file, then destroyed and started a rebuild of my RAC. I went out to visit some mates and watched the first two episodes of Ozark season 2. By the time I came back I had a new RAC up and running.
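For reference, checking the device names inside the guest, and the rebuild itself, look something like this.

# Check the network device names inside the guest VM.
vagrant ssh -c "ip -o link show"     # expect eth1/eth2 or enp0s8/enp0s9 depending on the box version

# Pick up the new version of the base box, then rebuild from scratch.
vagrant box update
vagrant destroy -f
vagrant up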

Why does this mean automation sucks? Well it doesn’t really, but it does show you how your automation can be broken by things outside of your control. That could be a change to a Vagrant box, a Docker image, the Kickstart file from your system admins, or even a cloud service. You have to keep on top of this stuff. You can’t just define it and forget about it. Having said that, I would rather find a problem like this in an automated build than in a manual process… 🙂

Check out the rest of the series here.

Cheers

Tim…

Why Automation Matters : You’re only a tweak away!

Once you start on the automation path it becomes progressively easier to automate new things because you will build up a collection of stuff you can tweak to create the new stuff. Here’s an example…

A little while ago I did a hands-off build of 18c RAC using VirtualBox and Vagrant (here). I had to solve a few little problems, but for the most part it was piecing together a bunch of stuff I already had, like silent installations and database creations, so no big drama. Probably the most complicated thing was deciding how I wanted to organise things, which I’m sure will change over time…

Fast forward to a few days ago when I wanted to play around with 18c Data Guard. I actually took the RAC build and used that as the basis of this little project. Obviously some things were chopped out and some things were added, but a lot of it was just reused, which saved a bunch of thinking and hassle.

Once I had the 18c build working, a couple of changes in the config files and I had a 12cR2 build up and running (here). Some config file changes and a couple of minor scripting changes and I had a 12cR1 build up and running. You get the picture.

Of course you will occasionally have to do something that constitutes a step change, or you will decide to take a completely new approach* and have to go back to basics, but a lot of the time you are only a tweak away from the next automation.

Check out the rest of the series here.

Cheers

Tim…

* I’m playing around with Ansible at the moment, so maybe I’ll end up redoing these using Ansible. Maybe not. We’ll see. 🙂

Why Automation Matters : ITIL

ITIL is quite a divisive subject in the geek world. Once the subject is raised, most of us geeks start channelling our inner cowboy/cowgirl, thinking we don’t need the shackles of a formal process, because we know what we are doing and don’t make mistakes. Then, when something goes wrong, everyone looks around saying, “I didn’t do anything!”

Despite how annoying it can seem at times, you need something like ITIL for a couple of reasons:

  • It’s easy to be blinkered. I see so many people who can’t see beyond their own goals, even if that means riding roughshod over other projects and the needs of the business. You need something in place to control this.
  • You need a paper trail. As soon as something goes wrong you need to know what’s changed. If you ask people you will hear a resounding chorus of “I’ve not changed anything!”, sometimes followed by, “… except…”. It’s a lot easier to get to the bottom of issues if you know exactly what has happened and in what order.

So what’s this got to do with automation? The vast majority of ITIL-related tasks I’m forced to do should be invisible to me. Imagine the build and deployment of a new version of an application to a development server. The process might look like this.

  • Someone requests a new deployment manually, or it is done automatically on a schedule or triggered by a commit.
  • A new deployment request is raised.
  • The code is pulled from source control.
  • The build is completed and the result of the build is recorded in the deployment request.
  • Automated testing is used to test the new build. Let’s assume it’s all successful for the rest of the list. The results of the testing are recorded in the deployment request.
  • Artefacts from the build are stored in some form of artefact store.
  • The newly built application is deployed to the application server.
  • The result of the deployment is recorded in the deployment request.
  • Any necessary changes to the CMDB are recorded.
  • The deployment request is closed as successful.

None of those tasks require a human. For a development server the changes are all pre-approved, and all the ITIL “work” is automated, so you have the full paper trail, even for your development servers.
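To make that concrete, here’s a very rough sketch of what a pipeline step that does the work and writes the paper trail could look like. The endpoint, table name, fields and helper scripts are all illustrative (loosely modelled on a ServiceNow-style REST table API), not a real integration.

#!/bin/bash
# deploy.sh : illustrative only. Do the deployment AND record it, with no human involved.
# SN_USER / SN_PASS are assumed to be set in the environment.
set -euo pipefail

SN_URL="https://example.service-now.com/api/now/table/deployment_request"   # illustrative endpoint

# Raise the deployment request and capture its id.
REQ_ID=$(curl -s -u "${SN_USER}:${SN_PASS}" -H "Content-Type: application/json" -H "Accept: application/json" \
  -d '{"short_description":"Automated deployment of myapp to dev"}' \
  "${SN_URL}" | jq -r '.result.sys_id')

note () {
  # Append a work note to the deployment request (illustrative field name).
  curl -s -u "${SN_USER}:${SN_PASS}" -H "Content-Type: application/json" \
    -X PATCH -d "{\"work_notes\":\"$1\"}" "${SN_URL}/${REQ_ID}" > /dev/null
}

# Pull, build, test, deploy : hypothetical helper scripts standing in for your real tooling.
git clone --depth 1 https://example.com/myapp.git && cd myapp
./build.sh           && note "Build successful"
./run_tests.sh       && note "Automated tests passed"
./store_artefacts.sh && note "Artefacts stored"
./deploy_to_dev.sh   && note "Deployed to dev"

# Close the request as successful.
curl -s -u "${SN_USER}:${SN_PASS}" -H "Content-Type: application/json" \
  -X PATCH -d '{"state":"closed_successful"}' "${SN_URL}/${REQ_ID}" > /dev/null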

It’s hard to be annoyed by ITIL if most of it is invisible to you! 🙂

IMHO the biggest problem with ITIL is bad implementation: over-complication, an emphasis on manual operations and a lack of continuous improvement. If ITIL is hindering your progress you are doing it wrong. The same could be said about lots of things. 🙂 One way of solving this is to automate the problem out of existence.

Check out the rest of the series here.

Cheers

Tim…

Why Automation Matters : Patching and Upgrading

As I said in a recent post, you know you are meant to patch and upgrade, but you don’t. Why not?

The reasons will vary a little depending on the tech you are using, but I’ll divide this answer into two specific parts. The patch/upgrade process itself and testing.

The Patch/Upgrade Process

I’ve lived through the bad old days of Oracle patching and upgrades and it was pretty horrific. In comparison, things are a lot better these days, but they are still not what they should be, in my opinion. I can script patches and upgrades, but I shouldn’t have to. I’m sure this will get some negative feedback, but I think people need to stop navel-gazing and see how simple some other products are to deal with. I’ll stop there…
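For what it’s worth, “scripting a patch” for a simple single-instance database ends up looking something like the sketch below. The paths, SID and patch number are placeholders, and a real script needs far more error handling, but the point is that it’s the same steps in the same order every single time.

#!/bin/bash
# apply_patch.sh : rough sketch of a scripted one-off patch on a single-instance database.
set -euo pipefail

export ORACLE_HOME=/u01/app/oracle/product/18.0.0/dbhome_1   # placeholder
export ORACLE_SID=cdb1                                       # placeholder
PATCH_DIR=/u01/software/patches/12345678                     # placeholder patch number

# Record what is installed before we start.
"${ORACLE_HOME}/OPatch/opatch" lsinventory > /tmp/lsinventory_before.log

# Stop the listener and the database.
"${ORACLE_HOME}/bin/lsnrctl" stop
"${ORACLE_HOME}/bin/sqlplus" -s / as sysdba <<EOF
shutdown immediate;
exit;
EOF

# Apply the patch silently.
cd "${PATCH_DIR}"
"${ORACLE_HOME}/OPatch/opatch" apply -silent

# Restart everything and run datapatch to complete the SQL side of the patch.
"${ORACLE_HOME}/bin/lsnrctl" start
"${ORACLE_HOME}/bin/sqlplus" -s / as sysdba <<EOF
startup;
exit;
EOF
"${ORACLE_HOME}/OPatch/datapatch" -verbose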

That said, I don’t think patches and upgrades are actually the problem. Of course you have to be careful about limiting downtime, but much of this is predictable and can be mitigated.

One of the big problems is the lack of standardisation within a company. When every system is unique, automating a patch or upgrade procedure can become problematic. You have to include too much logic in the automation, which can make the automation a burden. What the cloud has taught us is you should try to standardise as much as possible. When most things are the same, scripting and automation get a lot easier. How do you guarantee things conform to a standard? You automate the initial build process. 🙂

So if you automate your build process, you actually make automating your patch/upgrade process easier too. 🙂

The app layer is a lot simpler than the database layer, because it’s far easier to throw away and replace an application layer, which is what people aim to do nowadays.

Testing

Testing is usually the killer part of the patch/upgrade process. I can patch/upgrade anything without too much drama, but getting someone to test it and agree to moving it forward is a nightmare. Spending time to test a patch is always going to lose out in the war for attention if there is a new spangly widget or screen needed in the application.

This is where automation can come to the rescue. If you have automated testing, not only can you move applications through the development pipeline quicker, but you can also progress infrastructure changes, such as patches and upgrades, much quicker too, as there will be greater confidence in the outcome of the process.
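Even a crude automated gate changes the conversation. Something like the sketch below, run against a disposable test environment, gives you a repeatable yes/no answer instead of a patch sitting in a queue waiting for someone to find time. The helper script names are placeholders for whatever provisioning and test tooling you actually use.

#!/bin/bash
# patch_gate.sh : illustrative gate. Build a throwaway environment, patch it,
# run the automated tests, and only flag the patch as safe to progress on success.
set -euo pipefail

./build_test_env.sh          # hypothetical : provision a clean environment
./apply_patch.sh             # the same scripted patch used everywhere else
if ./run_regression_tests.sh; then
  echo "PASS : patch is safe to progress" | tee -a patch_gate.log
else
  echo "FAIL : patch blocked, see test output" | tee -a patch_gate.log
  exit 1
fi
./destroy_test_env.sh        # hypothetical : throw the environment away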

Conclusion

Patching and upgrades can’t be considered in isolation where automation is concerned. It doesn’t matter how quickly and reliably you can patch a database or app server if nobody is ever going to validate it is safe to progress to the next level.

I’m not saying don’t automate patching and upgrades, you definitely should. What I’m saying is it might not deliver on the promise of improved roll-out speed, as a chain is only as strong as its weakest link. If testing is the limiting factor in your organisation, all you are doing by speeding up your link in the chain is adding to the testing burden down the line.

Having said all that, at least you will know your stuff is going to work and you can spend your time focusing on other stuff, like maybe helping people sort out their automated testing… 🙂

Check out the rest of the series here.

Cheers

Tim…

Why Automation Matters : Reliability and Confidence

In my previous post on this subject I mentioned the potential for human error in manual processes. This leads nicely into the subject of this post about reliability and confidence…

I’ve been presenting at conferences for over a decade. Right from the start I included live demos in those talks. For a couple of years I avoided them to make my life simpler, but I’ve moved back to them again as I feel in some cases showing something has a bigger impact than just saying it…

The Problem

One of the stressful things about live demos is they require something to run the demo on, and what happens if that’s not in the state you expect it to be?

I had an example of this a few years ago. I was in Bulgaria doing a talk about CloneDB and someone asked me a question at the end of the session, so I trashed my demo to allow me to show the answer to their question. I forgot to correct the situation, so when I came to do the same demo at UKOUG it went horribly wrong, which led someone on Twitter to say “session clone db is a mess”, and they were correct. It was. The problem here was I wasn’t starting from a known state…

This is no different for us developers and DBAs out in the real world. When we are given some kit, we want to know it’s in a consistent state, but it might not be for a few reasons.

Human Error

The system was created using a manual build process and someone made a mistake. I think almost every system coming out of a manual process has something screwed up on it. I make mistakes like this too. The phone rings, you get distracted, and when you come back to the original task you forget a step. You can minimise this with recipes and checklists, but we are human. We will goof up, regardless of the measures we put in place.

Sometimes it’s easy to find and fix the issue. Sometimes you have to step through the whole process again to identify it. For complex builds this can take a long time, and that’s all wasted time.

Changes During the Lifespan

The delivered system was perfect, but then it was changed during its lifespan. Here are a couple of examples.

App Server: Someone is diagnosing an issue and they change some app server parameters and forget to set them back. Those changes don’t fix the current issue, but they do affect the outcome of the next test. The testing completes successfully, the application gets moved to production and it fails, because UAT and Live no longer have the same environment, so the outcomes are not comparable or predictable.

Database: Several developers are using a shared development database. Each person is trying to shape the data to fit their scenario, and in the process trashing someone else’s work. The shared database is only refreshed a handful of times a year, so these inconsistencies linger for a long time. If the setup of test data is not done carefully you can add logical corruptions to the data, making it no longer representative of a real situation. Once again the outcomes are not comparable or predictable.

The Solution?

I guess from the title you already know this. Automation.

Going back to my demo problem again, I almost had a repeat of this scenario at Oracle Code: Bangalore a few months ago. I woke up the day of the conference, did a quick run through of my demos, and something wasn’t working. How did I solve it? I rebuilt everything. 🙂

I do most of my demos using Docker these days, even for non-Docker stuff. I use Oracle Linux 7 and UEK4 as my base OS and kernel, so I run Docker inside a VirtualBox VM. The added bonus is I get a consistent experience regardless of underlying host OS (Windows, macOS or Linux). So what did the rebuild involve? From my laptop I just ran these commands.

vagrant destroy -f
vagrant up

I subsequently connected to the resulting VM and ran this command to build and run the specific containers for my demo.

docker-compose up

What I was left with was a clean build in exactly the condition I needed it to be in to do my demos. Now I’m not saying I wasn’t nervous, because not having working demos on the morning of a conference is a nerve-wracking thing, but I knew I could get back to a steady state, so this whole issue resulted in one line in the blog post for that day. 🙂 Without automation I would have been trying to find and fix the problem, or manually rebuilding everything under time pressure, which is a surefire way to make mistakes.

I do some demos on Oracle Database Cloud Service too. When I recently switched between trial accounts my demo VM was lost, so I provisioned a new 18c DBaaS, uploaded a script and ran it. Setup complete.

Confidence

Automation is quicker. I think we all get that. Having a reliable build process means you have the confidence to throw stuff away and build clean at any point. Think about it.

  • Developers replacing their whole infrastructure whenever they want. At a minimum once per sprint.
  • Deployments that don’t just push new code to an environment, but replace the infrastructure along with it.
  • Environments fired up for a single purpose, maybe some automated QA or staff training, then destroyed.
  • When something goes wrong in production, just replace it. You know it’s going to work because it did in all your other environments.

Having reliable automation brings with it a greater level of confidence in what you are delivering, so you can spend less time on unplanned work fixing stuff and focus more on delivering value to the business.

Tooling

The tooling you choose will depend a lot on what you are doing and what your preferences are. For example, if you are focusing on the RDBMS layer, it is unlikely you will choose Docker for anything other than little demos. For some 3rd party software it’s almost impossible to automate a build process, so you might use gold images as your starting point or partially automate the process. In some cases you might use the cloud to provide the automation for you. The tooling is less important than the mindset in my opinion.

Check out the rest of the series here.

Cheers

Tim…

Why Automation Matters : Lost Time

Sorry for stating what I hope is the obvious, but automation matters. It’s mattered for a long time, but the constant mention of Cloud and DevOps over the last few years has thrown even more emphasis on automation.

If you are not sure why automation matters, I would just like to give you an example of the bad old days, which might be the current time for some who are still doing everything manually, with separate teams responsible for each stage of the process.

Lost Time : Handover/Handoff Lag

In the diagram below we can see all the stages you might go through to deploy a new application server. Every time the colour of the box changes, it means a handover to a different team.

So there are a few things to consider here.

  • Each team is likely to have different priorities, so a handover between teams is not necessarily instantaneous. The next stage may be waiting on a queue for a long time. Potentially days. Don’t even get me started on things waiting for people to return from holiday…
  • Even if an individual team has created build scripts and done their best to automate their tasks, if the process relies on someone picking the task off a queue to initiate it, there will still be a handover delay.
  • When things are done manually people make mistakes. It doesn’t matter how good the people are, they will mess up occasionally. That is why the diagram includes a testing failure, and the process being redirected back through several teams to diagnose and fix the issue. This results in even more work. Specifically, unplanned work.
  • Manual processes are just slower. Running an installer and clicking the “Next” button a few times takes longer than just running a script. If you have to type responses and make choices it’s going to take even more time, and don’t forget that bit about human error…

Let’s contrast this to the “perfect” automated setup, where the request triggers an automated process to deliver the new service.

In this example, the request initiates an automated workflow that completes the action and delivers the finished product without any human intervention along the path. The automation takes as long as it takes, and ultimately has to do most of the same work, but there is no added handover lag in this process.

I think it’s fair to say you would be expecting a modern version of this process to complete in a matter of minutes, but I’ve seen the manual process take weeks or even months, not because of “work time”, but because of the idle handover time and human processes involved…

They Took Our Jobs!

At first glance it might seem like this is a problem if you are employed in any of the teams responsible for doing the manual tasks. Surely the automation is going to mean job cuts right? That depends really. In order to fully automate the delivery of any service you are going to have to design and build the blocks that will be threaded together to produce the final solution. This is not always simple. Depending on your current setup this might mean having fewer, more highly skilled people, or it might require more people in total. It’s impossible to know without knowing the requirements and the current staffing levels. Also, cloud provides a lot of the building blocks for you, so if you go that route there may be less work to do in total.

Even if the number of people doesn’t change as part of the automation process, you are getting work through the door quicker, so you are adding value to the business at a higher rate. From a DevOps perspective you have not added value to the business until you’ve delivered something to them. All the hours spent getting part of the build done equate to zero value to the business…

But we are doing OK without automation!

No you’re not! You’re drowning! You just don’t know it yet!

I never hear people saying they haven’t got enough projects waiting. I always hear people saying they have to shelve things because they don’t have the staff/resources/time to do them.

As your processes get more efficient you should be able to reallocate staff to projects that add value to the business, rather than wasting their lives on clicking the “Next” button.

If your process stays inefficient you will always be saying you are short of staff and every new project will require yet another round of internal recruitment or outsourcing.

Is this DevOps?

I’m hesitant to use the term DevOps as it can be a bit of a divisive term. I struggle to see how anyone who understands DevOps can’t see the benefits, but I think many people don’t know what it means, and without the understanding the word is useless…

Certainly automation is one piece of the DevOps puzzle, but equally if you have company resistance to the term DevOps, feel free to ignore it and focus on trying to sell the individual benefits of DevOps, one of which is improved automation…

Check out the rest of the series here.

Cheers

Tim…

PS. Conway’s Law – Melvin Conway 1967

“organizations which design systems … are constrained to produce designs which are copies of the communication structures of these organizations.”

In language I can understand.

If you have 10 departments, each process will have 10 sections, with a hand-off between them.