High Availability : How much availability do you really need?

I had a discussion with a few folks about high availability (HA) and I thought I would write down some of my thoughts on it. I’m sure I’ve made many of these points before, but maybe not collectively in this form.

Before we start

This is not specifically about Oracle products. It applies equally well to any technology, but I will use Oracle as an example, because that’s what I know.

When I speak about a specific product or technology I am only talking about it in the context of HA. I don’t care what other benefits it brings, as they are not relevant to this discussion.

This is not about me saying, “you don’t need to buy/use X”. It’s me asking you to ask yourself if you need X, before you spend money and time on it.

How much downtime can you really tolerate?

This is a really simple question to ask, but not one you will always get a considered answer to. Without thinking, people will demand 24×7 operation with zero downtime, yet when you ask for a downtime window to perform a task, it gets approved. Clearly that contradicts the 24×7 stance.

As a company you have to get a good grasp of what downtime you can *really* tolerate. It might be different for every system. Think about interfaces and dependencies. If system A is considered “low importance”, but it is referenced by system B that is considered “high importance”, that may alter your perception of system A, and its HA requirements.

There are clearly some companies that require as close to 100% availability as possible, but there are also a lot that don’t. Many can get away with planned downtime, and provided failures don’t happen too often, can work through them with little more than a few grumbles. We are not all the same. Don’t get led astray by thinking you are Netflix.

The more downtime you can tolerate, the more HA options are available to you, and the simpler and cheaper your solutions can become.

What is the true cost of your downtime?

The customers of some companies have no brand loyalty. If the site is down, the customers will go elsewhere. Some companies have extreme brand loyalty and people will tolerate being messed around.

If Amazon is down, I will wait until it is back online and make the purchase. There could be a business impact in terms of the flow of work downstream, but they are not going to lose me as a customer. So you can argue Amazon can’t tolerate downtime, or you can argue they can.

I used to play World of Warcraft (WOW), and it always irritated me when they did the Wednesday server restarts, but I just grumbled and waited. Once again, their customer base could tolerate planned downtime.

In some cases you are talking about reputational damage. If an Oracle website is down it’s kind-of embarrassing when they are a company that sells HA solutions. Reputational damage can be quite costly.

This cost of downtime for planned maintenance and failures has to factor into your decision about how much downtime you can tolerate.

Can you afford the uptime you are demanding?

High availability costs money. The greater the uptime you demand, the more it’s going to cost you. The costs are multi-factored. There is the cost of the kit, the licenses and the people skills. More about people later.

If you want a specific level of availability, you have to be willing to invest the money to get it. If you are on a budget, good luck with that 99.99+% uptime… 🙂
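
To put numbers like that in perspective, it’s worth translating availability percentages into actual downtime. Here’s a quick back-of-the-envelope sketch (the figures are purely illustrative, and real SLAs define their own measurement windows).

  # Rough downtime budgets for common availability targets.
  # Illustrative only - real SLAs define their own measurement windows.
  MINUTES_PER_YEAR = 365.25 * 24 * 60

  for availability in (99.0, 99.9, 99.95, 99.99, 99.999):
      downtime_minutes = MINUTES_PER_YEAR * (1 - availability / 100)
      print(f"{availability}% uptime = {downtime_minutes:,.0f} minutes "
            f"({downtime_minutes / 60:.1f} hours) of downtime per year")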

Do you have the skills to minimize downtime?

It’s rare that HA comes for free from a skills perspective. Let’s look at some scenarios involving Oracle databases.

  • Single instance on VM: You are relying on your virtual infrastructure to handle failure. Your DBAs can have less HA experience, but you need to know your virtualization folks are on form.
  • Data Guard: Your DBAs have to know all the usual skills, but also need good Data Guard skills. There is no point having a standby database if you don’t know how to use it, or it doesn’t work when you need it (there’s a small monitoring sketch after this list).
  • Real Application Clusters (RAC): Now your DBAs need RAC skills. I think most people would agree that RAC done badly will give you less availability than a single instance database, so your people have to know what they are doing.
  • RAC plus Data Guard: I think you get the point.
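
As a small example of the “does it actually work when you need it” point for Data Guard, something as basic as keeping an eye on the apply and transport lag goes a long way. This is just a sketch using the python-oracledb driver, with placeholder connection details, and it assumes the standby is open read-only so a normal connection works.

  # Minimal sketch: check apply and transport lag on a Data Guard standby.
  # Assumes the python-oracledb driver, a placeholder DSN, and a standby that
  # is open read-only (Active Data Guard) so a normal connection works.
  import oracledb

  STANDBY_DSN = "standby-host/STBYPDB"  # placeholder connect string

  def check_lag(user: str, password: str) -> None:
      with oracledb.connect(user=user, password=password, dsn=STANDBY_DSN) as conn:
          with conn.cursor() as cur:
              cur.execute("""
                  SELECT name, value
                  FROM   v$dataguard_stats
                  WHERE  name IN ('apply lag', 'transport lag')
              """)
              for name, value in cur:
                  print(f"{name}: {value}")

  if __name__ == "__main__":
      check_lag("monitor_user", "change_me")  # hypothetical monitoring account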

We often hear about containers and microservices as the solution to all things performance and HA related, but that’s going to fail badly unless you have the correct skills.

Some of these skills can be ignored if you are willing to use a cloud service that does it for you, but if not you have to staff it! That’s either internal staff, or an external consultancy. If you skimp on the skills, your HA will fail!

What are you protecting against?

The terms high availability (HA) and disaster recovery (DR) can kind-of merge in some conversations, and I don’t want to get into a war about it. The important point is people need to understand what their HA/DR solutions can do and what they can’t.

  • Failure of a process/instance on a single host.
  • Failure of a host in a cluster located in a single data centre.
  • Failover between data centres in the same geographical region.
  • Failover between data centres in different geographical regions.
  • Failover between planets in the same solar system.

You get the idea. It’s easy to put your money down and think you’ve got HA sorted, but have you really? I think we’ve all seen (or lived through) the stories about systems being designed to failover between data centres, only to find one data centre contains a vital piece of the architecture that breaks everything if it is missing.

Are all your layers highly available?

A chain is only as strong as the weakest link. What’s the point of spending a fortune on sorting out your database HA if your application layer is crap? What’s the point of having a beautifully architected HA solution in your application layer if your database HA is screwed?

Teams will often obsess about their own little piece of the puzzle, but a failure is a failure to the users. They aren’t going to say, “I bet it wasn’t the database though!”

Maybe your attention needs to be on the real problems, not performing HA masturbation on a layer that is generally working fine.

Who are you being advised by, and what is their motivation?

Not everyone is coming to the table with the same motivations.

  • Some vendors just want to sell licenses.
  • Some consultants want to charge you for expensive skills and training.
  • Some consultants and staff want to get a specific skill on their CV, and are happy for you to pay them to do that.
  • Some vendors, consultants and staff don’t engage their brain, and just offer the same solution to every company they encounter.
  • Some people genuinely care about finding the best solution to meet your needs.

Over my career I’ve seen all of these. Call me cynical, but I would suggest you always question the motives of the people giving you advice. Not everyone has your best interests at heart.

So what do you do?

In my current company we use our virtual infrastructure for basic HA. The databases (Oracle, MySQL and SQL Server) typically run as single instances in a VM, and fail over to a different host in the same data centre or a different data centre depending on the nature of the failure. There are some SQL Server databases that use AlwaysOn, but I see little benefit to it for us.

Every so often the subject of better database HA comes up. We can tolerate a certain level of downtime for planned maintenance and failures, and the cost and skills required for better HA are not practical for us at this point. This position is correct for us as we currently stand. It may not be the correct option in future, and if so we will revisit it.

For the middle tier we do the normal thing of multiple app servers (VMs or containers) behind load balancers.
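
The load balancer part only works if it can tell a dead app server from a live one, which usually means some sort of health check endpoint on each app server. This isn’t our actual setup, just a generic sketch of the kind of thing a load balancer might poll, using nothing but the Python standard library.

  # Generic sketch of a health check endpoint a load balancer could poll.
  # Not our actual setup - it just illustrates the idea with the standard library.
  from http.server import BaseHTTPRequestHandler, HTTPServer

  def dependencies_ok() -> bool:
      # Placeholder: check database connections, queues, disk space, etc.
      return True

  class HealthHandler(BaseHTTPRequestHandler):
      def do_GET(self):
          if self.path == "/health" and dependencies_ok():
              self.send_response(200)
              self.end_headers()
              self.wfile.write(b"OK")
          else:
              # Anything other than 200 and the load balancer drops this node.
              self.send_response(503)
              self.end_headers()

  if __name__ == "__main__":
      HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()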

I could probably build a very convincing case to do things differently to make my job a little “sexier”, but that would be a dick move. As it happens I want to move everything to the cloud so I can stop obsessing about the boring stuff and let the cloud provider worry about it. 🙂

Conclusion

There is no “one size fits all” solution to anything in life. As the mighty Tom Kyte said many times, the answer is always, “it depends”. If people are making decisions without discussing the sort of things I’ve mentioned here, I would suggest their decision process is flawed. Answers like, “but that’s what Twitter does”, are unacceptable.

Cheers

Tim…

Oracle Upgrade/Migration : What method are you using?

I tweeted a couple of days ago about an important upgrade/migration I was doing yesterday. I was moving a smallish, but high profile, database from Oracle 11g on HP-UX Itanium to Oracle 12c on Oracle Linux inside a VMware VM. Almost immediately Joey D’Antoni came back with the question, “What method are you using?” I thought it might be worth writing a little something about the decision process.

When you are deciding how to upgrade and/or migrate a database there are a few things to consider:

  • Platform : Prior to Oracle 10g, if you wanted to change platforms you didn’t have much of a choice; it was exp/imp only. As well as introducing Data Pump, Oracle 10g introduced the ability to convert datafiles and image copy backups using the RMAN CONVERT command. This meant transportable tablespaces were a valid option for platform migration, with or without upgrades.
  • Size : Above a certain size, waiting for an expdp/impdp is not practical. You are going to want to minimise downtime.
  • Allowable Downtime : Despite what the marketing people would have us believe, not all databases have to be 24×7. You should always try to minimise downtime, but it is not always necessary to eliminate it completely. If downtime is an issue there are options, but they may involve additional effort and cost.
  • Money : You can do almost anything if you are willing to throw time and money at it.
  • Junk : When you take a look at a database that has gone through successive upgrades, you can often see the lingering signs of old crap and bad decisions that will haunt you forever. It is really nice if you can start fresh and correct some of the mistakes of the past.

Not everyone has the luxury of large amounts of downtime, so what can you do to minimise downtime? Here are a few options, each having its own pros and cons, as well as restrictions depending on whether your focus is migration, upgrade or both. It’s not an exhaustive list. 🙂

  • Transportable tablespaces, or transportable database, reduce migration time to about the time it takes to copy the files between servers. Depending on your storage tech, this could be a very short time indeed. The nice thing about this is it can be used across platforms and between versions, so an “upgrade” using TTS can be really quick. I used this for a Solaris to Oracle Linux move about a year ago (there’s a rough sketch of the steps after this list).
  • You can use backups to restore a database to a new location, then keep recovering it using archived redo logs until the changeover time. Then ship the final archived redo logs, recover the new database and you’re done. You could do the same thing with incrementally updated image copy backups. This is more about migration than upgrade.
  • Reading the previous point, you are probably thinking it sounds similar to Data Guard, and you could indeed use Data Guard to switch machines and/or perform a rolling upgrade. If you are using Standard Edition, you could use Dbvisit Standby. Either way, the migration could be as quick as a switchover.
  • You can do an RMAN DUPLICATE to switch servers. It’s not going to upgrade your database, but you can use this to move it, then upgrade later.
  • You can replicate between the old and new system until your changeover point. You might use something like GoldenGate or Dbvisit Replicate to do this.
  • You can use the good old expdp/impdp. It’s probably going to be slow, and require a lot of downtime, but it’s logically simple.
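
To give a feel for the transportable tablespace route, here is a rough outline of the steps for a cross-platform move, wrapped in a little Python purely to keep the commands in one place. All the names, paths and credentials are placeholders, so treat it as a sketch rather than something to run as-is.

  # Rough outline of a cross-platform transportable tablespace move.
  # All names, paths and credentials are placeholders; expdp/impdp will prompt
  # for passwords. RMAN CONVERT is only needed if the endian format differs.
  import subprocess

  def run(cmd: list[str]) -> None:
      print("Running:", " ".join(cmd))
      subprocess.run(cmd, check=True)

  # 1. On the source: make the tablespace read only (SQL*Plus), then export
  #    the metadata for the tablespace set.
  run(["expdp", "system@SRCDB",
       "directory=DATA_PUMP_DIR", "dumpfile=tts_users.dmp",
       "logfile=tts_users_exp.log", "transport_tablespaces=USERS"])

  # 2. Copy the dump file and datafiles to the target, converting the datafiles
  #    with RMAN CONVERT first if the platforms have different endianness.

  # 3. On the target: plug the tablespace in, pointing at the copied datafiles.
  run(["impdp", "system@NEWDB",
       "directory=DATA_PUMP_DIR", "dumpfile=tts_users.dmp",
       "logfile=tts_users_imp.log",
       "transport_datafiles=/u02/oradata/NEWDB/users01.dbf"])

  # 4. Make the tablespace read write again on the target if needed.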

There are lots of variations on a theme. The important point is you pick what’s right for you.

So what did I pick for this upgrade/migration? It was good old expdp/impdp. Why?

  • The database was relatively small.
  • The downtime involved was acceptable. It’s mostly accessed during the day and I started the process stupidly early to minimise the effect on the users.
  • It allowed me to clean up a lot of crap in the process. The database was about 18 years old and had been through previous upgrades and server migrations. It was bearing the scars of those and I wanted to clean it all up.
  • It allowed me to both upgrade and migrate in a single step, so it was logically very simple.
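
In practice the run itself was little more than a full export on the old server and a full import on the new one, roughly like this. Again, these are placeholder names and connect strings rather than my exact commands, and a real run would have parallel, exclude and remap options tuned to the database.

  # The move was essentially a full export on the old server and a full import
  # on the new one. Placeholder connect strings and directories only.
  import subprocess

  # On the old server:
  subprocess.run(["expdp", "system@OLD11G",
                  "directory=DATA_PUMP_DIR", "dumpfile=full.dmp",
                  "logfile=full_exp.log", "full=y"], check=True)

  # Copy the dump file across, then on the new server:
  subprocess.run(["impdp", "system@NEW12C",
                  "directory=DATA_PUMP_DIR", "dumpfile=full.dmp",
                  "logfile=full_imp.log", "full=y"], check=True)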

As always, there is not a *best* solution. You have to pick what is right for you and the constraints you are working with.

How did it go? Fine. We had already gone through the process in a Dev and Test environment, and I had done a run-through on the new production kit too, so I knew it would work and had accurate timings, which made it easier to sell to the decision makers.

Happy days!

Cheers

Tim…

PS. We will inevitably have some firefighting to do, as people always forget about the odd interface or application that connects to the system once in a blue moon. The old instance is turned off, so we should find out if anything has been forgotten pretty quick! 🙂