I had a discussion with a few folks about high availability (HA) and I thought I would write down some of my thoughts on it. I’m sure I’ve made many of these points before, but maybe not collectively in this form.
Before we start
This is not specifically about Oracle products. It applies equally well to any technology, but I will use Oracle as an example, because that’s what I know.
When I speak about a specific product or technology I am only talking about it in the context of HA. I don’t care what other benefits it brings, as they are not relevant to this discussion.
This is not about me saying, “you don’t need to buy/use X”. It’s me asking you to ask yourself if you need X, before you spend money and time on it.
How much downtime can you really tolerate?
This is a really simple question to ask, but not one you will always get a considered answer to. Without thinking, people will demand 24×7 operation with zero downtime, yet you ask for a downtime window to perform a task and it will get approved. Clearly this contradicts the 24×7 stance.
As a company you have to get a good grasp of what downtime you can *really* tolerate. It might be different for every system. Think about interfaces and dependencies. If system A is considered “low importance”, but it is referenced by system B that is considered “high importance”, that may alter your perception of system A, and its HA requirements.
There are clearly some companies that require as close to 100% availability as possible, but also a lot don’t. Many can get away with planned downtime, and provided failures don’t happen too often, can work through them with little more than a few grumbles. We are not all the same. Don’t get lead astray by thinking you are Netflix.
The more downtime you can tolerate, the more HA options are available to you, and the simpler and cheaper your solutions can become.
What is the true cost of your downtime?
The customers of some companies have no brand loyalty. If the site is down, the customers will go elsewhere. Some companies have extreme brand loyalty and people will tolerate being messed around.
If Amazon is down, I will wait until it is back online and make the purchase. There could be a business impact in terms of the flow of work downstream, but they are not going to lose me as a customer. So you can argue Amazon can’t tolerate downtime, or you can argue they can.
I used to play World of Warcraft (WOW), and it always irritated me when they did the Wednesday server restarts, but I just grumbled and waited. Once again, their customer base could tolerate planned downtime.
In some cases you are talking about reputational damage. If an Oracle website is down it’s kind-of embarrassing when they are a company that sells HA solutions. Reputational damage can be quite costly.
This cost of downtime for planned maintenance and failures has to factor into your decision about how much downtime you can tolerate.
Can you afford the uptime you are demanding?
High availability costs money. The greater the uptime you demand, the more it’s going to cost you. The costs are multi-factored. There is the cost of the kit, the licenses and the people skills. More about people later.
If you want a specific level of availability, you have to be willing to invest the money to get it. If you are on a budget, good luck with that 99.99+% uptime… 🙂
Do you have the skills to minimize downtime?
It’s rare that HA comes for free from a skills perspective. Let’s look at some scenarios involving Oracle databases.
- Single instance on VM: You are relying on your virtual infrastructure to handle failure. Your DBAs can have less HA experience, but you need to know your virtualization folks are on form.
- Data Guard: Your DBAs have to know all the usual skills, but also need good Data Guard skills. There is no point having a standby database if you don’t know how to use it, or it doesn’t work when you need it.
- Real Application Clusters (RAC): Now your DBAs need RAC skills. I think most people would agree that RAC done badly will give you less availability than a single instance database, so your people have to know what they are doing.
- RAC plus Data Guard: I think you get the point.
We often hear about containers and microservices as the solution to all things performance and HA related, but that’s going to fail badly unless you have the correct skills.
Some of these skills can be ignored if you are willing to use a cloud service that does it for you, but if not you have to staff it! That’s either internal staff, or an external consultancy. If you skimp on the skills, your HA will fail!
What are you protecting against?
The terms high availability (HA) and disaster recovery (DR) can kind-of merge in some conversations, and I don’t want to get into a war about it. The important point is people need to understand what their HA/DR solutions can do and what they can’t.
- Failure of a process/instance on a single host.
- Failure of a host in a cluster located in a single data centre.
- Failover between data centres in the same geographical region.
- Failover between data centres in different geographical regions.
- Failover between planets in the same solar system.
You get the idea. It’s easy to put your money down and think you’ve got HA sorted, but have you really? I think we’ve all seen (or lived through) the stories about systems being designed to failover between data centres, only to find one data centre contains a vital piece of the architecture that breaks everything if it is missing.
Are all your layers highly available?
A chain is only as strong as the weakest link. What’s the point of spending a fortune on sorting out your database HA if your application layer is crap? What’s the point of having a beautifully architected HA solution in your application layer if your database HA is screwed?
Teams will often obsess about their own little piece of the puzzle, but a failure is a failure to the users. They aren’t going to say, “I bet it wasn’t the database though!”
Maybe your attention needs to be on the real problems, not performing HA masturbation on a layer that is generally working fine.
Who are you being advised by, and what is their motivation?
Not everyone is coming to the table with the same motivations.
- Some vendors just want to sell licenses.
- Some consultants want to charge you for expensive skills and training.
- Some consultants and staff want to get a specific skill on their CV, and are happy for you to pay them to do that.
- Some vendors, consultants and staff don’t engage their brain, and just offer the same solution to every company they encounter.
- Some people genuinely care about finding the best solution to meet your needs.
Over my career I’ve seen all of these. Call me cynical, but I would suggest you always question the motives of the people giving you advice. Not everyone has your best interests at heart.
So what do you do?
In my current company we use our virtual infrastructure for basic HA. The databases (Oracle, MySQL and SQL Server) typically run as single instances in a VM, and failover to a different host in the same data centre or a different data centre depending on the nature of the failure. There are some SQL Server databases that use AlwaysOn, but I see little benefit to it for us.
Every so often the subject of better database HA comes up. We can tolerate a certain level of downtime for planned maintenance and failures, and the cost and skills required for better HA are not practical for us at this point. This position is correct for us as we currently stand. It may not be the correct option in future, and if so we will revisit it.
For the middle tier we do the normal thing of multiple app servers (VMs or containers) behind load balancers.
I could probably build a very convincing case to do things differently to make my job a little “sexier”, but that would be a dick move. As it happens I want to move everything to the cloud so I can stop obsessing about the boring stuff and let the cloud provider worry about it. 🙂
Conclusion
There is no “one size fits all” solution to anything in life. As the mighty Tom Kyte said many times, the answer is always, “it depends”. If people are making decisions without discussing the sort of things I’ve mentioned here, I would suggest their decision process is flawed. Answers like, “but that’s what Twitter does”, are unacceptable.
Cheers
Tim..