Separation of Duties (Poll Results Discussed)

On the back of the recent patching polls I asked a couple of questions about separation of duties.

As always, the sample size is small and my followers have an Oracle bias, so you can decide how representative you think these numbers are…

Separation of Duties

Here was the first question.

Regarding GI/DB, do you take advantage of separation of duties? Meaning separate people/groups looking after GI, ASM and DB on the server. Or does the DBA do all of it?

This is exactly what I expected. For the vast majority of companies, the DBAs are responsible for the Grid Infrastructure (GI), Automatic Storage Management (ASM) and the database (DB).

When Oracle first started floating the idea of separation of duties it kind-of surprised me, as I had never worked with a company that cared about it. Sure they have System Administrators that look after the OS, and maybe provision new disks on the server, but I have never experienced a situation where anyone other than the DBAs do anything with the Oracle side of things.

Don’t get me wrong, if that’s what a company wants to do, it’s good that Oracle make it possible, but I think the vast majority of people just don’t care! What’s more, I think it’s likely to cause more problems than it solves.

GI/DB Ownership

This was the second question, which was suggested by Aishwarya Kala.

Regarding GI/DB installations where the DBAs do all work on the system, have you split the ownership of the GI and DB installations between different users?

It’s interesting that nearly 90% of people have the DBAs doing all the work on the servers, but nearly 50% still split the ownership of the Grid Infrastructure and the database software.

Back in the day nobody talked about separation of duties and the “oracle” OS user owned everything. When discussing separation of duties, Oracle suggested the Grid Infrastructure should be owned by a different OS user, maybe “grid”, and the database carries on as before, typically using the “oracle” OS user. Then the documentation started to push the separation of ownership. Next the installation started to warn you if you used a common user. So now it’s got to the point where people think it is wrong to use a single user as the owner of the GI and database.

I am one of those people that use the “oracle” user as the owner of both the Grid Infrastructure and the database. If you have no separation of duties, I see no point in splitting these between two users. Occasionally I get questions about this in relation to my Vagrant RAC builds, and my response is simple. I don’t work in an environment with separation of duties, so I think splitting the ownership of the GI and database is pointless.

Personally, I think Oracle should remove the warnings from the installer and be more balanced in the documentation. If the poll results are representative of the wider audience, clearly very few people care about separation of duties. It should be an option, not the default assumption.

Cheers

Tim…

Joel Kallman Day 2022 : Announcement

Since 2016 we’ve had an Oracle community day where we push out content on the same day to try and get a bit of a community buzz. The name has changed over the years, but in 2021 it was renamed to the “Joel Kallman Day”. Joel was big on community, and it seems like a fitting tribute to him.

When is it?

The date is Tuesday October 11th. That’s two weeks from today!

How do I get involved?

Here is the way it works.

  • Write a blog post. The title should be in the format “<insert-the-title-here> #JoelKallmanDay“.
  • The content can be pretty much anything. See the section below.
  • Tweet out the blog post using the hashtag #JoelKallmanDay.
  • Publishing the posts on the same day allows us to generate a buzz. In previous years loads of people were on Twitter retweeting, making it even bigger. The community is spread around the world, so the posts will be released over a 24 hour period.
  • Oracle employees are welcome to join in. This is a community day about anything to do with the Oracle community.

Like previous years, it would be really nice if we could get a bunch of first-timers involved, but it’s also an opportunity to see existing folks blog for the first time in ages! 

The following day I write a summary post that includes links to all the posts that were pushed out through the day. You can see examples here.

What Should I Write About?

Whatever you want to write about. Here are some suggestions that might help you.

  • My favourite feature of {the Oracle-related tech you work on}.
  • What is the next thing on your list to learn.
  • Horror stories. My biggest screw up, and how I fixed it.
  • How the cloud has affected my job.
  • What I get out of the Oracle Community.
  • What feature I would love to see added to {the Oracle-related tech you work on}.
  • The project I worked on that I’m the most proud of. (Related to Oracle tech of course)

It’s not limited to these. You can literally write about anything Oracle-related. The posts can be short, which makes it easy for new people to get involved. If you do want to write about something technical, that’s fine. You can also write a simple overview post and link to more detailed posts on a subject if you like. In the previous years the posts I enjoyed the most were those that showed the human side of things, but that’s just me. Do whatever you like. 

Do I have to write in English?

No! It’s great to see people contributing to their own community. Google Translate does a pretty good job of translating them, so we can still read them.

Do I need to write about Joel or APEX?

I’m sure people would be happy to read stories about Joel, or content about APEX, but you don’t have to write about that. You can write about whatever you want, so long as it has an Oracle spin…

So you have two weeks to get something ready!

Cheers

Tim…

Oracle Database Patching (Poll Results Discussed)

Having recently put out a post about database patching, I was interested to know what people out in the world were doing, so I went to Twitter to ask.

As always, the sample size is small and my followers have an Oracle bias, so you can decide how representative you think these numbers are…

Patching Frequency

Here was the first question.

How often do you patch your production Oracle GI/DB installations? (Pick the nearest that applies)

There was a fairly even spread of answers, with about a third of people doing quarterly patching, and a quarter doing six-monthly patching. I feel like both these options are reasonable. About 20% were doing yearly patching, which is starting to sound a little risky to me. The real downer was that over 22% of people never patch their databases. This is interesting when you consider the recent announcement about monthly recommended patches (MRPs).

For those people that never patch, I can think of a few reasons why, off the top of my head.

  • Lack of testing resource. I think patch frequency has more to do with testing than any other factor. If you have a lot of databases, the testing resource to get through a patching cycle can be quite considerable. This is why you have to invest some time and money into automated testing.
  • If it ain’t broke, don’t fix it. The problem is, it is broken! How long after your system has been compromised will it be before you notice? How are your customers going to feel when you have a data breach and they find out you haven’t even taken basic steps to protect them? I don’t envy you explaining this…
  • Fear of downtime. I know downtime is a real issue to some companies, but there are several ways to mitigate this, and you have to balance the pros and the cons. I think if most people are honest, they can afford the downtime to patch their systems. They are just using this as an excuse.
  • Patching is risky. I understand that patches can introduce new issues, but that is why there are multiple ways to patch, with some being more conservative from a risk perspective. I think this is just another excuse.
  • Out of support database versions. I think this is a big factor. A lot of people run really old versions of the database that are no longer in support, and are no longer receiving patches. I don’t even think I need to explain why this is a terrible idea. Once again, how are you going to explain this to your customers?
  • Lack of skills. We like to think that every system is looked after by a qualified DBA, but the reality is that is just not true. I get a lot of questions from people who are SQL Server and MySQL DBAs that have been given some Oracle databases to look after, and they freely admit to not having the skills to look after them. Even amongst Oracle DBAs there is a massive variation in skills. Oracle patching has improved over the years, but it is still painful compared to other database engines. Just saying.

Type of Patching

This was the second question.

When patching your production Oracle GI/DB installations, which method do you use?
In-Place = Current ORACLE_HOME
Out-Of-Place = New ORACLE_HOME

This was a fairly even split, with In-Place winning by a small margin. Oracle recommend Out-Of-Place patching, but I think both options are fine if you understand the implications. I discussed these in my previous post.

Conclusion

I think of patch frequency in a similar way to upgrade frequency. If you do it very rarely, it’s really scary, and because nobody remembers what they did last time, there are a bunch of problems that occur, which makes everyone nervous about the next patch/upgrade. There are two ways to respond to this. The first is to delay patching and upgrades as long as possible, which will result in the next big disaster project. The second is to increase your patch/upgrade frequency, so everyone becomes well versed in what they have to do, and it becomes a well oiled machine. You get good at what you do frequently. As you might expect, I prefer the second option. I’ve fought long and hard to get my company into a quarterly patching schedule, and it will only decrease in frequency over my dead body!

Assuming the results of these polls are representative of the wider community, I feel like Oracle need to sit up and take notice. Patching is better than it was, but “less bad” is not the same as “good”. It is still too complicated, and too prone to introducing new issues IMHO!

Cheers

Tim…

Database Patching : It’s a difficult subject

If you came here hoping I was going to say there are valid reasons not to patch, you are out of luck. There is never a valid reason not to patch…

Instead this post is more about the general approach to patching. I’ve spent 22+ years writing about Oracle, including how to install it, but I’ve written practically nothing about how to patch a database. My stock answer is “read the patch notes”, and to be honest that is probably the best thing anyone can do. Although patching is a lot more standardized these days, it’s still worth reading the patch notes in case something unexpected happens. In this post I just want to talk about a few top-level things…

Patching to a new ORACLE_HOME

There are two big reasons for patching to a new ORACLE_HOME, or out-of-place patching.

  1. You can apply the binary patches to the new home while the database is still running in the old home, so you reduce the total amount of downtime.
  2. You have a natural fallback in the event of wanting to revert the patch. You don’t have to wait for the patch rollback to complete.

There are some downsides though.

  1. It requires extra space to hold both the unpatched and patched homes, until you reach a point where you are happy to remove the unpatched home.
  2. If you have any scripts that reference the ORACLE_HOME, they will need to be updated. Hopefully you’ve centralized this into a single environment setup script.
  3. I guess it’s a little more complicated, and the patch notes are not that helpful.

So should you follow the recommendation of patching to a new home or not? The answer as always is “it depends”.

The reduction in downtime for a single instance database is good, but if you are running RAC or Data Guard, this isn’t really an issue as the database remains online for most of the patching anyway. Having a quick fallback is great, but once again if you are running RAC or Data Guard this isn’t a big deal.

If you are running without RAC or Data Guard, you have made a decision that you can tolerate a certain level of downtime, so is taking the system down for an hour every quarter that big a deal? I’ve heard of folks who use RAC and/or Data Guard who still bring the whole system offline to patch, so the decision is probably going to be very different for people, depending on their environment and the constraints they are working with.

I hope you’re taking OS and database backups before patching. If something catastrophic happens, such that a rollback of the patch is not possible, you can recover your original home and database from the backups. Clearly this could take a long time, depending on how your backups are done, but the risk of loss is low. So the question is, can you tolerate the additional downtime?

You have to make a decision on the pros and cons of each approach for you, and of course deal with the consequences. If in doubt, go with the recommendation and patch to a new home.

Read-only Oracle homes

Read-only Oracle homes were introduced in 18c (here) as an option, and are the default from Oracle 21c onward. One of the benefits of read-only Oracle homes is they make switching homes so much easier. You haven’t got to worry about copying configuration files between homes, as they are already located outside the home.

Release Update (RU) or Release Update Revision (RUR)?

You have a choice between patching using a Release Update (RU), or a Release Update Revision (RUR). To put it simply, an RU contains not only the latest security patches and regression fixes, but may also include additional functionality, so the risk of introducing a new bug is higher. An RUR is just the security patches and regression fixes. Unlike the Critical Patch Updates (CPUs) of the past, that ran on endlessly, RURs are tied to specific RUs, so you will end up applying the RUs, but at a later date, when hopefully the bugs have been sorted by the RUR…

The folks at Oracle suggest applying the RUs, which is what I (currently) do. Some in the Oracle community suggest applying RURs is the safer strategy. If you look at the “Known Issues” for each RU, and the list of recommended one-off patches that should be applied after the RU, you can see why some people are nervous of going directly to RUs.

Once again, this comes down to you and your experience of patching with the feature set you use. If you are finding RUs are too problematic, go with the RUR approach. You can always change your mind at any time…

Monthly Recommended Patches (MRPs)

There’s a new kid on the block: monthly recommended patches (MRPs), which start with 19.17 on Linux. They replace RURs. There are 6 MRPs per RU, with each MRP containing the RU and the current batch of recommended one-off patches, as documented in MOS Note 555.1.

I’m assuming these are rolling and standby-first patches, but I can’t confirm that yet.

RAC Patching : Rolling Patches

Rolling patches can be applied one node at a time, so there are always database instances running, which means the database remains available for the whole of the patching process.

Release Updates (RUs) and Release Update Revisions (RURs) are always rolling patches, so it makes sense to take advantage of this approach. If you are applying one-off patches, these may not be rolling patches, so always check the patch notes to make sure.

Even when rolling patches are available, you can still make the decision to take the whole system offline to apply the patches. I’m not sure why you would want to do this, but the option is there for you.

Data Guard : Standby-First Patches

Release Updates (RUs) and Release Update Revisions (RURs) are always standby-first patches. This gives you some flexibility on how you approach patching your system. Here are two scenarios with a two node Data Guard setup, where node 1 is the primary and node 2 is the standby.

Scenario 1 : Switchovers

  • Patch the node 2 binaries (not datapatch) and bring the standby back into recovery mode.
  • Switchover roles, making the node 2 the primary and node 1 the standby.
  • Patch the node 1 binaries (not datapatch) and bring the standby back into recovery mode.
  • Run datapatch against node 2 (the primary database).
  • Optionally switchover roles making node 1 the primary database again.

Scenario 2 : No switchovers

  • Patch the node 2 binaries (not datapatch), but don’t start the standby.
  • Patch the node 1 binaries (not datapatch) and start the database.
  • Start the standby on node 2.
  • Run datapatch on node 1 (the primary).

Scenario 1 reduces downtime, as the primary is always running while the standby is having the binaries patched. Scenario 2 is simpler, but has a more extensive downtime as the primary is out of action while the binaries are being patched.
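To make the comparison concrete, here is a minimal Ruby sketch that models the two scenarios as ordered step lists. The Step structure, the constant names and the availability flags are my own illustrative assumptions taken from the descriptions above, not anything produced by Oracle tooling. A switchover is treated as "available" here, even though in practice it causes a brief interruption.

```ruby
# Toy model of the two Data Guard patching scenarios described above.
# Each step records whether the primary database is open to users while
# that step runs. The flags are assumptions based on the prose, not
# measurements from a real system.
Step = Struct.new(:description, :primary_available)

SCENARIO_1 = [
  Step.new("Patch node 2 binaries (standby), restart recovery", true),
  Step.new("Switchover: node 2 becomes primary", true), # brief blip ignored
  Step.new("Patch node 1 binaries (now standby), restart recovery", true),
  Step.new("Run datapatch against node 2 (primary)", true),
  Step.new("Optional switchover back to node 1", true)  # brief blip ignored
].freeze

SCENARIO_2 = [
  Step.new("Patch node 2 binaries, leave the standby down", true),
  Step.new("Patch node 1 binaries, then start the database", false),
  Step.new("Start the standby on node 2", true),
  Step.new("Run datapatch on node 1 (primary)", true)
].freeze

# Return the descriptions of the steps during which the primary is offline.
def outage_steps(scenario)
  scenario.reject(&:primary_available).map(&:description)
end

puts "Scenario 1 outage steps: #{outage_steps(SCENARIO_1).length}"
puts "Scenario 2 outage steps: #{outage_steps(SCENARIO_2).length}"
```

In this model the only step that takes the primary offline is the binary patching of node 1 in scenario 2, which is the trade-off described above.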

Remember, one-off patches may not be standby-first patches, so you may only have the option of scenario 2 when applying them. You have to read the patch notes.

OJVM Patching : Which approach?

Oracle 21c has simplified the OJVM patching situation. In previous releases the OJVM patches were completely separate. The grid infrastructure (GI) and database patches for 21c include the OJVM patches. For 19c the OJVM patches are still separate.

The separate 19c OJVM patches come with additional restrictions. They are not standby-first patches, and according to the patch notes, they can only be applied as RAC rolling patches if you use out-of-place patching.

Why don’t you write about patching much?

Writing about patching is difficult, because everyone has a unique environment, and their own constraints placed on them by their business. I’ve always avoided writing too much about patching because I know it’s opening myself up for criticism. Whatever you say, someone will always disagree because of their unique situation, or demand yet another patching scenario because of their unique environment. You’re damned if you do, and damned if you don’t.

I’ve recently written a few patching articles for specific scenarios (here). I may add some more, but it’s not going to be a complete list, and don’t expect me to write articles about stuff I don’t use, like Exadata. These are purely meant as inspiration for new people. Ultimately, you need to read the patch notes and decide what is best for you!

Let the cloud do it!

If all this is too much hassle, you do have the option of moving your database to the cloud and letting them worry about patching it. 🙂

Conclusion

Read the patch notes!

Cheers

Tim…

Vagrant : “SSH auth method: private key” – Timed out…

Out of nowhere I recently started to get problems with Vagrant running on a Windows 11 host. The “vagrant up” command would always hang at the “SSH auth method: private key” stage. You can see an example of the output here.

    default: SSH address: 127.0.0.1:2222
    default: SSH username: vagrant
    default: SSH auth method: private key
Timed out while waiting for the machine to boot. This means that
Vagrant was unable to communicate with the guest machine within
the configured ("config.vm.boot_timeout" value) time period.

If you look above, you should be able to see the error(s) that
Vagrant had when attempting to connect to the machine. These errors
are usually good hints as to what may be wrong.

If you're using a custom box, make sure that networking is properly
working and you're able to connect to the machine. It is a common
problem that networking isn't setup properly in these boxes.
Verify that authentication configurations are also setup properly,
as well.

If the box appears to be booting properly, you may want to increase
the timeout ("config.vm.boot_timeout") value.

I didn’t get the same issue on other hosts (Windows 10 and macOS), so I figured it was something specific to Windows 11.

That started me Googling, which came back with a bunch of results, each of which appeared to work for some people, but not all. I’ve just found one that worked for me, so I figured I would write this post to bring together the fixes I tried, regardless of success for me. If you are experiencing this issue, maybe one of these will fix things for you.

Kernel Panic – Add more vCPUs

This was my issue, and of course it was the last suggestion I found. Typical! After trying all of the proposed fixes below I eventually started watching the VirtualBox console during boot and I noticed there was a kernel panic being reported, and the boot was stalling. This seemed to be different to all the other problems people were encountering. When I started to Google for VirtualBox kernel panics, I came across a forum thread where people said they were seeing this when they had a VM with a single vCPU. It just so happened my problems were occurring when I was doing a RAC build, where the DNS node has a single vCPU. Since you build the DNS node first, I was being blocked. I never thought to try other builds, as the DNS build was the simplest and quickest. 🙂

As soon as I switched the DNS node to use 2 vCPUs the kernel panics stopped and I was able to move forward.
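If your Vagrantfile sets the CPU count directly, the change might look something like the fragment below. This is only a sketch, assuming the VirtualBox provider and a single-machine Vagrantfile. Your own build may take the value from a variable or a separate config file instead.

```ruby
# Vagrantfile fragment (assumed layout): give the VM two vCPUs.
Vagrant.configure("2") do |config|
  config.vm.provider "virtualbox" do |vb|
    vb.cpus = 2 # a single vCPU triggered the kernel panic on my Windows 11 host

    # Equivalent lower-level form, if you prefer "customize" entries:
    # vb.customize ["modifyvm", :id, "--cpus", "2"]
  end
end
```

A change like this needs the VM to be restarted, or destroyed and rebuilt, before it takes effect.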

Some references suggest maxing out the video memory also. This can be done by adding the following entry to the Vagrantfile.

vb.customize ["modifyvm", :id, "--vram", "128"]

Using a single vCPU for the DNS node worked fine on Windows 10. My Windows 10 machine has the same version of VirtualBox and Vagrant installed, so it wasn’t anything to do with those software versions directly.

Extend Boot Timeout

The notes produced by the timeout suggest increasing the timeout by adding the following to the Vagrantfile.

config.vm.boot_timeout = 600

Needless to say this was the first thing I tried and it didn’t work for me.

BIOS Settings

Some people experienced this symptom when the virtualization features were disabled in their BIOS. I checked my BIOS and all the virtualization features were enabled, so this wasn’t my issue.

Some people updated their BIOS, which fixed the issue for them.

Windows Hypervisor Clash

Some people experienced a similar symptom when they had Hyper-V enabled on their PCs. The recommendations vary, but many people suggest turning off the following features.

  • Hyper-V (this was not present on my PC)
  • Virtual Machine Platform (already disabled for me)
  • Windows Hypervisor Platform (already disabled for me)

I found several threads discussing this, but this is the thread I remember seeing first.

If you want to check this, click the Windows button and type “Turn Windows features on or off” to open the dialog.

VirtualBox and Vagrant Software Versions

Some people experienced a similar symptom after upgrading VirtualBox, Vagrant or both. I had recently upgraded to VirtualBox 6.1.38 and Vagrant 2.3.0, but both had worked fine (I think) since the upgrade. Despite this I tried the following:

  • Removing VirtualBox and Vagrant and installing the previous versions.
  • Removing VirtualBox and Vagrant and installing the previous version of VirtualBox, and the latest version of Vagrant.
  • Removing VirtualBox and Vagrant and installing the latest version of VirtualBox, and the previous version of Vagrant.

None of these made a difference for me, so I concluded it wasn’t related to the version of VirtualBox and Vagrant specifically.

Cable Connected for Network Adapter

For some people their virtual networks were not marked as connected, as shown by checking out the network adapters in the VirtualBox settings for the resulting VM. To fix this they added the following to their Vagrantfile.

config.vm.provider "virtualbox" do |vb|
    vb.customize ["modifyvm", :id, "--cableconnected1", "on"]
end

My network adapters were all marked as “Cable Connected”, even without this Vagrantfile setting, so this wasn’t my problem.

SSH Keys

Some people were having general problems using the default insecure keys. In case you didn’t know, Vagrant boxes are built using a known insecure key. Once you create a VM from a box a new secure key is created and the insecure key is removed. It’s just a way to make sure Vagrant can always connect without a password the first time it boots a new VM.

For some people this default action was causing them problems, so they registered their own secure key and booted the boxes directly with this, setting the secure key in the Vagrantfile.
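I didn’t need this myself, but a Vagrantfile using your own key might look something like the sketch below. The key path is purely hypothetical, and it assumes the matching public key is already in the authorized_keys file of the "vagrant" user inside the box.

```ruby
# Vagrantfile fragment (assumed layout): connect with your own key pair.
Vagrant.configure("2") do |config|
  # Don't swap in a newly generated key on first boot.
  config.ssh.insert_key = false

  # Hypothetical path to your private key. Adjust for your system.
  config.ssh.private_key_path = ["~/.ssh/my_vagrant_key"]
end
```

If the public key isn’t already in the box, Vagrant will not be able to connect, and you will see the same timeout symptom.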

Conclusion

It’s unfortunate that several situations can result in the same symptom, which led me to chase my tail for several days. Fortunately I had other machines I could use for builds, so I wasn’t totally stumped.

Hopefully you won’t have to, but if you run into this issue, give some of these a try and see if any of them fix things for you.

Cheers

Tim…