This weekend our production system was switched from a 3-node to a 2-node RAC.
We were originally using a 2-node RAC (2 CPUs per node) and we added a third node because the system was struggling to cope with the workload. The third node helped us out in some ways, but it caused a lot of trouble in others. Ever since its inclusion it has been impossible to take one node out of the RAC without bringing the whole lot crashing down. So much for high availability. In addition, a substantial proportion (about 30%) of the wait events on the system were due to inter-node communication. I expected more nodes to mean more inter-node communication, but 30% seems excessive. Heaven only knows what would happen in a 4-node cluster…
After a lot of back-and-forth with Oracle and HP we’ve finally decided to try a 2-node RAC again, but this time with 3 CPUs per node. OK, it’s actually 4 CPUs per node, but one CPU in each node is permanently offlined so as not to affect our current Oracle licensing.
All the hardware modifications are complete and all tests indicate that the system is up and running normally. Of course the true test will happen tomorrow morning when the users log in and start to break things 🙂
The best news of all is that the move back to a 2-node cluster means that we can once again shut down one node at a time if we need to do maintenance. This is a big plus.
If everything goes quiet over the next few days it means that I’m fire-fighting and the switchover didn’t go well.
I’d be curious to see how many people out there are using RAC on more than 2 nodes. I’ve only done this on Tru64 with 10g Release 1, but I can say without a shadow of a doubt that it doesn’t work properly. I’m curious whether this is a Tru64-specific problem or whether there is a fundamental flaw in RAC for clusters with more than 2 nodes.