10g CRS restart after power failure, feature or bug?

In 10g RAC the Cluster Ready Services (CRS) software is installed in its own $ORACLE_HOME; for the sake of argument let’s call this $CRS_HOME. In this directory there are a number of subdirectories including:

$CRS_HOME/crs/init
$CRS_HOME/css/init
$CRS_HOME/evm/init

When the CRS daemons are running these directories contain an assortment of files with names like:

myserver.mydomain.com.pid
.lock-myserver.mydomain.com
myserver.mydomain.com.lck

When CRS is shut down cleanly these files are managed such that CRS will start up again without manual intervention, but when there is a power failure on one or more nodes the files aren’t cleaned up. The effect of this is that the CRS daemons won’t start properly until you manually clean up the mess.
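To see what has been left behind after a crash, something like the following might help. It is only a rough sketch, not an Oracle-supplied tool: the function name `find_leftover_files` is made up for illustration, it assumes $CRS_HOME is set in the environment, and it assumes the leftover files follow exactly the three naming patterns listed above. It only lists the files. Whether removing them is safe on your platform and version is something to confirm before touching anything.

```python
import os
import socket

# The three init directories named in the post, relative to $CRS_HOME.
INIT_SUBDIRS = ("crs/init", "css/init", "evm/init")


def find_leftover_files(crs_home, hostname=None):
    """List files under the init directories that match the post's naming patterns.

    Hypothetical helper: it only reports what exists, it deletes nothing.
    """
    # Assumption: the files are named after the fully qualified host name.
    hostname = hostname or socket.getfqdn()
    candidates = (
        hostname + ".pid",
        ".lock-" + hostname,
        hostname + ".lck",
    )
    found = []
    for subdir in INIT_SUBDIRS:
        directory = os.path.join(crs_home, subdir)
        if not os.path.isdir(directory):
            continue  # directory missing, nothing to report
        for name in candidates:
            path = os.path.join(directory, name)
            if os.path.exists(path):
                found.append(path)
    return found


if __name__ == "__main__":
    crs_home = os.environ.get("CRS_HOME")
    if crs_home:
        for path in find_leftover_files(crs_home):
            print(path)
    else:
        print("CRS_HOME is not set")
```

Run it on each node after the power comes back and before attempting to start CRS, so you know which files you are dealing with.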

RAC is a high availability solution, but it is crippled by a power failure. Is that a bug or a feature?

Note. I’m talking about the way CRS (10.1.0.3.0) works on Tru64. I’d be interested to know if it’s the same for CRS on other platforms. Also, I believe some changes have happened to the startup and shutdown of CRS in 10.1.0.4.0, but that’s not released for Tru64 yet, and a recent message on an HP forum suggests that Oracle will skip this patch and wait for 10.1.0.5.0 for Tru64.

Fun, fun, fun…

Cheers

Tim…

Author: Tim...

DBA, Developer, Author, Trainer.

2 thoughts on “10g CRS restart after power failure, feature or bug?”

Hi Tim,

Have you had an opportunity to watch what happens when the private cluster interconnect fails in a RAC cluster? Are you using any bonding technologies with Tru64? We are investigating Sun’s IPMP to fail over physical interfaces to avoid issues, and Oracle states that this is a “supported” solution. It would be nice if Oracle would stripe traffic across multiple interfaces if you devote them to Cache Fusion traffic.

– Ryan

Hi.

I can’t say I’ve ever seen the consequences of the interconnect failing. We’ve got a dual memory channel hub, so if the pair fail I guess it’s bye bye.

Apart from the dual memory channel, the only other thing we use is netRAIN to make two network cards act as one, giving the ability to withstand a single network card outage.

So we should withstand a single network card failure, or a single memory channel failure…

There’s lots of stuff I like about RAC, but there’s lots of stuff that worries me too 🙂

Cheers

Tim…
