This is a follow-on from yesterday’s post about my server problems…
Regarding the general issue, misiaq came up with a great suggestion, which was to use watchdog. It’s not going to “fix” anything, but if I get a reboot when the general issue happens, that would be much better than having the server sit idle for 5 hours until I wake up. 🙂 Let’s see how that works out…
Praveen asked if I use any tools like Webmin. The answer is yes and no. As with any tool (Cloud Control, SQL Developer etc.), I use a combination of the command line and tools. I usually find the command line more useful, as I can script and reuse things, but I always have tools available to fill in the gaps and provide inspiration. I don’t always invest enough time in learning the tools well, which is why useful bits of them pass me by on occasion, but I also don’t like to become dependent on them. In the case of Webmin, it is installed on the server, but it is not exposed to the outside world. I have to tunnel in to use it, so during a problem, when I can’t SSH to the server, Webmin is not available. 🙂
Back to the specific issue from yesterday…
During my normal checks I noticed my RAID1 setup looked like this.
# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdb3[1]
      970470016 blocks [2/1] [_U]
md1 : active raid1 sdb1[1]
      4194240 blocks [2/1] [_U]
unused devices: <none>
#
Last time it looked like this, one of the hard drives had died, so I contacted the hosting company to get it sorted. After a couple of false starts, they eventually took the machine offline, tested it and said the hard drives were fine. 🙁
I added the partitions from the “/dev/sda” disk back into the RAID config. I guess I should have tried that first. 🙂
# mdadm /dev/md1 -a /dev/sda1
mdadm: added /dev/sda1
# mdadm /dev/md3 -a /dev/sda3
mdadm: added /dev/sda3
#
After it had finished rebuilding it looked like this.
# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[1]
      970470016 blocks [2/2] [UU]
md1 : active raid1 sda1[0] sdb1[1]
      4194240 blocks [2/2] [UU]
unused devices: <none>
#
So it looks like the drive just dropped out of the RAID config for no reason… Does that happen?
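If a member can drop out quietly like this, the manual "/proc/mdstat" check could be scripted so it doesn't depend on someone noticing. The following is only a rough sketch, not something used on this server: the function name degraded_arrays is made up for illustration, and it assumes the status field looks like the output shown above, where "[UU]" means all members are present and an underscore (as in "[_U]") marks a missing one.

```shell
#!/bin/sh
# Hypothetical helper: reads /proc/mdstat-style text on stdin and prints
# the name of every md array whose status field contains "_" (a missing member).
degraded_arrays() {
    awk '
        # Array header line, e.g. "md3 : active raid1 sdb3[1]"
        /^md[0-9]+ :/ { name = $1 }

        # First line after the header with a [UU]/[_U] style status field
        name != "" && /\[[U_]+\]/ {
            status = $NF                      # last field, e.g. [_U]
            if (index(status, "_") > 0) print name
            name = ""                         # done with this array
        }
    '
}

# Example usage (assumes /proc/mdstat exists on the machine):
#   degraded_arrays < /proc/mdstat
```

Run against the first listing above, this should print md3 and md1; against the rebuilt listing it should print nothing. What to do with the result (send a mail, write to a log) is left out, as that depends on the setup.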
As I said before, I’m not a system administrator. I just know enough to be dangerous. 🙂 Thanks for the comments from yesterday. They have been very helpful…
Cheers
Tim…