This is a follow-on from my server problems post from yesterday…
Regarding the general issue, misiaq came up with a great suggestion, which was to use watchdog. It’s not going to “fix” anything, but if I get a reboot when the general issue happens, that would be much better than having the server sit idle for 5 hours until I wake up. 🙂 Let’s see how that works out…
Praveen asked if I use any tools like Webmin. The answer is yes and no. Just like my use of any tool (Cloud Control, SQL Developer etc.) I use a combination of command line and tools. I usually find command line more useful as I can script and reuse, but I always have tools available to fill in the gaps and provide inspiration. I don’t always invest enough time in learning the tools well, which is why useful bits of them pass me by on occasion, but I also don’t like to become dependent on tools. In the case of Webmin, it is installed on the server, but it is not exposed to the outside world. I have to tunnel in to use it, so during a problem, when I can’t SSH to the server, Webmin is not available. 🙂
Back to the specific issue from yesterday…
During my normal checks I noticed my RAID1 setup looked like this.
# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdb3[1]
      970470016 blocks [2/1] [_U]

md1 : active raid1 sdb1[1]
      4194240 blocks [2/1] [_U]

unused devices: <none>
#
Last time it looked like this, one of the hard drives had died, so I contacted the hosting company to get it sorted. After a couple of false starts, they eventually took the machine offline, tested it and said the hard drives were fine. 🙁
I added the partitions from the “/dev/sda” disk back into the RAID config. I guess I should have tried that first. 🙂
# mdadm /dev/md1 -a /dev/sda1
mdadm: added /dev/sda1
# mdadm /dev/md3 -a /dev/sda3
mdadm: added /dev/sda3
#
After it had finished rebuilding it looked like this.
# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[1]
      970470016 blocks [2/2] [UU]

md1 : active raid1 sda1[0] sdb1[1]
      4194240 blocks [2/2] [UU]

unused devices: <none>
#
So it looks like the drive just dropped out of the RAID config for no reason… Does that happen?
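Since I only spotted this during a manual check, here is a rough sketch of how the same check could be scripted. It looks for an underscore in the member map (the "[_U]" part) of each array in "/proc/mdstat". The function name "degraded_arrays" is made up for illustration, and the optional file argument is only there so the function can be tried against a saved copy of the output. Treat it as a starting point, not a replacement for proper monitoring (mdadm has its own monitor mode).

```shell
#!/bin/sh
# Hypothetical helper: print the name of each md array whose member map
# (for example [_U]) contains an underscore, i.e. a missing member.
# Reads the file given as $1, defaulting to /proc/mdstat.
degraded_arrays() {
    awk '
        /^md[^ ]* :/      { name = $1 }    # array header line, e.g. "md3 : active raid1 ..."
        /\[[U_]*_[U_]*\]/ { print name }   # status line with at least one "_"
    ' "${1:-/proc/mdstat}"
}

# Example use from cron: report only when something is degraded.
# degraded=$(degraded_arrays)
# [ -n "$degraded" ] && echo "Degraded md arrays: $degraded"
```

Run against the first output above, this would list both md3 and md1. Against the rebuilt arrays it would print nothing.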
As I said before, I’m not a system administrator. I just know enough to be dangerous. 🙂 Thanks for the comments from yesterday. They have been very helpful…
Cheers
Tim…
Was there anything about the RAID md change in the messages log, Tim?
BTW do you have something in place to let you know if your site is down? E.g. I use https://uptimerobot.com @uptimerobot which is pretty good and free for up to 50 monitors
Hi Tim.
I’m not a big System Admin guy, but I do know the basics through training.
“mdstat” didn’t actually come to mind, but “mdadm -D /dev/mdx” did.
Thanks for that “mdstat” !