I’m pretty sure last night’s problem was caused by a disk failure in the RAID array. The system is working now, but it might go down sometime today to get the disk replaced. Hopefully they won’t do what they did last time and wipe the bloody lot! 🙂
The subject of this post is not about this specific incident, but more generally about what I’ve experienced over recent weeks/months.
As many of you will know I’ve been having some intermittent issues with my server for a while now.
It’s not predictable. Sometimes it doesn’t happen for ages. Sometimes it happens several times in one day. Most of the time I notice it pretty quickly, reboot it and all is fine for a while. Sometimes, like last night, it happens while I’m asleep and it is down for a few hours before I notice it.
The last time it was giving me a lot of trouble I planned to spend a weekend rebuilding the site on AWS on two nodes with a load balancer between them to give me a bit of extra availability, but by the time the weekend came it was behaving again and I decided maybe the extra cost was not a great idea. 🙂
Without knowing what the problem is, I’m worried I will go to the effort of moving everything, only to find I get the same problems in the new location, especially if it is something stupid I’ve done. 🙂
What do I see? Nothing!
If I don’t spot it first, I get a message from Uptime Robot saying the site is down. It only checks every 15 minutes, so the actual failure usually occurs some time before the alert arrives. When it happens I usually check the following.
- The last entries in the webserver access and error logs. So far they’ve never shown me anything interesting, just that the webserver stops delivering pages when the issue happens.
- The output from sar shows nothing out of the ordinary from a load perspective (CPU, memory, disk, network) in the lead up to the failure. The server is massively overpowered for what I need, so everything looks pretty much idle most of the time.
- There is nothing relevant in the “/var/log/messages” file. The entries just stop, then start again when the server is rebooted. There is no pattern to the last message in the log and the failure. I don’t get a lot of messages logged here, so typically the last message is several hours before the failure.
- Nothing in the MySQL logs.
- Nothing in the cron log. No jobs firing near the time of failure that could have caused a problem.
- Nothing in any of the other logs available in the “/var/log” directory, or subdirectories.
- Previously I have done memory tests, hard disk scans and checks on the RAID config, which didn’t reveal any problems. Obviously, this time there is a RAID issue, but that is not typically the case when I have these issues.
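Since the entries in “/var/log/messages” just stop and then resume at reboot, one quick sanity check is to find the biggest quiet gap in the log timestamps, which gives a rough upper bound on when the server went silent. Here’s a small sketch of that idea (the log lines are made up, and `last_message_before_gap` is just an illustrative helper, not a tool I actually use):

```python
from datetime import datetime

def last_message_before_gap(lines, year=2017, min_gap_minutes=30):
    """Scan classic syslog-style lines ("Mon dd HH:MM:SS host ...") and
    return (gap_minutes, last_before, first_after) for the largest quiet
    period of at least min_gap_minutes, or None if there isn't one."""
    times = []
    for line in lines:
        # Syslog timestamps omit the year, so we have to supply one.
        stamp = " ".join(line.split()[:3])
        times.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))

    biggest = None
    for prev, cur in zip(times, times[1:]):
        gap = (cur - prev).total_seconds() / 60
        if gap >= min_gap_minutes and (biggest is None or gap > biggest[0]):
            biggest = (gap, prev, cur)
    return biggest

# Made-up sample: the third entry is the first message after the reboot.
log = [
    "Feb 12 01:05:17 myhost kernel: usb 1-1: reset high-speed USB device",
    "Feb 12 03:10:42 myhost ntpd[1001]: synchronized to time server",
    "Feb 12 09:30:05 myhost kernel: Initializing cgroup subsys cpuset",
]
gap, last_before, first_after = last_message_before_gap(log)
# The server went quiet some time after 03:10:42 and came back at 09:30:05.
```

It won’t tell you *why* the box died, but it does narrow the window you need to cross-reference against sar and the cron log.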
Typically, when it happens I can’t do anything on the server. Not even SSH to it. I have to force a restart using the admin tool on the hosting company admin console. I once managed to get a serial connection to the server, just as it was happening, but I was not able to run any commands, so it didn’t help.
It’s a dedicated server running fully patched CentOS 6. It has MySQL 5.7 and PHP 7, but the issues pre-date those. Previously it was running MySQL 5.6 and whatever PHP version came with the CentOS 6 yum repository. 🙂
Anyone got any ideas what I can check next time it happens? I’m at a bit of a loss.
Like I said, I don’t want to up sticks and move to a new server or VM on AWS if the problem is of my own making. 🙂
PS. Remember, I’m not a system administrator. I just know enough to be dangerous. 🙂