Server Problems : Any ideas?

 

I’m pretty sure last night’s problem was caused by a disk failure in the RAID array. The system is working now, but it might go down sometime today to get the disk replaced. Hopefully they won’t do what they did last time and wipe the bloody lot! 🙂

This post is not about this specific incident, but more generally about what I’ve experienced over recent weeks/months.

Background

As many of you will know I’ve been having some intermittent issues with my server for a while now.

It’s not predictable. Sometimes it doesn’t happen for ages. Sometimes it happens several times in one day. Most of the time I notice it pretty quickly, reboot it and all is fine for a while. Sometimes, like last night, it happens while I’m asleep and it is down for a few hours before I notice it.

The last time it was giving me a lot of trouble I planned to spend a weekend rebuilding the site on AWS on two nodes with a load balancer between them to give me a bit of extra availability, but by the time the weekend came it was behaving again and I decided maybe the extra cost was not a great idea. 🙂

Without knowing what the problem is, I’m worried I will go to the effort of moving everything, only to find I get the same problems in the new location, especially if it is something stupid I’ve done. 🙂

What do I see? Nothing!

If I don’t spot it before, I get a message from Uptime Robot saying the site is down. It only checks every 15 minutes, so the actual failure has usually occurred some time before the alert arrives. When it happens I usually check the following.

  • The last entries in the webserver access and error logs. So far they never show me anything interesting, just that the webserver stops delivering pages when the issue happens.
  • The output from sar shows nothing out of the ordinary from a load perspective (CPU, memory, disk, network) in the lead up to the failure. The server is massively overpowered for what I need, so everything looks pretty much idle most of the time.
  • There is nothing relevant in the “/var/log/messages” file. The entries just stop, then start again when the server is rebooted. There is no pattern linking the last message in the log to the failure. I don’t get a lot of messages logged here, so typically the last message is several hours before the failure.
  • Nothing in the MySQL logs.
  • Nothing in the cron log. No jobs fire near the time of failure that could have caused a problem.
  • Nothing in any of the other logs available in the “/var/log” directory, or subdirectories.
  • Previously I have done memory tests, hard disk scans and checks on the RAID config, which don’t reveal any problems. Obviously, this time there is a RAID issue, but this is not typically the case when I have these issues.
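
The "entries just stop, then start again" pattern can at least be measured. The following is a minimal Python sketch, not a definitive tool: it assumes the classic syslog stamp format ("Mon DD HH:MM:SS") at the start of each line, and because that format has no year, the caller has to supply one. The function names and the sample lines are made up for illustration. On a quiet log it only bounds the window in which the failure happened; it does not pinpoint it.

```python
from datetime import datetime, timedelta


def parse_syslog_time(line, year):
    """Parse a leading 'Mon DD HH:MM:SS' stamp; return None if there isn't one."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def largest_gaps(lines, year, top=3):
    """Return the `top` longest silences as (gap, last_line_before, first_line_after)."""
    stamped = []
    for line in lines:
        ts = parse_syslog_time(line, year)
        if ts is not None:  # skip lines without a recognisable stamp
            stamped.append((ts, line.rstrip("\n")))
    gaps = [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(stamped, stamped[1:])
    ]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]


# Made-up sample lines, purely for illustration (hostname and messages are invented).
SAMPLE = [
    "Mar  5 01:00:01 web1 example: made-up entry A",
    "Mar  5 01:10:01 web1 example: made-up entry B",
    "Mar  5 07:30:00 web1 example: made-up entry C (after reboot)",
    "Mar  5 07:31:00 web1 example: made-up entry D",
]

if __name__ == "__main__":
    # The year is arbitrary here; pass the real one for a real log.
    for gap, before, after in largest_gaps(SAMPLE, year=2016, top=1):
        print(f"{gap} of silence")
        print(f"  last before: {before}")
        print(f"  first after: {after}")
```

For a real log, the lines would come from something like `open("/var/log/messages")` instead of `SAMPLE`. A log that spans a year boundary would need extra handling that this sketch leaves out.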

Typically, when it happens I can’t do anything on the server. Not even SSH to it. I have to force a restart using the admin tool on the hosting company admin console. I once managed to get a serial connection to the server, just as it was happening, but I was not able to run any commands, so it didn’t help.
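
Since nothing can be run once the server has locked up, one option is to record evidence beforehand. Below is a minimal Python sketch of a heartbeat file: each run appends a timestamp plus the load average and forces the line to disk, so the last line written gives a tighter bound on when things stopped than a 15-minute external check. The function names and the file path are hypothetical, and it assumes a Python 3 interpreter is installed, which is not the default on CentOS 6.

```python
import os
import sys
import time


def read_loadavg(path="/proc/loadavg"):
    """Return the raw load-average line, or 'n/a' where /proc is unavailable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return "n/a"


def write_heartbeat(path, now=None):
    """Append one timestamped line and force it to disk before returning."""
    now = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    line = f"{stamp} loadavg={read_loadavg()}\n"
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device now
    return line


def last_heartbeat(path):
    """Return the final line of the heartbeat file, or None if it is empty."""
    last = None
    with open(path) as f:
        for line in f:
            last = line.rstrip("\n")
    return last


if __name__ == "__main__":
    # Intended to be run once a minute (e.g. from cron) with the target file
    # as the only argument, such as a hypothetical /var/tmp/heartbeat.log.
    if len(sys.argv) > 1:
        write_heartbeat(sys.argv[1])
```

After a forced restart, `last_heartbeat()` on the same file shows the last minute in which the script still managed to write. If the heartbeats stop well before the site check fails, that in itself hints that disk writes were stalling first.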

It’s a dedicated server running fully patched CentOS 6. It has MySQL 5.7 and PHP7, but the issues pre-date those. Previously it was running MySQL 5.6 and whatever PHP version came with the CentOS 6 yum repository. 🙂

Question

Anyone got any ideas what I can check next time it happens? I’m at a bit of a loss.

Like I said, I don’t want to up-sticks and move to a new server or VM on AWS if the problem is of my own making. 🙂

Cheers

Tim…

PS. Remember, I’m not a system administrator. I just know enough to be dangerous. 🙂

Author: Tim...

DBA, Developer, Author, Trainer.

5 thoughts on “Server Problems : Any ideas?”

  1. You can try watchdog
    http://www.sat.dundee.ac.uk/psc/watchdog/watchdog-background.html

    Also – do the “hard disk scans” include S.M.A.R.T. checks as well? And what does your monitoring show? Any patterns? Do you have any monitoring, by the way? Something like munin for example: http://munin-monitoring.org/

    Last time I experienced something like this we ended up filing a bug report with RedHat (it was RedHat Enterprise) and it was fixed a couple of months later in the regular updates.

    And migration to AWS…. well – I believe it is worth it 🙂 Really.

  2. misiaq: I don’t have “performance problems” as such. I don’t see it going slow. It seems to be binary. Working or not. 🙂

    I’ll take a look at that stuff. Thanks!

    Cheers

    Tim…

  3. Hi Tim!
    I had a similar issue with one of the servers at my previous company. The problem was with the disks mounted as the root partition, and it seems the same may apply in your case, mostly because you can’t SSH into it (the disks are not able to write the log for the SSH session).

  4. Hi Tim.
    I have something to add, though it is not specific to this RAID problem, as you said you could not SSH to the server, etc.
    Do you not like to use Webmin? You are indeed a MASTER of the CLI, but you can add a huge number of things to Webmin and, with just one click, watch/monitor everything.

    PS :
    Two years ago I worked at a small company in Frankfurt where three of my colleagues worked only with Oracle DB and had little idea about server setup, etc. When they asked, I shared what I knew, starting with Group Packaging, YUM Repo, etc. from your site, plus tools like Webmin, Wireshark, Monit (not my favorite, though) and Memcached. They were very glad. I told them that I had been spoiled by reading your articles, which enhanced my skills 🙂
