Server Problem : A Resolution?

It’s been a pretty annoying couple of days on the website server front.

The server locking up intermittently is one thing, and for all I know, maybe my fault? The incompetence of the hosting company is quite something else.

Just so you are aware why I was doing my nut yesterday, the hosting company had disabled my ability to force a power cycle of my dedicated server while they did a hardware test. They forgot to re-enable it when they finished. I rang to ask them to re-enable it and also power cycle the server. It took them over 70 minutes to achieve the power cycle and it was the following day before the interface to allow me to force a power cycle was enabled again. Amateurs!

They offered to give me a free month of hosting, but I refused. Last night I moved the whole thing to Amazon Web Services so that’s the new home for the website. I finished the build and testing, then flipped the DNS and went to bed, figuring the DNS propagation can take up to 24 hours, so why hang around. 🙂

Regarding AWS:

  • I’ve gone for a pretty small instance type at the moment. I’ll see how that goes and expand if needed. It seems OK to me at the moment.
  • It’s just a single VM for now. If that proves problematic I’ll consider adding another and shoving a load balancer in front of it all. I’ve had plenty of practice with load balancers recently. 🙂
  • It’s just in a single European data centre. I’ve gone for the cheap and cheerful approach of not paying for the Multi Availability Zone option. So when the data centre in Ireland goes down and I start complaining, remind me I’m a cheapskate. 🙂
  • My email is still being handled by the old clowns. I’ve got to find a new home for that.
  • I’m not sure how much this is going to cost at this point. I’ll keep an eye on it over the next few days/weeks and decide if this is the right move. Once I’ve made a decision, I’ll buy a reserved instance of the appropriate size, which will reduce the costs a bit. Either that, or look for an alternative option that’s cheaper. 🙂

Fingers crossed.

Cheers

Tim…

Server Problems : Update

This is a follow-on from my server problems post from yesterday…

Regarding the general issue, misiaq came up with a great suggestion, which was to use watchdog. It’s not going to “fix” anything, but if I get a reboot when the general issue happens, that would be much better than having the server sit idle for 5 hours until I wake up. 🙂 Let’s see how that works out…

Praveen asked if I use any tools like Webmin. The answer is yes and no. Just like my use of any tool (Cloud Control, SQL Developer etc.) I use a combination of command line and tools. I usually find command line more useful as I can script and reuse, but I always have tools available to fill in the gaps and provide inspiration. I don’t always invest enough time in learning the tools well, which is why useful bits of them pass me by on occasion, but I also don’t like to become dependent on tools. In the case of Webmin, it is installed on the server, but it is not exposed to the outside world. I have to tunnel in to use it, so during a problem, when I can’t SSH to the server, Webmin is not available. 🙂

Back to the specific issue from yesterday…

During my normal checks I noticed my RAID1 setup looked like this.

# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sdb3[1]
      970470016 blocks [2/1] [_U]

md1 : active raid1 sdb1[1]
      4194240 blocks [2/1] [_U]

unused devices: <none>
#

Last time it looked like this, one of the hard drives had died, so I contacted the hosting company to get it sorted. After a couple of false starts, they eventually took the machine offline, tested it and said the hard drives were fine. 🙁
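In the output above, "[2/1]" means the array has two slots but only one active member, and the "_" in "[_U]" marks the missing one. A check for that state can be scripted. The following is a minimal sketch, not a tested monitoring tool: "degraded_arrays" is a hypothetical helper name, and it assumes the status string is the last field on the "blocks" line, as it is in the output shown here.

```shell
#!/bin/sh
# Sketch only: list md arrays whose member-status string contains "_",
# meaning at least one slot in the array has no active device.
# Reads mdstat-formatted text on stdin, so it can be tried on saved output.
degraded_arrays() {
    awk '
        # Header line, e.g. "md3 : active raid1 sdb3[1]": remember the array name.
        /^md[0-9]+ :/ { array = $1 }

        # Status line, e.g. "970470016 blocks [2/1] [_U]": report if a member is missing.
        array != "" && /\[[U_]+\]/ {
            if ($NF ~ /_/) print array
            array = ""
        }
    '
}
```

On a live system it would be called as "degraded_arrays < /proc/mdstat". No output means every array reports all members present. Running it from cron and mailing any output is one option, but that part is left out here.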

I added the partitions from the “/dev/sda” disk back into the RAID config. I guess I should have tried that first. 🙂

# mdadm /dev/md1 -a /dev/sda1
mdadm: added /dev/sda1
# mdadm /dev/md3 -a /dev/sda3
mdadm: added /dev/sda3
#

After it had finished rebuilding it looked like this.

# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda3[0] sdb3[1]
      970470016 blocks [2/2] [UU]

md1 : active raid1 sda1[0] sdb1[1]
      4194240 blocks [2/2] [UU]

unused devices: <none>
#

So it looks like the drive just dropped out of the RAID config for no reason… Does that happen?

As I said before, I’m not a system administrator. I just know enough to be dangerous. 🙂 Thanks for the comments from yesterday. They have been very helpful…

Cheers

Tim…

Server Problems : Any ideas?

I’m pretty sure last night’s problem was caused by a disk failure in the RAID array. The system is working now, but it might go down sometime today to get the disk replaced. Hopefully they won’t do what they did last time and wipe the bloody lot! 🙂

This post is not about this specific incident, but more generally about what I’ve experienced over recent weeks/months.

Background

As many of you will know I’ve been having some intermittent issues with my server for a while now.

It’s not predictable. Sometimes it doesn’t happen for ages. Sometimes it happens several times in one day. Most of the time I notice it pretty quickly, reboot it and all is fine for a while. Sometimes, like last night, it happens while I’m asleep and it is down for a few hours before I notice it.

The last time it was giving me a lot of trouble I planned to spend a weekend rebuilding the site on AWS on two nodes with a load balancer between them to give me a bit of extra availability, but by the time the weekend came it was behaving again and I decided maybe the extra cost was not a great idea. 🙂

Without knowing what the problem is, I’m worried I will go to the effort of moving everything, only to find I get the same problems in the new location, especially if it is something stupid I’ve done. 🙂

What do I see? Nothing!

If I don’t spot it before, I get a message from Uptime Robot saying the site is down. It only checks every 15 minutes, so the actual failure has usually occurred some time earlier. When it happens I usually check the following.

  • The last entries in the webserver access and error logs. So far they never show me anything interesting, just that the webserver stops delivering pages when the issue happens.
  • The output from sar shows nothing out of the ordinary from a load perspective (CPU, memory, disk, network) in the lead up to the failure. The server is massively overpowered for what I need, so everything looks pretty much idle most of the time.
  • There is nothing relevant in the “/var/log/messages” file. The entries just stop, then start again when the server is rebooted. There is no pattern linking the last message in the log to the failure. I don’t get a lot of messages logged here, so typically the last message is several hours before the failure.
  • Nothing in the MySQL logs.
  • Nothing in the cron log. No jobs firing near the time of failure that could have caused a problem.
  • Nothing in any of the other logs available in the “/var/log” directory, or subdirectories.
  • Previously I have done memory tests, hard disk scans and checks on the RAID config, which don’t reveal any problems. Obviously, this time there is a RAID issue, but this is not typically the case when I have these issues.
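Most of the checks in the list above come down to looking at the tail of several logs after a reboot. A minimal sketch of bundling that into one helper follows. "tail_logs" is a hypothetical name, and the files passed to it are an assumption, so substitute whatever logs exist on your system.

```shell
#!/bin/sh
# Sketch only: print the last N lines of each log file given as an argument,
# with a header per file, so the entries just before a hang sit side by side.
# Usage: tail_logs N file [file ...]
tail_logs() {
    lines=$1
    shift
    for f in "$@"; do
        echo "=== $f ==="
        if [ -r "$f" ]; then
            tail -n "$lines" "$f"
        else
            # Missing or unreadable files are reported rather than skipped silently.
            echo "(not readable)"
        fi
    done
}
```

For example, "tail_logs 20 /var/log/messages /var/log/cron" would show the last 20 lines of each file. The second path is only an example of a typical location.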

Typically, when it happens I can’t do anything on the server. Not even SSH to it. I have to force a restart using the admin tool on the hosting company admin console. I once managed to get a serial connection to the server, just as it was happening, but I was not able to run any commands, so it didn’t help.

It’s a dedicated server running fully patched CentOS 6. It has MySQL 5.7 and PHP7, but the issues pre-date those. Previously it was running MySQL 5.6 and whatever PHP version came with the CentOS 6 yum repository. 🙂

Question

Anyone got any ideas what I can check next time it happens? I’m at a bit of a loss.

Like I said, I don’t want to up-sticks and move to a new server or VM on AWS if the problem is of my own making. 🙂

Cheers

Tim…

PS. Remember, I’m not a system administrator. I just know enough to be dangerous. 🙂

Please be patient!

It’s extremely nice to have a big audience. It’s very flattering that people care enough about what I say to be bothered to read it. The problem with having a large audience is people can get very demanding at times. 🙂

I’ve mentioned the 1% rule (or 1-9-90) before. The number of people producing content in an internet community is really small compared to the number of people consuming it. That discrepancy can cause problems when it comes to interactions between the consumers and producers.

When someone wants to speak to me (or any other person producing content in the community) they see it as a 1:1 interaction. In actual fact it’s really a Many:1 interaction, because you are not the only person wanting to interact. 🙂

Now I don’t want to come across as butthurt by this. It’s a really nice problem to have, if you know what I mean, but it is a problem.

I have a whole list of new stuff I want to produce. I have a whole list of corrections I need to make to the articles on the site. I do community stuff like presenting. I have a full time job that is nothing to do with the website or the Oracle community. This does not leave a great deal of time for dealing with people on a one-to-one basis.

If you want to speak to me I will try to answer, but please don’t take offence if I can’t. Just because my status on Facebook/Twitter/Google+ is online, it doesn’t mean I am free to speak to you. The fact I don’t respond doesn’t mean I don’t care.

In reference to this, Tom Kyte once said, “The more you do, the more people want you to do.” That’s so true.

Cheers

Tim…

Video : Flashback Version Query

Today’s video gives a quick run through of flashback version query.

If you prefer to read articles, rather than watch videos, you might be interested in these.

The star of today’s video is Tanel Poder. I was filming some other people, he saw something was going on, came across and struck a pose. I figured he knew what I was doing, but it’s pretty obvious from the outtake at the end of the video that he was blissfully unaware and just wanted in on the action, whatever it was! A true star! 🙂

Cheers

Tim…

Learning to answer questions for yourself!

It’s not important that you know the answer. It’s important you know how to get the answer!

I’m pretty sure I’ve written this before, but I am constantly surprised by some of the questions that come my way. Not surprised that people don’t know the answer, but surprised they don’t know how to get the answer. The vast majority of the time someone asks me a question that I can’t answer off the top of my head, this is what I do in this order.

  1. Google their question, often using the subject line of their post or email. A lot of the time, the first couple of links will give me the answer. Sometimes it’s one of my articles that gives me the answer. 🙂
  2. Search the Oracle documentation for the topic and/or quickly scan through the table of contents in the relevant manuals.
  3. Search My Oracle Support.

It is very rare I’ve not got the answer by the time I’ve finished (3). Typically it takes me about 5 minutes to complete this search.

If I get to the end of this process and I don’t have the answer, then I have to start some real investigation, which will often involve the same steps again, but taking more time on each, trying to exhaust my options for terms and phrases to search with.

When I get what I think is the answer, I test it to make sure it really is the answer!

Now I realise there are going to be things that fall outside of this process, but in my experience, the vast majority of questions people ask are pretty simple to answer if they follow this method. There are also some other advantages to doing it yourself.

  • You start to recognise people who are good in specific areas. That helps you build a list of trusted sources. The last thing you want to do is take advice off a chump!
  • During the process of trying to find answers for yourself, you invariably learn other things that are potentially more interesting and important than the answer to your initial question.
  • By testing stuff, you learn stuff. There is a lot of bad information out on the internet. Sometimes it’s just downright wrong. Sometimes it was correct for the version it was written against, but is not correct for the current version. Don’t trust anyone, even “good people”. Always confirm the answers for yourself.

Most people don’t walk round with all this stuff in their head, but they know where (and how) to look for the information when they need it!

“Give a man a fish and you feed him for a day. Teach a man to fish and you feed him for a lifetime.”

Go fish! 🙂

Cheers

Tim…

Video : Flashback Query

Today’s video is a quick demo of flashback query.

If you prefer to read articles, rather than watch videos, you might be interested in these articles.

The cameo for this video comes courtesy of Dina Blaschczok, a DBA based in South Africa and a friend of the family. When the wife goes down to SA, Dina takes care of her and occasionally introduces her to big cats. 🙂

Cheers

Tim…

WordPress 4.5 Released

WordPress 4.5 “Coleman” has been released.

I just applied it to the five WordPress sites I manage by manually triggering the auto-update and everything went through fine.

There are some updates to the standard themes that you will need to manually trigger for update, but there was no drama there either.

I fully expect a rash of little updates to get released over the coming days as new bugs are spotted. 🙂

Happy upgrading. 🙂

Cheers

Tim…