Aphrodite Issues (Resolved)

We are experiencing some difficulties with aphrodite.krystal.co.uk – the server was rebooted due to an unknown software error around 17:05hrs. It is now experiencing an incredibly high load at what is a peak time. We are currently waiting for the server to stabilise pending an investigation into the nature of the original problem.

Update 24/04 19:32 One of the hard drives in Aphrodite’s RAID array has failed. We’re now removing the affected hard drive and will boot the server as soon as it’s possible to do so safely. Please bear with us.

Update 19:48 The server is now back up, though in a “degraded” RAID mode – it only has 1 hard drive instead of the normal 2. Performance will be adversely affected until the RAID array is rebuilt. Thanks for your patience.

Update 20:33 The failed hard drive has been replaced and the RAID array is currently rebuilding. We expect this will take around 12 hours, during which time the server will be less responsive than normal as the data set is rebuilt. We therefore advise users to refrain from any heavy usage and to be aware that there is a higher than normal chance of an outage.

Update 21:41 The rebuild is at 20%.

Update 00:15 The rebuild is at 80%.

Update 25th April 01:00 The rebuild of the RAID array is now complete. We can’t be 100% sure that this problem was caused by a failing disk, so we will maintain our extra monitoring on this server for some time.
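
For anyone wondering how the progress figures in the updates above relate to the finish time, here is a minimal sketch that extrapolates a completion time from two progress readings. It assumes the rebuild proceeds at a roughly constant rate, which a busy server will not strictly do. The function name estimate_completion and the placeholder year are illustrative assumptions only (the updates do not state a year), and this is not the tooling used on the server.

```python
from datetime import datetime, timedelta

# Placeholder only: the updates give dates and times but no year.
PLACEHOLDER_YEAR = 2000


def estimate_completion(t1, pct1, t2, pct2):
    """Linearly extrapolate when progress reaches 100% from two readings.

    Hypothetical helper: assumes a constant rebuild rate between and
    after the two readings.
    """
    if pct2 <= pct1:
        raise ValueError("progress must increase between readings")
    elapsed = (t2 - t1).total_seconds()
    # Time left = remaining percent * (seconds taken per percent so far)
    remaining = (100.0 - pct2) * elapsed / (pct2 - pct1)
    return t2 + timedelta(seconds=remaining)


# The two progress readings taken from the updates above.
first = datetime(PLACEHOLDER_YEAR, 4, 24, 21, 41)   # rebuild at 20%
second = datetime(PLACEHOLDER_YEAR, 4, 25, 0, 15)   # rebuild at 80%

eta = estimate_completion(first, 20.0, second, 80.0)
print(eta.strftime("%H:%M"))  # prints 01:06
```

The extrapolated time lands within a few minutes of the 01:00 completion reported above, which suggests the rebuild rate was fairly steady over the evening.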

22 thoughts on “Aphrodite Issues (Resolved)”

  1. Hi, I can’t access my website at all. Is this due to the Aphrodite issue? Any idea how long it will be down for?

  2. Graham says:

    It appears I have a blackout, I can’t get on to my site or back end. Anything to report at all??

  3. Guys – more problems – any ETA for a fix?

  4. Graham says:

    ok im back now, hopefully this is the last of it.

  5. krystalstatus says:

    Hi Guys,

    We are still not sure what’s going on with Aphrodite, we have increased the frequency of our monitoring and are keeping a close eye on it – sorry for the inconvenience.

  6. Jon Davies says:

    Hi, We really appreciate the close eye but surely you have an idea of what type of traffic caused the overload and the server to go down for whatever reason and can accurately trace the source and block it at a lower layer than you do currently.

    Your network team should be quickly able to identify the source of any connection overloads from a single source IP and block it; however as we are all on the same IP the same is true, one bad apple screws us all as per the Spamhaus issue.

    I look forward to reading the post mortem on the outage if it will ever be available.

    Jon.

    • krystalstatus says:

      Hi Jon,

      From what we can ascertain from our logs, the server is running out of memory almost instantly, which causes it to become totally unresponsive. We often had problems like this before we installed CloudLinux, which prevents the majority of them. The setup we currently have makes it very difficult for a single user or site to crash the server, which is why this is hard to track down: it seems to be something out of the ordinary, and possibly not related to traffic or the network at all.

      • Jon Davies says:

        Thanks, really appreciate the honest response.

        I know there can be a million reasons for a sudden spike in memory usage, particularly a debilitating one.

        Linux is not my strong point but I completely understand the cloud and what it involves along with the segregation of sites.

        Look forward to the root cause being tracked down.

  7. graham says:

    Im off again, right at a crucial moment whats happening now??

  8. Jon Davies says:

    Hi,

    Can you please let us know how a box running so many sites can be taken down by a disk failure??

    The whole concept of RAID allows for an online hot spare, OK if a disk fails then performance would be degraded if we assume RAID 5 but surely the OS is running on a mirrored pair???? The ONLY way you could lose an entire server would be by having no mirrored pair for the OS or a disk controller fail.. even then the sites would be up but the admin CP would be accessible.

    This really isn’t good enough and really annoyed now.

    Regards,

    Jon.

    • krystalstatus says:

      Hello Jon,
      The OS IS in a RAID 1 (mirror) configuration, that’s why it’s back up already.

      The box hasn’t been “taken down” and we haven’t lost “an entire server” nor has there been any other condition that requires such alarming language.

      Furthermore, given that cPanel runs off the same machine (and disks) your statement about the sites being down but cPanel not is incorrect.

      A hard drive failed. We took the machine offline to check it and we’re now bringing it back, as quickly as is safe to do so. I don’t know about you but we’d rather take the extra 10 minutes to reboot the server than lose everyone’s data, so that’s exactly what we did.

      Please also bear in mind that we’re on the same side. Getting aggressive really helps no one; we’re professionals and have configured the servers to survive incidents like this, but things can and do go wrong with computer systems. We’d rather not be having to fix an issue if it were avoidable…

  9. Estimated time to resolution please?

  10. Sam says:

    Hi, any idea when this will be fixed? This has happened quite a few times since joining you guys, not happy……

    • krystalstatus says:

      Hello Sam,
      The server is already back up. This happened twice yesterday and once today, 3 times in total, and a hard drive failure is a nasty but rare occurrence. We’d be very unlucky to have another, so hopefully you’ll be OK now!

  11. Hi, I have seen that the server is back up but I still have no access to my site. Can you give me an indication as to how long it will take to resolve the issue and everything is back to normal?

    • Sam says:

      Sam here still cant get on my sites …..

    • krystalstatus says:

      Hello Richard, Sam,
      The server’s rebuilding the entire data set from 1 hard drive in addition to doing all the normal operations, so it’ll be a little overworked until the RAID array is rebuilt, but it’s ahead of schedule and already 20% done. Please bear with us. Tomorrow it should be business as usual.

  12. Chris Mosler says:

    I am trying to be patient as my sites disappear and reappear at will. What I am finding really distressing, however, is the holding page when it isn’t working. I really don’t need my readers and our clients seeing ‘Bad Gateway’ when they click on our sites…incredibly unprofessional and potentially very damaging.

    • krystalstatus says:

      Hello Christine,
      The alternative to the online rebuild we’re doing now is to take the machine offline entirely for 5-6 hours.

      We believe having the occasional blip is better; hopefully you agree!

      The rebuild is at 80% so it’ll be done soon, ahead of schedule.

      We’ll look into the 502 message, thanks for the feedback.

  13. graham says:

    im fixed brill hopefully this is the last of it.
