At around 18:20 we became aware of an issue with Ceres, specifically Apache. Despite normal levels of HTTP activity, the server was experiencing very high load. We are still trying to determine whether this is a hardware fault subtle enough not to have triggered alerts on the racks.
We have a specialist working hard to determine the exact nature of the issue and will update this article as soon as we have more information.
Wednesday 21:30 – This isn’t good news: Ceres is in a very poor state. After spending considerable time tracing the errors in the disk subsystems, we determined that the local primary RAID controller was developing subtle errors, which corrupted the partitions that house the main OS and server software. Ironically, if it had failed in more spectacular fashion, the situation might have been recovered more quickly. Replacing the controller is now pointless given the state of the array data. We are now carrying out a hardware migration and bare-metal restore, which will probably take a number of hours to complete.
Rest assured we will be working on this throughout the night. We will update this post again as we get closer to a more solid fix time.
Wednesday 22:30 – A completely new server has now been configured and installed in situ as a replacement for Ceres. With 24 CPU cores (up from 8) and faster core and bus speeds, 32GB of RAM (double the old Ceres) and 600GB of RAID 1 storage (again twice as much), the new Ceres will be a very powerful platform. Tom is now overseeing the restoration of our standard OS and WHM provision, and of over 200GB of customer data from our CDP backup system.
Thursday 06:20 – 95% of all user data has now been copied to the new hardware. The most recent data is being rsynced from the old Ceres /home partition (which was safe) to keep data loss to a minimum.
Thursday 08:30 – All user data has been restored and the server is open to production. We are still carrying out final tweaks, but we believe the services are now stable.
Thursday 09:00 – We are aware of an issue with subdomains serving their parent domain's content. We hope to have this fixed shortly.
Thursday 11:40 – Subdomains are now functioning correctly. Mail is being converted to Dovecot; this may disrupt your ability to collect mail for the next hour or so while the mailbox conversion completes.