[Resolved] Olympus Storage Degraded

Around 11am this morning we were alerted to a broken HDD in Olympus on the /home partition.

The /home partition is on a RAID-10 array. Engineers have replaced the failed HDD with a spare we had on site.

Unfortunately the on-site spare isn’t functioning correctly, and we’ve ordered a new drive from Dell.

The replacement will not arrive until Monday 4th August, so until then the array is running degraded; a full array failure is possible if the other drive in the same mirrored pair as the failed drive also dies.
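For anyone wondering what "degraded" means in practice here: in RAID-10 each drive is mirrored, so data stays available from the surviving mirror, but that pair has no redundancy until the replacement is rebuilt. Below is a minimal monitoring sketch, assuming a Linux software-RAID (mdadm) layout purely for illustration – Olympus may well use a hardware RAID controller, in which case the vendor's CLI replaces this check:

```python
# Minimal degraded-array check for Linux md RAID (illustrative assumption only).
import re
import sys

def degraded_arrays(mdstat_path: str = "/proc/mdstat") -> list[str]:
    """Return the names of md arrays whose member status shows a missing/failed disk."""
    with open(mdstat_path) as fh:
        text = fh.read()
    degraded = []
    # Each array block contains a status like "[4/3] [UU_U]"; any "_" means
    # a member is missing or failed.
    for name, status in re.findall(r"^(md\d+).*?\[(?:\d+/\d+)\] \[([U_]+)\]",
                                   text, flags=re.S | re.M):
        if "_" in status:
            degraded.append(name)
    return degraded

if __name__ == "__main__":
    bad = degraded_arrays()
    if bad:
        print(f"DEGRADED: {', '.join(bad)}", file=sys.stderr)
        sys.exit(1)
    print("all arrays healthy")
```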

 

Monday August 4th – 12:15 – The new HDD is installed and the RAID array is currently rebuilding.
Tuesday August 5th – 09:45 – The storage array is now fully functional again.

StriKe user interface failure

10:00 A problem has been reported with the web interface to the StriKe email filtering system. This issue has now been escalated to the developers, as we have not been able to fix the problem immediately. The routing and delivery of inbound messages have NOT been affected. However, we do apologise for users’ inability to access the StriKe control panel, and would like to assure you that we are working on obtaining a fix as soon as possible.

14:30 The issue has been fixed, and users are now able to log into the StriKe filtering service again. We apologise for the inconvenience this caused. The problem was due to a timezone conflict in an update schedule, which was preventing essential updates from taking place. During the time that the StriKe user interface was unavailable, mail delivery was unaffected.
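For those interested in the failure mode: we don't have visibility into StriKe's scheduler internals, but the general class of bug is easy to illustrate. The hypothetical sketch below shows how a schedule stored as local wall-clock time, compared against a UTC clock, can quietly keep a job "not due yet" – the names and the UTC offset are illustrative assumptions, not StriKe's actual code:

```python
# Illustration of a timezone mismatch in an update schedule (hypothetical code).
from datetime import datetime, timedelta

LOCAL_UTC_OFFSET = timedelta(hours=1)  # e.g. BST vs UTC (assumed offset)

def next_run_local() -> datetime:
    """Next update slot, recorded as a naive *local* wall-clock time."""
    return datetime(2014, 7, 22, 10, 0)  # 10:00 local

def is_due_buggy(now_utc: datetime) -> bool:
    # BUG: a naive local timestamp is compared directly against a UTC clock,
    # so the job appears "not due" for the length of the offset (or forever
    # if the schedule keeps rolling forward).
    return now_utc >= next_run_local()

def is_due_fixed(now_utc: datetime) -> bool:
    # FIX: convert the local schedule to UTC before comparing.
    scheduled_utc = next_run_local() - LOCAL_UTC_OFFSET
    return now_utc >= scheduled_utc

now = datetime(2014, 7, 22, 9, 30)  # 09:30 UTC == 10:30 local, so the update is overdue
print(is_due_buggy(now))   # False – the update is silently skipped
print(is_due_fixed(now))   # True  – the update runs as intended
```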


Network Latency

21:36 – We are currently investigating what appears to be latency on our network. At the moment only some shared servers appear to be affected; we will provide more information as soon as our network engineers give us an update.

21:42 – The problem has been identified with a network switch; a reboot of the device is scheduled for 21:46.

21:48 – The switch has been restarted and the networking fabric restored; all access issues should now be resolved. However, should you continue to have access problems, please raise a support request at https://support.krystal.co.uk
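As an aside, the first check when latency reports come in is usually timing connections to a sample of the affected servers. A rough sketch of that kind of probe – the host names are placeholders, not our real monitoring targets:

```python
# Quick-and-dirty latency probe using TCP connect time as a proxy (placeholder hosts).
import socket
import time

HOSTS = ["shared-1.example.uksrv.co.uk", "shared-2.example.uksrv.co.uk"]  # placeholders
PORT = 80
THRESHOLD_MS = 150

def tcp_connect_ms(host: str, port: int = PORT, timeout: float = 2.0) -> float:
    """Time a TCP handshake to the host in milliseconds."""
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000

for host in HOSTS:
    try:
        ms = tcp_connect_ms(host)
        flag = "SLOW" if ms > THRESHOLD_MS else "ok"
        print(f"{host}: {ms:.1f} ms ({flag})")
    except OSError as exc:
        print(f"{host}: unreachable ({exc})")
```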

Scheduled Poseidon Move – 18/07/2014

Today is the scheduled migration of Poseidon to a new server.

We have already provisioned the new server and prepared it to become the new Poseidon.

10:00 – Data transfer has started between the servers to prepare for the account transfer (a rough sketch of the sync pass is below).
12:00 – Data transfer is progressing much quicker than anticipated and we may complete the move earlier than planned.
13:00 – We’ve moved over to the new server now, much earlier than anticipated – we still need to reboot the server and move it physically.
14:39 – The server move is complete – please let us know if you have any problems.
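For the curious, the data transfer mentioned in the 10:00 update was essentially a pre-seeding sync followed by a final catch-up copy at cut-over. A rough sketch of the general shape, using rsync over SSH with placeholder host names and paths (not the exact commands we ran):

```python
# Illustrative only: placeholder host name and paths, not the actual migration commands.
import subprocess

SOURCE = "/home/"                                  # data on the old Poseidon
DEST = "root@new-poseidon.example.com:/home/"      # placeholder destination

def sync_pass() -> None:
    """One rsync pass preserving permissions, ownership, timestamps and hard links."""
    subprocess.run(["rsync", "-aH", "--delete", SOURCE, DEST], check=True)

# Pass 1: bulk copy while the old server is still serving traffic (slow).
sync_pass()
# Pass 2 at cut-over, with accounts paused: only recently changed files remain to copy.
sync_pass()
```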

Network Maintenance – Wednesday 16th July 06:00 (Completed)

Dear All,

The network upgrade that was postponed from Friday has been rescheduled for this coming Wednesday morning, starting at 06:00.

We have made use of the extra time to further improve our preparedness, and have taken on board feedback about the timing of maintenance windows. Hopefully an early mid-week slot will affect fewer people in the event of any disruption.

We will update this blog post as the work takes place.

Thank you for your patience during this essential network maintenance.

 

06:00 We are starting this work now.

06:30 This work has been completed successfully. Thank you for your patience.

Network Maintenance 11th July 18:00 (Postponed)

Dear All,

We will be undertaking critical core network maintenance tomorrow evening commencing after 18:00.

This has the potential to affect all servers/services, and while we hope there will be no noticeable impact, there could be a loss of connectivity of up to 2 minutes.

We thank you for your patience and understanding during this maintenance window.

20:55 We have started this work.

21:36 We’re experiencing intermittent issues that we believe are caused by a software bug in the switch stack. We’re rebooting it now.

21:50 Network configuration has been reverted to the previous setup for the moment, all services should now be fully restored.

——-

23:08 – There appears to be some network latency/packet loss which is unrelated to the earlier network operations; we are currently investigating the cause.

23:27 – The network has returned to optimal performance; we apologise again for any inconvenience this may have caused.

Valhalla MySQL issues (Resolved)

10:20 – A cPanel update has knocked out MySQL on Valhalla. We are working to get this resolved and will post updates here as and when we have more information.

Only services using MySQL are affected – mail (POP & IMAP), static content such as HTML, and dynamic content such as PHP are all working as normal.

11:39 – We have discovered that the update has caused data corruption on a few databases. We are investigating options and will provide more information shortly.

12:00 – We have decided for expediency to restore all data from the snapshot we took this morning. Due to the size of the data it may take us a number of hours to fully restore all databases to a consistent state. We will update this post again once we have more information.

14:31 – Data has been successfully copied from the ArK snapshot taken this morning. However, it may still be possible to recover the current data in place, which we would much rather do. Thank you for your continued patience.

14:32 – MySQL has been successfully started and all databases are in a consistent state. No data was lost.
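The "consistent state" above is the sort of thing a table-check pass verifies. A minimal sketch of such a check, assuming the standard mysqlcheck client is available on the server (the exact procedure followed on Valhalla may have differed):

```python
# Sketch of a post-incident consistency check using mysqlcheck (assumed available).
import subprocess
import sys

def check_all_databases() -> int:
    """Run mysqlcheck across every database and report anything that is not OK."""
    result = subprocess.run(
        ["mysqlcheck", "--all-databases", "--check"],
        capture_output=True,
        text=True,
    )
    for line in result.stdout.splitlines():
        # mysqlcheck prints one "db.table   OK" (or error/warning) line per table.
        if line.strip() and not line.rstrip().endswith("OK"):
            print(f"needs attention: {line.strip()}", file=sys.stderr)
    return result.returncode

if __name__ == "__main__":
    sys.exit(check_all_databases())
```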

Thank you for your patience during this incident.

 

Network Latency (Resolved)

16:07 – There appears to be some network latency, which we are investigating – we will update this post when we know more.

16:20 – Connectivity has returned to normal; we are currently investigating the cause of the latency.

Update: It appears that a distributed denial of service (DDoS) attack was aimed at one of our shared hosting servers; this in turn was causing some latency on the network.

MySQL issue: ceres.uksrv.co.uk

9:35 Instability of the MySQL service has forced a restart. We are attempting to bring the service down gracefully.

9:45 MySQL and Apache were halted for a few minutes so we could restart under controlled conditions. The services are now available again.
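The controlled restart above boils down to stopping the web tier first so nothing new hits the database, then cycling MySQL and bringing everything back in reverse order. A rough sketch, assuming SysV-style service wrappers – the actual service names and init system on ceres may differ:

```python
# Sketch of a controlled MySQL restart; "httpd"/"mysql" service names are assumptions.
import subprocess

def run(cmd: list[str]) -> None:
    """Run a command, echoing it first, and stop if it fails."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Stop the web tier first so no new PHP requests pile up on MySQL,
# then stop MySQL itself, then bring both back in reverse order.
run(["service", "httpd", "stop"])
run(["service", "mysql", "stop"])
run(["service", "mysql", "start"])
run(["service", "httpd", "start"])
```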

Poseidon Unplanned Outage (Resolved)

16:15 – We have been alerted to a possible issue with Poseidon. It is not responding to ping; engineers are investigating.

16:20 – Poseidon has been sent for a reboot; we are awaiting its return.

16:29 – Poseidon has restarted, all services should now start to operate as expected.
