Post mortem for the Jan. 8th outage

Reason for the outage

An executive summary can be found at the bottom of this post.

At around 18:00 on January 8th, our central office in Cyprus (housing all of our Cypriot servers) was hit by lightning. The thunder was reportedly heard up to 20km away, and everyone who heard it described it as the loudest thunder they had ever heard. Seeing flashes of lightning jumping through a computer room is one of the scariest things you will ever see :-). One of our technicians was in the affected computer room at the exact moment of the strike. He reported seeing the lightning jump from a cable onto the ceiling, leaving a small chunk of the ceiling missing (more on this later).

The lightning travelled through unprotected Ethernet cables, the building's electrical distribution and (surprisingly) the iron reinforcement in the building's concrete.

Affected systems

As a result of the (almost) direct lightning strike, a number of systems were affected:
  • One of our core routers was damaged beyond repair.
  • We lost 25% of our servers. A small number of these servers were later brought back online using spare parts we keep on hand.
  • An Internet uplink's "modem" was damaged.
  • The central panel for the first floor was damaged.
  • Electrical distribution was partly damaged on the 2nd floor.
  • Two WiFi access points were damaged.
  • A core network switch was partly damaged (two ports fried).
  • An auxiliary network switch (not a core switch, there are 2 of those ;-)) was completely damaged.
  • Concrete was damaged (more like blown to pieces) at three different locations where the lightning used the iron reinforcement as a path to ground.
  • The "false ceiling" (used for cable runs) in the computer room was damaged. Smoking pieces flew as far as 3m outside the computer room (the door was open at the time, since a technician was inside).
  • Two APC surge protectors were damaged (special thanks to APC for shipping us replacement units through DHL Express).
  • One electrical metering unit was damaged.
  • A 60A fuse in front of the metering unit was MELTED (that's a minimum of 14.4kW of power).
  • Just for fun: An electrical plug's cap was blown 1 meter away from the wall.
  • All of the street lights on our road were blown out.
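
For the curious, the 14.4kW figure quoted for the melted fuse can be reproduced with a quick back-of-the-envelope calculation. This is only a rough sketch: it assumes a 240V single-phase supply, which is not stated anywhere in this post, and the names below are made up for illustration.

```python
# Rough check of the power needed to reach a 60A fuse's rating.
# ASSUMPTION: 240 V single-phase supply (not stated in the post).
FUSE_RATING_A = 60
SUPPLY_VOLTAGE_V = 240  # assumed value


def min_power_kw(current_a, voltage_v):
    """Return power in kilowatts using P = V * I."""
    return current_a * voltage_v / 1000


print(min_power_kw(FUSE_RATING_A, SUPPLY_VOLTAGE_V))  # 14.4
```

This is a lower bound only: a fuse that actually melts has carried more than its rated current, so the real figure was almost certainly higher.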

Steps we took to rectify the outage

  • Immediately following the strike, we rushed to check that all electrical protection (UPS/surge protectors/building's ground) was still functional. We kept the remaining infrastructure running on UPS power until we were sure that we could safely fail over to the generators.
  • Since the building's electrical distribution was partly damaged, we had to get a little creative. We failed over to a separate electrical distribution circuit just so that we could route power from the generators to the computer room. This took longer than expected, because that circuit was designed only to deliver alternative power from EAC (the Electricity Authority of Cyprus). As a result, a number of internal systems were shut down to preserve UPS power until we got the re-routing sorted.
  • As soon as the generator power was safely re-routed, we began bringing the downed systems back online. At this point we noticed that an Internet uplink was damaged (yes, we did receive SMS alerts that the servers were down, but obviously electrical distribution > Internet). We immediately contacted our upstream provider, and they implemented a temporary workaround for the damaged "modem" within 1 hour.
  • After an extensive check (so that the building didn't catch fire in the middle of the night), we decided that all electrical work to restore the building's electrical distribution system should be performed the next day.
  • The next day we had an electrician come out to re-run the checks and carry out repairs. The repairs took a total of 3 days (until Monday morning).
  • While the electrician was working, we were troubleshooting all the servers we couldn't bring back online the previous night. A small number (2) were brought back online after we replaced damaged parts with ones from our spares bin :-).
  • Our upstream provider came round early the next morning and replaced the Internet uplink's "modem".
  • EAC's technicians came out to carry out checks and repairs on their side of the electrical distribution.
  • The damaged concrete sections are scheduled for repair. They are not part of the building's supporting structure.
  • The false ceiling is scheduled for repair. We checked that none of the cables passing through it were damaged.

Summary

TL;DR: Lightning>zap>poof>boom>sorted out what we could