Outage Report

Yesterday Sparkle was unavailable for 1 hour 13 minutes. This is what happened.

16.04.13

Yesterday Sparkle was unavailable for 1 hour 13 minutes, its longest downtime since 28 June 2009. I am very sorry about this and the disruption it caused our customers. I know it was annoying and I do apologise.

On the bright side no data was lost, and should the problem ever happen again Sparkle ought to be restored in under 5 minutes.

So what happened? The analysis is organised in three sections:

  • Prevention
  • Detection
  • Response

Prevention

To prevent this happening again we need to know what happened this time. After investigating thoroughly the answer is...the server froze. It is not clear why.

I examined all the relevant logs on the server and found nothing: no error messages, in fact no activity at all. Leading up to the outage everything was normal. There are no clues anywhere as to what went wrong.

Unfortunately this happens to servers from time to time. They are complex beasts and every once in a while they fail inexplicably. This seems to be one of those occasions.

Although there isn't a specific problem to avoid in future, we could guard against any recurrence by having a standby server to switch to. As ever, it's a matter of priorities.

Detection

The detection side of things worked very well: I was alerted within 1 minute that there was a problem. This was ideal.

Response

This was the first downtime since moving to a new server last November. It turns out that the new hosting provider's procedure for raising support tickets is rather different from the previous one's. From learning of the problem to rebooting the server took 47 minutes.

Once the server rebooted, the database failed to start. It took a further 26 minutes to find and fix that problem: the database had balked at a (valid) configuration change.

Once the database was up, Sparkle was back open for business.

All in all, despite best efforts, this was a disappointing response. The good news is that I now know the procedure for emergency-rebooting the server, and the database is happy. If a similar problem occurs in future, Sparkle should be available again within a few minutes.

Finally

The best place for up-to-the-minute news is @sparklehq on Twitter. You can also inspect Sparkle's status page. And you can always email or phone (+49 7763 927 3407).

I appreciate everyone's patience and I apologise again for the inconvenience.
