On May 27th, WhisperGifts went offline for a scheduled server relocation expected to last 8 hours or less. It was down for 44 hours and 43 minutes, not coming online until we moved it to a new hosting provider on May 29th and restored our backups.
This blog post is intended to document what happened, why we were susceptible to this problem, how we resolved it, and what we will do if a similar situation arises in the future. The image below shows an e-mail from the website monitoring company we use highlighting our downtime throughout May, and it's the most embarrassing email they've ever sent me. I've made it public here in an effort to never see such an e-mail again.
To begin with, I want to offer my sincere apologies to all WhisperGifts users. While we look pretty silly for having our website offline, you've got wedding guests trying to buy you gifts from a website that, at the time, wasn't working. This means you bore the brunt of downtime complaints, for a problem you had no control over.
I'm incredibly sorry we put you in a position where your registry was down and your guests couldn't see your gift lists.
During the outage I provided e-mail updates to customers, including the offer of free upgrades or refunds. This offer still stands; please just respond to my e-mail from May 28th.
Was any data lost?
No, we did not lose any data at all. I take meticulous backups, so I was able to restore those backups onto the new server. My friends and family are sick of me nagging them about taking proper backups, but in this case it's saved my bacon (and yours).
I'm not happy
Neither am I. I'm rather annoyed that I've had to make you unhappy, too. Please send an email to me directly, at firstname.lastname@example.org, and I'll make things right.
Will this happen again?
It shouldn't, but I'd be naive to rule it out. However, if we come across a similar problem with physical server hardware, we'll be better placed to move our virtual servers to a new physical server within the Rackspace datacenter. This means we don't need to restore full backups or re-install our software, as we can just 'redeploy' to one of the many Rackspace servers.
We're working hard on new functionality at WhisperGifts. The ability to order printed gift cards is coming very soon, as are custom registry layouts for our premium members. Custom layouts are one of the most-requested features, and we can't wait to share them with you.
Stay with us!
If you've got any further questions or concerns, please don't hesitate to get in touch.
So, what happened?
- 1pm Monday 27th May: WhisperGifts goes offline, for a scheduled outage due to a data centre move by our hosting company. We had alerted users via Facebook & Twitter.
- 7pm Monday 27th May: Server due back online; still offline
- 8pm Monday 27th May: We learn of issues with the move. Outlook is good, and we expect to be online by morning. First post to Facebook notifying of the extension of the outage.
- 8pm Tuesday 28th May: Still offline, I decide to relocate WhisperGifts away from our hosting company. I spin up a new virtual server with Rackspace, and begin to restore the backup.
- Morning of Wednesday, 29th May: WhisperGifts site mostly back online. It took a while as I noticed a few issues with my deployment process, and corrected them. I let things run through the day to make sure there were no errors, then monitored the website incredibly closely late into the night. Everything looked OK.
- 7:45am Thursday, 30th May: I publicly announce that WhisperGifts is again online. Everything has run perfectly since.
In the name of transparency, the rest of this gets pretty technical. Feel free to tune out if that isn't your cup of tea!
I want more detail. What actually went wrong?
We utilise virtual servers to host WhisperGifts. This means our hosting provider has a physical server, upon which multiple websites run.
To improve reliability, our hosting company decided to change the physical location of their servers, including the server that contained the WhisperGifts website and our users' registries. This move was scheduled with us, and was due to happen one afternoon (Melbourne time). Expected downtime was 8 hours.
The first hint we got that something was wrong was when the 8 hour window was complete, and our server wasn't yet back online. An e-mail arrived soon afterwards from the owner of our host, explaining that they had a problem turning the server back on. In their attempts to troubleshoot the problem, they accidentally wiped the core startup details of that server.
I decided to give them the benefit of the doubt, and gave them time to get WhisperGifts back online.
By the next evening (28th May, at 8pm: 24 hours after we were originally meant to be back online) I had decided to move away from our old host. I configured a new virtual server with Rackspace, and began the process of restoring backups & migrating WhisperGifts to the Rackspace infrastructure.
This restore took some time, and it highlighted a few issues in my fresh-server deployment process. I've updated the process so that I can now bring a new server online within an hour, if necessary.
We were back online on the morning of May 29th, after being down for two days. I left the website running overnight, watching for any problems. By morning I was satisfied that everything was working correctly, so I published a Facebook message at 7:45am notifying the public that we were back in business.
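For anyone who wants to check the numbers, the 44 hours and 43 minutes quoted at the top can be counted forward from the 1pm start of the outage. This is only a quick sketch: the post gives no year, so 2013 is assumed here purely because 27th May falls on a Monday in that year.

```python
from datetime import datetime, timedelta

# Assumed year: 2013 (27th May is a Monday in 2013); the post itself gives no year.
went_offline = datetime(2013, 5, 27, 13, 0)   # 1pm Monday 27th May
downtime = timedelta(hours=44, minutes=43)    # total outage quoted in the post
planned = timedelta(hours=8)                  # expected length of the scheduled move

back_online = went_offline + downtime         # when the outage ended
overrun = downtime - planned                  # time beyond the planned window

print("Back online:", back_online)
print("Overrun beyond the planned window:", overrun)
```

The result falls on the morning of Wednesday 29th May, which matches the timeline, and shows the outage ran more than 36 hours past the planned 8-hour window.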
How did we let this happen?
The small hosting company we used was considered "self service". For the most part this isn't a problem, as I've got the skills necessary to keep the virtual server online. It also saves us money, as we don't need to pay our hosting company to maintain anything other than the 'bare metal' server. Unfortunately it also means that our hosting package didn't come with many things that other web hosts provide, such as easy online backups within the web host's infrastructure and easy migration of virtual servers between multiple physical servers.
Most of the work of the hosting company was done by a single person, supported by a small team. Whilst their skills were fantastic, the lack of breadth to the team meant that they weren't able to do their best work non-stop when things went wrong. I still have massive respect for their skillset and host other websites there - but only websites that don't reflect poorly on my users when things don't work.
We are now hosted with a best-of-breed cloud host that lets us move virtual servers around easily, so we aren't bound to a single physical piece of computer hardware.
What have you changed?
There are a few key things that have changed.
- We now use a large web host that lets us redeploy servers quickly and easily. We can also bring up duplicates of existing servers for redundancy, if required.
- Our backup process was tested, and it worked - but it wasn't perfect. My deployment script and Python requirements file weren't quite up to scratch, so I've fixed them.
- In the future, I'll assume the worst and move servers (if required) ASAP. A full 24 hours was too long to wait to make this decision.
- We will not wait as long to publish details to Facebook, Twitter, and our customer marketing list in any future outage.
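This post doesn't go into what exactly was wrong with the requirements file. One common way such a file can bite during a fresh-server restore is a dependency with no exact version pinned, so the new server installs something different from the old one. As an illustration only, here is a minimal sketch of a check for that; the function name and the sample package lines are hypothetical and are not taken from the real WhisperGifts deployment.

```python
def unpinned_requirements(text):
    """Return requirement lines that do not pin an exact version with '=='."""
    flagged = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and surrounding whitespace
        if not line or line.startswith("-"):  # skip blank lines and pip options such as -r
            continue
        if "==" not in line:                  # no exact pin on this requirement
            flagged.append(line)
    return flagged


# Hypothetical requirements file contents, for illustration only.
sample = """\
Django==1.5.1
South
# a comment line
requests>=1.2
"""

print(unpinned_requirements(sample))
```

Run against the sample, it flags the two lines without an exact pin. It is deliberately simple and will not handle every kind of requirement line, such as URLs containing a `#` fragment.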
It's certainly been an experience for us to go through this process, and again I'm sorry it happened. However, I'm positive we've come out of this in a stronger position.
Now, back to the wedding preparations!