Below is a summary of yesterday’s Amazon Web Services (AWS) outage, which took our InvestorView client portal offline for several hours.
Valued Customer –
We experienced a full outage this morning starting around 10:05am PST. The outage lasted until approximately 2:55pm PST, when full service was restored. At this time the app is fully operational and accessible. I’m writing to share some details about this event, as we know it impacted your operations.
Around 9:40am PST we noticed site performance degradation and increased error rates. At that point we began monitoring the site more closely to understand the scope and impact. At 10:08am we traced this to a significant outage at Amazon Web Services (AWS). Around 10:15am, AWS told its customers that it was experiencing “increased error rates” with its storage service (S3). S3 is not only heavily used by (and a critical component of) Modestspark, but it is also widely used by other applications (Netflix, Slack.com, Trello, Quora, etc.) as vital “plumbing” for Internet services. In addition to the S3 issues, other core Amazon services (CDN, emailer, etc.) were offline, which crippled our ability to communicate the issue to you sooner.
Unfortunately, as we learned today, the built-in redundancies that we thought were available from AWS did NOT work as advertised, and their outage caused our site to become unavailable. To top things off, Amazon did a horrible job of communicating this outage to its customers. This kept us in the dark for longer than we would have liked, and we’re keeping our fingers crossed that Amazon does a better job of communicating outages and issues like this in the future so we can act accordingly.
As you’ll see below, our biggest takeaway from this incident is that we need to rely less on 3rd party providers like Amazon for our communication services. Not being able to get the message out was frustrating for us, as I’m sure it was for you not knowing what was going on!
We’ve consistently experienced 99.98% (or greater) uptime over the past 4 years, and when an incident like this happens, we take it seriously. We’re going to learn from it, do better, and take steps to ensure that we don’t have a situation like this in the future.
What we plan to do:
- Develop a status page that does NOT rely on AWS in any way. We have one built, but unfortunately we discovered it was tightly coupled to certain S3 assets.
- Develop a mass emailer (for Admin notification) that does NOT rely on Amazon S3 or AWS email services. We have one built, but discovered it also depends on AWS services (S3, SES).
- Use Twitter to communicate outages (@modestspark). Most of you have never checked our Twitter feed, but I’d recommend following @modestspark for this reason. We’ll use it as a channel to communicate events like this in the future, should they occur.
We take outages like this seriously, as we know that you’ve put your confidence in us to deliver a great Client Experience that needs to be available when your Clients want access. You have our commitment that, while we can’t eliminate events like this, we will work to minimize any recurrence in the future.
Again, our apologies for this disruption in service.