A little ways back, Fastly had a global outage. Now, this is rough enough for many B2B apps. But for core infrastructure, it’s even worse. Your customers simply go 100% down. 1000s of them. The lights go out. That’s hyper-stressful for the customer. And for you.
If it hasn’t happened to you yet, it will. And it will happen again and again. This is the nature of web services built on top of 1000s of individual servers and dozens of interconnected services. You can do better. You have to do better. But you can’t 100% stop every memory leak, every DNS issue, etc.
The #1 mistake I see is SaaS companies hiding what happened. I see this again and again:
- “A partial outage …” when everyone or almost everyone was impacted.
- “A subset of our users …” when almost no one could log in.
- “Due to a misconfiguration of …” using the passive voice as if there was no mistake or fault.
- No apologies or acknowledgment of impact on customers.
- Hiding a status page.
- Hiding regional outages.
- Only showing outages longer than 15 minutes.
It’s natural. It’s natural to want to hide a bit from bad news, and especially from angry customers.
But it doesn’t build trust. So I just love how Fastly handled it here:
- Get it up ASAP. Fastly published the full post-mortem the same day. This sounds obvious, but I see many folks wait, or just never do one.
- Honest acknowledgment of impact. Fastly acknowledged the outage was “broad and severe” and simply apologized at the opening of the incident report.
- Honest summary of what really happened, including a true, detailed timeline. No pretending it lasted just a few minutes, or only impacted a “subset” of customers. Fastly succinctly summarized the 3 hours it took to fully mitigate the issue and then how long it took to deploy the fix.
- Honest path forward to do better. Customers want to know you are truly doing something to do better next time. This is the weakest part of the report, but at least it’s there.
- Conclusion that takes clear, direct responsibility. “We should have anticipated it.” That builds trust.
This incident report isn’t perfect, but it’s better than 95% of the ones I see. I feel better after reading it. 90% of the time, I actually feel worse after reading an incident report on a vendor’s website. I see them hiding the ball. And experienced buyers will see it, too.
Do this. In fact, just copy Fastly’s template. It can hurt a bit. It can hurt a lot. But customers know there will be issues. Transparency, at least 90% transparency, builds trust. You want your customers for a decade — or longer.
And also note that your competitors will be all over the outage. They’ll tell your customers even if you don’t, to try to steal them back. Without this trust-building, you’ll be in a weaker position when they go after you. A much weaker position.
A deeper dive on Fastly here if of interest: