It’s been a rough stretch in the cloud. S3 / AWS was down for the better part of an entire day, taking a huge chunk of the Internet with it. This shouldn’t happen, folks. Cloudflare, which carries a significant part of the Internet’s traffic, leaked passwords and other PII. GitLab deleted and lost source code.
These are terrible. Horrible. Unacceptable.
And yet, the internet isn’t there yet. We’re not 100.00000% anywhere.
And if this can happen to Amazon and to Cloudflare, then it will happen to you. With 100% certainty. Multiple times.
You will have a terrible outage. You may lose customer data. You will have security issues. Period.
You’ll probably act wrong the first time. You may hide. You may ignore it, when you are small. You may blame a vendor. You may do some crummy, it-sort-of-wasn’t-me “root cause analysis” a few days later. You may claim a “partial outage” has “impacted some customers” or “some data was lost”, when, really, it was a total disaster that basically impacted everyone.
Let me share some simple learnings.
If your customers believe in you, if they trust you … and they probably do, or else why would they use some tiny vendor they’ve never heard of … you get One Pass. You can screw one thing up badly. If you acknowledge it, if you are honest, direct, simple and most of all responsive, they’ll give you a pass. They want to. They bet on you, after all.
The second time, trust is broken. If you hide a second breach (see, e.g., Yahoo!). If you go down a second critical time. If you lose critical data. Your existing customer base will no longer trust you. Watch your NPS drop to 0, at least among a subset of your customers. It will.
But they don’t leave. Because as easy as we claim it is to switch vendors, it’s never easy to change a business process. Businesses make multi-year commitments to vendors, either literally (in the enterprise and bigger deals) or at least conceptually. Yes, S3 went down. But that doesn’t mean I’m going to switch to Azure. Not yet.
The second time, you’re no longer the kid I’m rooting for. But I’ve committed. So I stick with the vendor.
If nothing else happens for a year or so, trust will slowly be rebuilt. But it will take a year.
But the third time we’re down for a day. The third time you lose my data. The third time I can’t trust you.
I may stay a while — as a Prisoner. But mentally — I’m gone. You don’t see churn the instant the Third Instance happens. It often takes a full year, a renewal cycle. And sometimes, if you are too close to the renewal, it may even take longer than a year.
But the third time, you’ve lost them forever. They’re already making plans to leave you. Even if the plan may take a while to implement.
Segment your NPS and CSAT so at least you have the honest data. So you know. So you aren’t flying blind.
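Here is a rough sketch of what segmenting NPS can look like in practice. It assumes the standard NPS definition (percent of promoters scoring 9-10 minus percent of detractors scoring 0-6). The segment labels (`no_incident`, `hit_twice`) and the survey scores are made up for illustration; in your own data the segment might be "accounts affected by the last outage" or whatever cut matters to you.

```python
from collections import defaultdict


def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("need at least one score")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)


def nps_by_segment(responses):
    """Group (segment, score) pairs and compute NPS for each segment."""
    buckets = defaultdict(list)
    for segment, score in responses:
        buckets[segment].append(score)
    return {segment: nps(scores) for segment, scores in buckets.items()}


# Hypothetical survey responses: (segment label, 0-10 score).
responses = [
    ("no_incident", 10), ("no_incident", 9), ("no_incident", 8), ("no_incident", 9),
    ("hit_twice", 9), ("hit_twice", 6), ("hit_twice", 3), ("hit_twice", 7),
]

overall = nps([score for _, score in responses])
by_segment = nps_by_segment(responses)

print("overall:", overall)
for segment, value in by_segment.items():
    print(segment, value)
```

With this made-up data the overall number looks fine while the customers who were hit twice are net detractors. That gap is exactly what a blended NPS hides.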
My general recommendations:
- Get a trust.yourapp.com site up ASAP, and make it transparent and real-time, if you don’t do this already.
- Do root-cause analyses quickly, publish them, and importantly — make them (x) honest and (y) succinct. A rambling answer is a sign of someone hiding something. Taking half-responsibility doesn’t work either. Take the blame. Be honest. Be brief. And explain what you are doing so it won’t happen again. It’s your app. So it’s your fault.
- Take a Time-Out once a quarter to talk about all the devops, secops, scaling limits, and other issues you are facing and may face. Talk about whether you are taking the right risks. You can’t solve everything overnight. But if you don’t talk about and force-rank your issues at least once a quarter … they will never be high enough on anyone’s list.
- At least after The Second Time — Make a Change. Bring in a new VP, a new director. Change the way you build software. Whatever it is, by the second time, you have to make a change. Because otherwise — there certainly will be a Third Time. And then — goodnight.