Did you happen to notice that an AWS outage took down Pinterest, Airbnb, Foursquare and a ton of other web leaders the other day?
The root cause analysis: a memory leak. That led to cascading failures. At that point it doesn’t matter how load balanced you are, it’s all going down, because you’ll never keep up with “normal request handling processes” as more and more requests go to fewer and fewer servers.
I mean, I get the root cause — but can’t we have all this solved by now?
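To make the cascade concrete, here is a rough sketch of the "more requests, fewer servers" dynamic. It is a toy model, not a reconstruction of the actual AWS incident: the function name, the even load balancing, and the rule that one overloaded server drops out per round are all assumptions made for illustration.

```python
def simulate_cascade(servers, total_load, capacity_per_server, initial_failures):
    """Toy model: return the count of healthy servers after each round.

    Assumptions (hypothetical, for illustration only):
    - load is spread evenly across whatever servers are still healthy
    - total load stays constant while servers drop out
    - if the per-server load exceeds capacity, one more server fails per round
    """
    healthy = max(0, servers - initial_failures)
    history = [healthy]
    while healthy > 0:
        load_per_server = total_load / healthy
        if load_per_server <= capacity_per_server:
            break  # the survivors can absorb the load, so the cascade stops
        healthy -= 1  # an overloaded server falls over, pushing load onto the rest
        history.append(healthy)
    return history


# 10 servers, each good for 100 units of load, serving 800 units in total.
print(simulate_cascade(10, 800, 100, initial_failures=1))  # fleet stabilizes
print(simulate_cascade(10, 800, 100, initial_failures=3))  # fleet drains to zero
```

In this model, losing one server is survivable because the other nine still have headroom. Losing three pushes every survivor past its limit, and from there each failure makes the next one more likely until nothing is left. Load balancing only spreads the pain evenly; it does not add capacity.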
I think the single biggest waste in SaaS in particular is the huge investment in TechOps. Not only do customers not appreciate it or see much value in it, but the people, engineering, and soft costs are very high.
To make TechOps truly work in SaaS today you need:
- At least two SSAE-16 Data Centers or similar, with full real-time replication working across them
- A techops team that can scale 24x7x365 worldwide (sorry, 2 guys won’t cut it if they need to sleep)
- The ability to respond in real-time not just to simple hardware issues, but to “software” problems such as memory leaks (see AWS above) and others that can also bring down your site
- An endless number of automated monitors, many you’ll have to build yourself (update: fortunately, now we have everything from New Relic to PagerDuty).
And so much more.
But it gets worse in SaaS because nothing really works perfectly in these multitenant, single-database environments. And because if you go down in SaaS even for a few minutes — you really let your customers down.
It is hard. E.g., that second data center? Is it really a full, real-time logical replication of your primary data center? I.e., does it have the same 100 servers, all running in real-time, so you can fail-over at any moment? If it’s fewer servers, is it really going to work in the real world? If it’s virtual, can you really spool up all those servers in 5 minutes or less? I highly doubt it.
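A back-of-the-envelope check shows why the smaller or virtual second data center is risky. This is only a sketch: the function names, the peak traffic figure, the per-server throughput, and the boot times are hypothetical numbers chosen to match the 100-server example, not measurements from any real system.

```python
import math


def standby_shortfall(peak_rps, rps_per_server, standby_servers):
    """Servers still missing if all peak traffic fails over to the standby site."""
    needed = math.ceil(peak_rps / rps_per_server)
    return max(0, needed - standby_servers)


def minutes_to_close_gap(shortfall, boot_minutes, parallel_boots):
    """Time to boot the missing servers, assuming they start in fixed-size batches."""
    if shortfall == 0:
        return 0
    batches = math.ceil(shortfall / parallel_boots)
    return batches * boot_minutes


# Hypothetical: the primary site runs 100 servers at 50 requests/sec each at peak.
peak_rps = 100 * 50

# Hypothetical: the standby site keeps only 60 servers warm.
missing = standby_shortfall(peak_rps, rps_per_server=50, standby_servers=60)

# Hypothetical: virtual servers take 3 minutes to boot, 10 at a time.
wait = minutes_to_close_gap(missing, boot_minutes=3, parallel_boots=10)

print(missing, "servers short;", wait, "minutes to catch up")
```

With these made-up numbers the standby site is 40 servers short and needs 12 minutes to catch up, well past a 5-minute target, and during that window the 60 warm servers are carrying load meant for 100.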
And seriously – if Salesforce’s Sandbox can go down for days at a time — do you really think you can do better? And that’s with a world-class, global techops team.
My point isn’t to get into the details or criticize. I’m not an engineer or even close. But I do understand some of the issues.
And it’s time for them to be solved. This is what Force.com tried to solve, but it was too narrow.
SaaS entrepreneurs shouldn’t need a TechOps team until they hit $20m in revenue. I’m willing to write a piece of the Series A check to whoever can really fully solve this problem so that TechOps becomes a side issue.