When it comes to seamlessly scaling your applications, a top-notch engineering team will be your foundation. Next come the decisions to build or buy your infrastructure, DNS, monitoring, and analytics tools. Julien Lemoine, Co-Founder and CTO of Algolia, shares his lessons learned on how to stay focused and innovative as you scale while avoiding the pitfall of innovation for innovation’s sake.
Julien Lemoine | Co-founder and CTO @Algolia
FULL TRANSCRIPT BELOW
So I’m Julien, Co-founder and CTO of Algolia. We develop a search API to help any developer have a very good search in their application. But today I won’t talk about search at all. I will discuss build versus buy decisions and I will try to cover a few mistakes we made about those decisions. I actually think I have never met any company that has not made a few mistakes about build versus buy, and most of the time it’s linked to the long-term cost: the long-term estimation of what it means to build or to buy your solution. So I will first give you a few numbers about where we are. And when I give you a few examples, I will try to explain where we were at the stage when we took the decision. Because of course, the stage is important in such decisions.
So today we are a distributed team. We founded the company in 2012 in Paris. We then flipped the company to the US. We are today around 350 people in six different offices. We have mainly a bare-metal infrastructure. We have close to 3,000 servers distributed in more than 70 data centers and we are between 200 and 300 billion API calls a month. All of that was not built in a day. It took us a few years and I will give the context for all the examples. Of course, behind this product, we have a super strong team. None of that would be possible without a very strong engineering and product team.
One of the very counterintuitive ideas for me is that the better your team is, the more difficult it is to take the right build versus buy decision. I’m sure this is easy to misunderstand. You probably won’t see it at first. We all see the positive effects of having a great engineering team. Of course, good engineers can build a product way faster than others. Of course, it has a big impact, a key impact for your company. But we don’t discuss enough the negative aspects of a very strong engineering team.
First of all, I think any good engineer can easily give you 10 good reasons to build instead of buy. It’s super easy, and we never find enough arguments to buy. You can always have a very specific problem. You can always find arguments that you need a custom solution, that you need to build it, pretty much. And what I have seen is that most of the time you can discuss, you can argue. They will come with a prototype in two hours or the day after and they will tend to show you that it’s easy to build. And they will always underestimate the long-term costs, the long-term impact of building versus buying. So I will try to discuss this with a few concrete examples and a few concrete mistakes. Some were positive, with good effects, and some were negative for us and we had to change. So I will cover them to give you a bit of insight about when we built and when we decided to buy.
The first big decision we took was in the very early days of the company, when we were only two. We had not even fundraised. We had no salaries, so very, very early days of the company. We decided to use bare-metal infrastructure and not use cloud infrastructure, VMs, which was kind of completely crazy at that moment. And a lot of people misunderstood us when we took this decision. It was of course risky and it was a big bet.
The reason we took this decision is our bet on the market. We had been working in this field for more than 10 years before creating the company. We were convinced people wanted to have good performance. People were investing a lot of time to have good performance, but they were not reaching good enough results. So our initial bet was that performance could make a big difference. We wanted to have a factor 10 on performance, which means we had to be crazy about performance on all the different aspects. Of course, the software: we needed to build something which is very different in terms of performance. But the hardware, even the network, play a big role in the performance. Search engines are very intensive in terms of CPU versus memory. So we had to pretty much select a very specific machine, very specific hardware, to make sure we had the best performance on the market. None of that was available on cloud infrastructure at that time.
So that’s why we took this crazy decision to use bare-metal, and pretty much we had to build everything from scratch. For example, we replicate the data by a factor of three to make sure we have high availability. You can do that on a cloud provider with availability zones. You have a lot of features out of the box to do that. We had to build everything from scratch in software to be able to have high availability. And of course, at the beginning we did not have the time to do that. So the first version of the software had zero high availability at all. It was one server running on one provider, and it was on one side kind of risky because one big hardware failure could cause us a lot of trouble. But it was the easiest way to get the feedback we needed to have. We had this big bet, and we needed to get some feedback from the market about this bet. Is it realistic? Does performance make as big an impact as we had planned? Or is it difficult to sell the software based only on a big difference in performance?
So to imagine the product at that moment, the feature set of the product was super limited. We didn’t have even 1% of the features we have today, but we had a big difference in terms of performance compared to everything else on the market. And the reaction we got was super positive; pretty much it confirmed our initial bet. So this decision was super risky. It was at a moment where it could have hurt the company: a big failure, a big hardware failure, could have hurt us in terms of trust. But we were in the initial days and it has been positive. We validated our assumption. One thing, though: even if we decided to use bare-metal infrastructure, even if we decided to do everything ourselves, we did try to lower our investment as much as possible. So of course we used leasing and not co-location. We did not buy the hardware ourselves, have someone build the machines, and so on. We took some providers to do that for us. We use Leaseweb or the Edge, those providers that help us to do that.
Long story short, today we have infrastructure on the cloud, and it took us some time to have the right infrastructure, the right hardware in the cloud to run our API. Today, there is way more variety of hardware on cloud providers than was the case in 2012. So this one was a good decision. That’s the only positive example I will share with you, but sometimes a build versus buy decision can be positive. It’s not about always selecting one; it’s something that can be your big differentiation. It can be your factor 10, like performance was for us, but it can also be a negative decision. And I will try to cover more negative examples. So I have three more concrete examples of mistakes we made.
So second decision, a very important decision. We were at the end of 2014. We were about to launch what we call our distributed search network. This one is the ability to distribute the search in different geolocations without any pain for the customers. So we have one API, it’s crucial data. We distribute the search worldwide and then we redirect the user to the closest data center. Every customer can decide which locations they want to have to control their cost. We have 15-16 regions worldwide, but you may not want to use all of them, so you can control your costs.
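The selection logic described above (send each user to the closest region among the ones the customer has enabled) can be sketched in a few lines. This is only an illustration under stated assumptions: the region names, coordinates, and the `closest_region` function are hypothetical, and in the talk the actual routing is done at the DNS level, not in application code.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical region catalogue: name -> (latitude, longitude).
# These are illustrative values, not Algolia's real region list.
REGIONS = {
    "us-east": (39.0, -77.5),
    "eu-west": (48.9, 2.4),
    "ap-southeast": (1.35, 103.8),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def closest_region(user_location, enabled_regions):
    """Return the enabled region nearest to the user.

    `enabled_regions` models the customer's choice of locations:
    only regions the customer pays for are candidates.
    """
    if not enabled_regions:
        raise ValueError("customer has no enabled region")
    return min(enabled_regions, key=lambda name: haversine_km(user_location, REGIONS[name]))


if __name__ == "__main__":
    london = (51.5, -0.1)
    print(closest_region(london, list(REGIONS)))
    # A customer who did not enable the European region is served from elsewhere.
    print(closest_region(london, ["us-east", "ap-southeast"]))
```

A real deployment would route on network latency rather than geographic distance; distance is used here only because it keeps the sketch self-contained.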
The geo distribution is done by the DNS. And we were using a solution from Amazon AWS…. We had one big issue first, which was that the regions they were supporting were not the same as ours. At the moment of the launch we had 12 regions. Amazon was covering only 7 of them. So we were not able to do the geo routing for all our customers or all our regions. The second problem was a performance problem: when we were creating a specific entry for one customer, it was taking a few minutes. Meaning that in the onboarding flow of our users, it was taking a few minutes to set up the account, which was causing a huge drop-off in the process.
So we decided to look at all the solutions on the market. The engineering team was super, super small at that moment. We were in total four engineers working on the product, including me, including our VP of Engineering, so four in total. So a very, very small team. The first reaction was to look for an existing product. Of course, there was a Wikipedia page listing all the managed DNS providers available on the market. So we took them one by one. We contacted all of them, we looked at the offers. None of them was able to address our use case. So it was a big disappointment. I think there was a list of 20 different providers, and none of them had a solution that could suit our use case.
What we did was to select the most advanced ones and have a product discussion with them. Maybe they don’t have the solution on the market today, but maybe they have it in their roadmap. None of them was able to give us their roadmap or an ETA of when they would be able to address our use case. So we were stuck. As I mentioned, a good engineering team likes challenges. So in no time, engineers built a small prototype based on open source and we negotiated with a network provider to build an anycast network with our 12 regions to redirect the traffic to our custom-built DNS based on open source software. So the prototype was, I think, done in two days, a working prototype. Of course, not at scale. But then transforming a small prototype into a product in production, not with 10 API calls but with billions of API calls, is another story.
When we did this study and were about to sign the contract with the anycast provider, we took some time. The reason is that this provider wanted a minimum three-year commitment on the solution, which is huge for a small company like us. It was a pretty big budget, and the budget was so big that we decided to take some time to analyze the situation. By luck, we met someone in San Francisco who was launching a startup on DNS. So it was a completely random meeting in San Francisco, meeting a small startup building a managed DNS solution. The chance to have something that would fit our use case was super low, but still this small startup was building something in a domain where we had done a lot of analysis. So we had a super interesting discussion. And they proposed to change their roadmap and to address our use case in a month.
The company was super young; it had been founded just a few months before this discussion. So super early stage. And of course it was risky for us because we were a bit more advanced. It was key for our product. So relying on a very small startup is a bit risky, but we decided to take the risk. The big reason we took this risk is that the team behind the startup had been working on DNS for 20 years. So they knew a lot about the domain, and actually they made us realize that our custom-made solution was lacking a few things that could hurt us. For example, we had not anticipated DDoS attacks on the DNS, which is a big problem and has happened a lot of times. So it could hurt us. There were a lot of things we had not anticipated.
So we decided to use them at the last minute, maybe a few days before signing the contract with the provider and launching our own solution. And this story does not end at the moment we decided to use the startup because, as I mentioned, when your engineers get excited about one specific problem and there is no solution on the market, they try to think of a different solution. Like “Oh, we could do this specific feature. It would be 10 times better than their product.” Or “We could develop this one, which is also a bit of differentiation.”
So we had kind of a long roadmap of big differentiation on DNS compared to existing products while we were not experts on DNS, which seems a bit odd. But I think a good engineering team will always find this, and they will come to you and give you a big list of things you could do way better than anyone else, even if they have no strong expertise in this market. My understanding is that we always … There is this famous quote: “We always overestimate what we can do in a year and we underestimate what we can do in 10 years.” I think the big mistake here is that the team was completely overestimating what they could do in a one-year timeframe. But this SaaS solution gave us a lot of learning. All the crazy ideas we had, they actually developed them.
So we contracted with them and they gave us a lot of learnings. Most of our big ideas actually were not working in our use case. We would have spent a huge effort in R&D to try them and they were not working. With this SaaS product, they did all the investment. We tested it, and on our use case it was not working. So the investment was not crazy compared to the results. But we were about to make the mistake. And I think one of the key learnings is that finding a provider is super complex. It’s not just going to Wikipedia and looking through a list of providers. It’s looking for the startup, and maybe the startup does not exist yet and will exist in a year.
So it’s also about thinking about the complexity of switching to another solution. This three-year commitment would have been crazy. And if today I was in the position to take the decision, I would not go with the three-year contract. I would try to find a workaround. Maybe reducing the number of regions; maybe 12 was too many. Finding something where I don’t have a three-year lock-in. Three years is way too long to be locked into a situation that would surely cost us a lot of money just to maintain. So we were about to make a big mistake and, by a lot of chance, at the last minute we discovered this startup and decided to take the risk. Maybe the story would have been very different if we had developed this DNS ourselves. And I think it would have been the case because it would have taken all the R&D to develop something where we were not experts.
Another very similar decision. This one was way later. We were around 30 engineers. We are focused on performance. So of course, the monitoring of our API, the monitoring of performance, was key since the beginning. And we were using a SaaS product to monitor our infrastructure. But we hit a few limits in the solution we were using. And those limits were mainly because the solution we were using was designed to monitor a specific machine, whereas our service is distributed across a set of machines and we need to see graphs aggregated over several machines.
So again we discussed with several providers and we even had discussions about roadmap, discussions about what could be a good implementation for our use case. We discussed for a year with the provider we were using and there was still no delivery on what could make an impact for us. While we were discussing with providers, the engineering team one day decided to build a prototype. So they built a prototype internally based on open source using Grafana … Graphite. And in a day, they were able to have something better than our SaaS solution. And I think here maybe the mistake is to think that because in one day you have something which is better than what you have, it’s super easy to have your own solution, which does not cost so much money because it was built in a day, and which you can tailor to your needs.
I think there are a few drawbacks in this conclusion. The first one is that it’s a prototype built in one day. It was deployed on one machine, whereas we had at that moment 200 or 300 machines. So just taking this solution and making it scale to 300 machines is not a matter of hours. So the first big drawback is about scale. Then, even if it was better, when we were looking at what we were missing in this solution, the list was huge. We had dozens, if not hundreds, of missing features for our needs. So it means our team would have to develop all those features. And it means my roadmap would be putting some emphasis on those metrics, which are not game changing for the business. It’s not something which is exposed to customers. It’s something for us to tweak the performance of our service.
And I think that’s again a drawback. Thinking short term, building something in a day is great. It’s a good engineering effort. But in the long term, rather than putting our investment into something which is not game changing for our business, I prefer to pay for an external service rather than taking my best engineers and making them work on this aspect.
And I made a mistake at that moment. Because of the previous story about DNS, I tried to convince them to use an external solution. So I tried to discuss with them, to argue that the long-term cost would be super, super expensive and that we had to select something more efficient. I never managed to convince them. It’s pretty much one argument against another, and if you discuss one specific thing, they will come with another prototype, and it’s an endless discussion. And you will never manage to convince them. Again, good engineers have 10 different ways to explain to you that the best solution is to build and not buy.
I think at least one thing which is good with smart engineers is that they learn quickly, and they did learn from this previous experience with DNS. I never managed to convince them, but one day the lead engineer discovered a startup and we had exactly the same pattern: discover a startup building a super good solution for monitoring. They built a small prototype, they convinced the team, and we used this SaaS solution. But we had this long intermediate state with open source, and we had put some effort into open source. In hindsight, even this six-month investment in open source was a loss of time. During those months, we never put the open source in production. We never had any business impact because it was just some investigation. Using our previous provider and trying to push them a bit in our direction, even if they were not able to execute our vision of monitoring, would have been way better than spending our time building prototypes.
So again, a small team, even if 30 people is way bigger than four. But if you have three or four people in this team focusing their time on building a solution on open source, and then you argue and they spend more time, it’s endless. Basically you lose your R&D on something which is not game changing.
So what I learned in this story is: never try to convince them, but challenge them. Challenge them on the cost of developing their prototype, challenge them on why we cannot use our existing monitoring solution for six months to a year while we find a good solution on the market. Challenge them on every single decision, but do not try to convince them. It will never work, I think. My personal feedback is that you will never convince them, but you can challenge them and they will look at the problem differently. … don’t give them pretty much an order.
So for this one again, we took the decision to build internally, which was a bad decision. For six months, we built those prototypes, we did some iterations, and I tried to convince them the decision was not a good one. And a good solution came six months later from a new developer, because they found a good startup with a good product. And there is another thing which is funny in the story. The engineering team recognized that they would never have been able to develop the solution they discovered. And I think that’s the big learning with SaaS: your internal team will never be able to compete with a SaaS product. Never. It’s hopeless … If the solution is successful, the team will scale, and maybe today they have two engineers, but in two years they will have 50. Why would you stay with a limited effort and only two engineers working on it? So this solution was so good that they were convinced in a week that they would never be able to do the same thing. But again, they had taken the decision to build internally.
The last example I wanted to cover is a bit more complex. It’s a decision we redid several times. This one is about our analytics solution. On every search API call of our users, we record the API call and we give them some statistics. So we do this analysis all the time. When we created the product, the first version of the analytics was 200 lines of Ruby code, 1 hour of development, a super good investment. And this analytics is also business critical because we use it for billing. We count the number of API calls and that’s what we use for billing.
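A first version like the one described here, counting API calls per customer so the totals can feed billing, might look roughly like the sketch below. It is written in Python rather than the original Ruby, and the log line format (`<timestamp> <app_id> <endpoint>`) and the name `count_api_calls` are assumptions made for illustration; the talk does not describe the real format.

```python
from collections import Counter


def count_api_calls(log_lines):
    """Count API calls per application.

    Assumes (hypothetically) that each log line looks like
    '<timestamp> <app_id> <endpoint>'. Malformed lines are skipped.
    """
    calls = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # ignore blank or truncated lines
        app_id = parts[1]
        calls[app_id] += 1
    return calls


if __name__ == "__main__":
    sample = [
        "2013-01-01T00:00:00Z app_a /search",
        "2013-01-01T00:00:01Z app_b /search",
        "2013-01-01T00:00:02Z app_a /search",
        "",
    ]
    for app_id, total in sorted(count_api_calls(sample).items()):
        print(app_id, total)
```

A single-process counter like this is quick to write, which matches the one-hour investment in the talk, but everything runs through one loop on one machine, which is why such a design stops scaling as traffic grows.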
At the beginning of the company, we knew the solution would not scale, 200 lines of code, but it was perfect. It was not even worth looking at the existing solutions on the market. In one hour, it was solved. But we knew it was not something that could scale. A few months later with the growth, not even a year after, we had to redo this part, which was expected. We knew it was not able to scale and we gave one engineer the autonomy to look at the problem, but not at the market. What we gave him was a mission to replace this component that was not scaling. So what he did was take a better programming language: he moved from Ruby to Java, which is better in terms of performance. Instead of 200 lines of code in one thread, he multiplied the number of threads, so it gave us some runway.
This development was good in the sense that it was able to run 100 times the load of the previous application and it gave us two years. For two years, we were able to use this software. But then again the software was not scaling. We looked at the solutions on the market and again it was a similar situation. Nothing was a perfect fit for our problem, and especially, our scale was so big (we were already processing billions of API calls) that the price was too high for us. We would lose money by using an analytics provider.
We made a mistake at that moment, which was to say: okay, it’s not possible because of the pricing, let’s look for our own solution. In the end we used a hybrid solution, which is a mix of APIs plus custom code, and we use the GCP platform for that with … Dataflow, a lot of different products. But it took us close to a year to develop it. We hit a lot of problems, a lot of bugs, because those APIs had never been used at this scale. We had a lot of problems that we had not anticipated. But during this time, we had already reached the limit of the previous program.
So one of our best engineers, for close to a year, had to do his best to maintain a solution that was crashing several times a day. Spending a lot of time just on maintaining a solution is not good. In the end, with some step back, we should have paid for the very expensive solution for a year. We would have lost money for sure, but I would have gained the time of one of my best engineers, who could have worked on something more critical rather than maintaining a solution that was supposed to die. So that’s another mistake: looking at the cost maybe too early. Spending some money for a limited period of time would basically have bought us some time, and the time of one of our best engineers, which is a big thing. We never have too many engineers. I have never met anyone who has too many engineers.
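The trade-off in this story (pay an expensive vendor for a limited period, or spend engineering time building and then maintaining) can be made explicit with a back-of-the-envelope comparison. The sketch below is an illustration only: the function names and every number in the usage example are hypothetical and do not come from the talk, and the model ignores the opportunity cost of what the engineer could have shipped instead, which is the speaker’s main point.

```python
def cost_of_buying(monthly_fee, months):
    """Total vendor cost over the period."""
    return monthly_fee * months


def cost_of_building(engineer_monthly_cost, engineers, build_months,
                     maintenance_fraction, horizon_months):
    """Total internal cost over the horizon.

    Full engineering cost during the build phase, then a fraction of
    that cost every month afterwards for maintenance.
    """
    build = engineer_monthly_cost * engineers * build_months
    maintain_months = max(horizon_months - build_months, 0)
    maintenance = (engineer_monthly_cost * engineers
                   * maintenance_fraction * maintain_months)
    return build + maintenance


if __name__ == "__main__":
    # All figures below are made-up placeholders.
    buy = cost_of_buying(monthly_fee=8_000, months=36)
    build = cost_of_building(engineer_monthly_cost=15_000, engineers=1,
                             build_months=12, maintenance_fraction=0.5,
                             horizon_months=36)
    print("buy:", buy, "build:", build)
```

The useful part of such a model is the `maintenance_fraction` term: a prototype built in a day makes the build cost look small, while the recurring maintenance over the full horizon is what is usually underestimated.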
So in the end, those decisions are critical for the business. Those decisions are pretty much about reinventing the wheel. That’s something we commonly hear in engineering, and we always make mistakes. My feedback is that it’s super difficult not to make any mistakes. We all make mistakes and I think we have to deal with it. And one of the best ways to deal with it is to accept that we can replace components easily over time, and not to think about building something for the long term. The cost of maintaining a solution can be super, super high.
So how do you avoid reinventing the wheel? I think one of the things I am the most proud of is probably that our engineers have learned a lot. Those engineers have learned way faster than I personally learned in my career, because they saw all those mistakes very quickly in a fast-growing environment, and I’m super proud of what they … They became better leaders, engineering leaders. So have some very strong experience in your team. People who have a lot of experience and made the mistakes several times: I think it’s one of the best ways to avoid that.
Then of course, think about keeping a solution only for the short term. I have a very funny story about that. I worked in a big group that in the ’80s developed specific software to version source code. This software was great. Today they still use it. This software is 10 times worse than any solution on the market and they have a whole team to maintain it. So 30 years later, they still maintain their internal choice and they are not able to migrate. So the earlier you can move from something you built to an external solution, the earlier you can get your focus back. And the speed of iteration, the focus, I think is a critical element in any engineering team. The speed of iteration is what makes the difference. And I think one of the only good reasons to build for the long term is that you have a factor 10. A factor 10 on your product, a factor 10 on anything: then it’s a good reason to build.
So that’s it. I think we have one minute for questions. So first question: Which companies should not be doing infrastructure on their own? I think most companies. I think it’s really an exception to build infrastructure on your own. I think 99% of companies, or 99.99%, should not build their infrastructure. I think you need to have a big bet, and it needs to make a huge difference in your product, in your company, in your business, to have a good reason to build it. I think if we had discovered that performance was not a differentiation, we would have moved to the cloud. So I think most companies should not build their infrastructure.
Another question: How do you navigate the make versus buy decision with multiple Big Data software vendors who have more offerings than ever? So the complexity is definitely an issue. When you evaluate all the solutions, I never want my team to think in terms of features. You have the product with the feature set as it is today, but if you buy a solution, you need to project your decision over the long term. So look at the roadmap, discuss the roadmap with the provider when it’s strategic, look at what the product will be in a year, in two years. Discussing their vision of the market and whether it can help you is, I think, a very good way to make the big difference and move away from a feature-list comparison and from the complexity of the solutions. And another thing is looking at the integration cost. If it takes you two years to migrate to a solution, then it’s an issue.
Another question: What do you think are the main benefits of an internal build over a vendor solution? I think there is only one for me, which is a factor 10. If you have this factor 10 in one specific area which is important for your business, it can be the UX, it can be the performance, it can be the relevance, it can be anything. If you have this factor 10 and it’s something you can measure, if it’s your strong difference on the market, then it makes sense to own it. If it’s not your factor 10 and it’s more a feature of your product, then I think it does not make sense. If it’s your core product, your core differentiation, I would recommend to build. If it’s not, then just buy. And find a good provider.