How @gorgiasio went from coding to production-ready AI agents for 16,000+ customers … in 6 months with CTO @humanfromearth:
1⃣ 50 customers in beta
2⃣ Wrote playbook for successful activation
3⃣ Shipped most common actions first (Shopify, etc)

— Jason Lemkin (@jasonlk), January 25, 2025
Since its founding in 2015, Gorgias has grown into a significant player in the customer service automation space, backed by $100 million in funding (including SaaStr Fund leading its seed round), almost $100m in ARR, and powered by a team of 300 people. Everything changed in 2023 as automation became the focus, and then again in 2024 as full AI agents were rolled out across its customer base.
Their rollout of AI customer service agents across 500+ top brands offers valuable insights into the future of automated customer support.
Here’s the story – and deep dive.
5 Non-Obvious Learnings from the Trenches:
Through their journey of deploying AI customer service at scale, Gorgias discovered several surprising insights that aren’t commonly discussed:
- Bigger Isn’t Always Better with AI Models: Contrary to popular belief, larger language models don’t necessarily perform better for customer service. Gorgias found that splitting context into specialized pieces (routing, summarization, generation, and arbitration) often outperforms a single large model. For chat interactions, they actually use smaller models to meet latency requirements without sacrificing quality.
- The First Error Is Often the Last Chance: One of the most critical discoveries was that customers who experience a single significant error with AI automation rarely give it a second chance. This makes the cost of switching to a competitor particularly high and emphasizes why having robust safety checks is more important than rapid deployment.
- Prompt Engineers Don’t Need to Be Technical: Some of Gorgias’ most effective prompt engineers came from customer service backgrounds rather than technical roles. Their deep understanding of customer interactions and support scenarios proved more valuable than technical expertise when it comes to crafting effective AI responses.
- Auto QA Changes the Game: Using AI to evaluate AI turned out to be surprisingly effective. Gorgias developed an “Auto QA” system that uses large language models to analyze customer reactions, evaluating sentiment, satisfaction, and resolution. This transformed how they track and improve performance, shifting from manual review to automated analysis.
- Channel-Specific Optimization Is Crucial: Different communication channels require fundamentally different approaches. Email can handle longer processing times and more complex models, while chat requires near-instant responses. This meant developing channel-specific prompts, orchestration, and model selections rather than using a one-size-fits-all approach.
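The channel-specific point above can be sketched as a simple configuration lookup: each channel maps to its own model and latency budget. Everything here (the model names, the budgets, the `ChannelConfig` shape) is a hypothetical illustration, not Gorgias’ actual setup:

```python
# Illustrative sketch: choosing a model configuration per support channel.
# Model names and latency budgets are made-up values, not Gorgias' real ones.
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelConfig:
    model: str               # which LLM to call for this channel
    latency_budget_s: float  # ceiling before falling back / escalating


CHANNEL_CONFIGS = {
    # Email tolerates slower, larger models; chat needs near-instant replies.
    "email": ChannelConfig(model="large-model", latency_budget_s=60.0),
    "chat": ChannelConfig(model="small-fast-model", latency_budget_s=3.0),
}


def config_for(channel: str) -> ChannelConfig:
    try:
        return CHANNEL_CONFIGS[channel]
    except KeyError:
        raise ValueError(f"unsupported channel: {channel}")
```

The same lookup could also carry channel-specific prompts and orchestration settings, which is what the article describes Gorgias doing in practice.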
Most importantly, the market for AI customer service is still in its early stages. The technology improves every month, and the companies investing now are positioning themselves for significant advantages in the years to come.
The Path to AI Customer Service
When Gorgias began developing their AI customer service solution, they knew they needed to balance ambition with practicality. They started with a focused alpha phase, working closely with just 10 carefully selected brands. This tight feedback loop proved crucial, allowing them to iterate quickly and build a solution that actually solved real customer problems.
The development process took three months of intensive work before they had an alpha version ready for testing. From there, they expanded to a beta phase with 50 brands, using the insights gained to create a comprehensive playbook for customer activation. Six months after starting, they were ready for general availability.
What Makes It Work
The success of Gorgias’ AI agents relies on several key components:
First, there’s the data foundation. The AI agents pull from multiple sources: help center articles, customer order information, macros (predefined response templates), and integration with platforms like Shopify, Recharge, and Loop. This comprehensive data access allows the agents to handle complex customer inquiries effectively.
Second, they built robust control systems. Brands can customize their AI agents’ tone of voice, filter and organize metadata, and set exclusion topics for sensitive issues. A preview mode lets human agents validate AI responses before they go live, while safety sampling allows for gradual rollout of automation.
Third, they invested heavily in the technical infrastructure. A team of 20 engineers, including five focused specifically on AI orchestration, works continuously to improve the system. They employ multiple specialized AI models for different tasks and maintain a separate validation model to ensure safety and accuracy.
Real Results and Challenges
The numbers tell an interesting story. Across their 500+ brands using AI agents, Gorgias sees average automation rates of around 10%, with top performers achieving 30% automation. Even more impressive, their A/B testing shows a 5% increase in gross merchandise value (GMV) when AI agents are deployed effectively.
However, these results didn’t come easily. The team faced several significant challenges:
The infrastructure for building and deploying AI agents is still immature. Testing, evaluation, orchestration, and observability all required significant custom development. Different channels (email vs. chat) needed specialized approaches, with chat requiring faster, lighter models to meet latency requirements.
Customer expectations also proved challenging. Today’s businesses expect full automation – not just answering questions, but performing actions like processing returns or updating order information. This requires careful integration with external systems and robust safety checks to prevent errors.
The Future of AI Customer Service
Looking ahead, several trends are becoming clear. AI models are getting faster and cheaper, while open-source alternatives are becoming increasingly viable. The role of human agents is evolving from handling individual tickets to analyzing patterns and optimizing automation.
Gorgias has developed a structured approach to help brands succeed with AI customer service. Their “30 in 30” program aims to achieve 30% automation within 30 days, starting with basic help center content and progressively adding more complex capabilities. They’ve also created an ROI calculator that helps brands understand the potential cost savings from automation.
Deep Dive:
Controlling AI Agents and Addressing Errors
- To control AI agents, three key aspects are considered: tone of voice, filtering and rearranging metadata, and exclusion topics, allowing customers to control how the AI agent speaks and responds to certain topics.
- Filtering and rearranging metadata is crucial so the AI agent receives clean, relevant information and can function properly.
- Exclusion topics enable customers to control which topics the AI agent should not respond to, such as threats, allergy claims, or medical-related issues, and instead escalate them directly.
- A feature called “guidance” is used to feed relevant articles into the AI agent, providing private and internal information that is not exposed to customers.
- Macros, which are used by human agents, are a good source of information to feed into the AI agent, and actions, such as integrating third-party APIs, are also essential for full automation.
- Actions require strict filters, especially for dangerous actions involving money, and need to be integrated with the AI agent for full automation.
- To address concerns about errors, features such as exclusion topics, preview mode, and safety sampling have been developed, allowing customers to roll out the AI agent in a safe and controlled manner.
- Preview mode creates a draft that human agents can validate and change, while safety sampling sends only a percentage of traffic to the AI agent before rolling it out progressively.
- The “playground” feature allows customers to test different scenarios without going into production, enabling them to identify and address potential issues such as hallucinations and factual errors.
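The safety-sampling and preview-mode controls described above boil down to a small routing decision per ticket. A minimal sketch, assuming hash-based bucketing (a common technique for stable rollouts, not confirmed as Gorgias’ mechanism) and hypothetical names:

```python
# Sketch of "safety sampling": route only a configurable percentage of
# tickets to the AI agent, deterministically per ticket, so the rollout
# can be widened gradually. Names and mechanism are assumptions.
import hashlib


def in_ai_sample(ticket_id: str, sample_pct: float) -> bool:
    """Deterministically bucket a ticket into the AI sample.

    Hashing the ticket id (rather than calling random()) keeps the
    decision stable if the same ticket is re-processed.
    """
    digest = hashlib.sha256(ticket_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_pct / 100.0


def handle_ticket(ticket_id: str, sample_pct: float, preview_mode: bool) -> str:
    if not in_ai_sample(ticket_id, sample_pct):
        return "human"                      # outside the sample: humans answer
    if preview_mode:
        return "ai_draft_for_human_review"  # AI drafts, a human validates
    return "ai_auto_reply"                  # fully automated response
```

Raising `sample_pct` over time is the "progressive rollout" the article mentions; preview mode is just a flag that downgrades an automated reply to a reviewable draft.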
Improving AI Agent Performance and Rollout Strategies
- Mitigating the limitations of AI agents can be achieved by specializing LLM prompts, investing in a test pipeline, and using evaluation data sets, which can significantly reduce errors if done carefully.
- Rolling out AI agents to cover 100% of traffic can lead to increased costs compared to co-pilot implementations, which are more on-demand.
- The key to successful AI agent implementation lies in balancing precision and recall, as being too safe or too aggressive can lead to errors and extra work for customers.
Expectations and Performance of AI Agents
- The expectation for AI agents is higher than for human customer support, requiring them to be better and cheaper.
- Larger models do not necessarily perform better, and splitting context into multiple specialized pieces, such as routing, summarization, generation, and arbitration, can improve performance.
- The arbitration step, which validates responses before they are sent to customers, is crucial in ensuring that AI agents respond like human support agents.
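The split into routing, summarization, generation, and arbitration can be sketched as a short pipeline. The step functions below are trivial stand-ins for LLM calls (the real steps would each call a model); only the orchestration shape is taken from the article:

```python
# Hedged sketch of a multi-step pipeline: several small, specialized steps
# instead of one large model. Each function is a stand-in for an LLM call.
from typing import Optional


def route(ticket: str) -> str:
    # In production this would be a classifier/LLM picking an intent.
    return "order_status" if "order" in ticket.lower() else "general"


def summarize(ticket: str) -> str:
    return ticket[:100]  # stand-in for an LLM summary of the thread


def generate(intent: str, summary: str) -> str:
    return f"[{intent}] Thanks for reaching out about: {summary}"


def arbitrate(reply: str) -> bool:
    # Validation step: does the reply look like an on-brand support answer?
    # A separate model would judge this; here a trivial check stands in.
    return reply.startswith("[") and len(reply) < 500


def answer(ticket: str) -> Optional[str]:
    intent = route(ticket)
    summary = summarize(ticket)
    reply = generate(intent, summary)
    return reply if arbitrate(reply) else None  # None => escalate to a human
```

Because each step is small and separate, each can use the cheapest model that handles its job, which is how the article explains smaller models beating one big one.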
Building AI Agents: Accuracy vs. Forgiveness
- Businesses must decide what they value more, accuracy or forgiveness, when building AI agents, as this will impact the agent’s design and functionality.
External Actions and Automation Challenges
- Retrieval methods, such as RAG (retrieval-augmented generation) versus fine-tuning, can be useful for specific use cases, but may not be necessary for most applications.
- The default expectation for businesses today is full automation, and the ability to perform external actions is key in customer service, as simply responding to questions is only half of the work.
- External actions are important, but there are challenges associated with them, such as the need for strict filters outside of large language models to ensure safety and prevent errors.
- To avoid errors, it’s essential to have filters in place, such as checking the LTV of a customer before performing an action, which is processed outside the large language model.
- The biggest fear is that customers will try the automation, experience one error, and then never try it again, resulting in a high cost of switching to a competitor.
- Advanced actions may require calling multiple APIs, retrieving information, and making decisions based on that information, which can become complex and resemble a workflow with multiple steps.
- The complexity of building a useful AI product can be underestimated, and it takes longer than expected to build a product that meets expectations.
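The "strict filters outside the LLM" point above, including the LTV check, can be sketched as a deterministic policy gate that runs before any money-moving action. The thresholds and field names here are hypothetical illustrations, not Gorgias’ actual rules:

```python
# Illustrative guardrail: before an AI agent performs a dangerous action
# (e.g. a refund), deterministic checks run *outside* the LLM, so the
# model cannot talk its way past them. All thresholds are made up.
def allow_refund(customer_ltv: float, refund_amount: float,
                 refunds_last_30d: int) -> bool:
    """Hard-coded policy gate; the LLM cannot override it."""
    if refund_amount > 200:           # large refunds always go to a human
        return False
    if refunds_last_30d >= 3:         # possible abuse pattern
        return False
    if refund_amount > customer_ltv:  # refund exceeds what they ever spent
        return False
    return True
```

The key design choice is that this code never consults the model: even if a prompt injection convinced the LLM to attempt a refund, the gate still blocks it.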
Rollout and Initial Results
- Despite the challenges, it’s worth pushing forward, and good products take time to develop, with many different components involved in the process.
- The rollout of AI agents started with an alpha version, which took 3 months to develop, and then a more production-ready version was created with the help of 10 eager brands who provided tight feedback.
- To ensure timely feedback, the brands were told that if they didn’t respond quickly, they would be replaced, which helped to get the feedback needed to meet the 6-month timeline for general availability.
- The initial results of the AI agent rollout showed around 10% average full automation, with 30% automation for the best customers, with only help center articles and no actions available at the time.
- The beta phase involved 50 brands and the creation of a playbook for activation, which was a collaborative effort between the product marketing team, success team, and others.
- The playbook was useful in the success of the rollout and it is recommended to involve the teams that will be onboarding customers and selling the product early on.
- The rollout continued with the addition of more data sources, such as macros, public pages, and ecosystem integrations like Shopify, Recharge, and Loop.
- The rollout also introduced basic HTTP actions and went from sampling to 100% rollout for email.
- The actual release of the AI agent coincided with the company’s first user conference, which generated buzz on social media, and the website was changed to be AI-first.
- The current results show around 500 brands using the AI agent, with around 10% automation and 40% top-line growth.
Key Learnings and Future of AI Agents
- The main learnings from the rollout are that aligning with everyone in the business takes time and that there will be pushback, so it’s essential to spend an inordinate amount of time ensuring everyone is pushing in the right direction.
- The key takeaway is that full automation is the new expectation, and it takes time and resources to build, but the cost will go down with advancements in technology and infrastructure.
- The development of successful large language models is still in its early stages, and infrastructure tooling is not yet good, with a need for better testing, evaluation, orchestration, and observability.
- Building software around AI agents is a relatively new approach, and the tooling and infrastructure behind it are not yet mature, making development slower; market options are also limited, at least for GPT-4-level models.
- Faster and cheaper models with the same level of accuracy are preferred, and waiting for more advanced models like GPT-6 or GPT-7 is not necessary.
Managing Tech Debt and Innovation
- Handling tech debt versus innovation is challenging, and it’s essential to align the organization internally by allocating specific resources for innovation and communicating the plan to everyone.
- Aligning at the top and allocating resources for innovation, such as 30%, is crucial, and it’s essential to have a clear discussion about what can be delivered with the allocated resources.
- On the engineering side, it’s necessary to paint a picture of the real use case and the expected results, and to be willing to sacrifice some other things to deliver the innovation.
- Creating a sense of urgency, such as a deadline for a critical event like Black Friday, can help drive the team to deliver on time.
User Reactions and Disclosure
- End users’ reactions to AI agents popping up to help convert a sale at a key moment in the sales funnel can be negative, with some users aborting the session when an agent intervenes, as observed in live sessions on an e-commerce platform.
- The decision to disclose whether a customer is interacting with a bot or a human is left to the customer, but it is recommended to disclose that it’s an AI and not a human, as people generally don’t care as long as their issue is solved.
Sales Impact and Future Iterations
- On average, the use of AI agents has been able to influence around a 5% GMV uplift in e-commerce sales, as tested by an A/B test where 50% of customers received a popup and the other 50% did not.
- A chat bubble is considered a more friendly and less intrusive way to interact with customers compared to traditional popups, and it has been found to increase sales.
- The next iteration of the AI agents is to have a conversation with an AI that is trained to do sales when the chat bubble comes up.
- To protect against chatbots going haywire and responding inappropriately to certain questions, a validation step is in place to ensure that the response looks like a typical customer service response given the guidelines of the customer.
- If a question is asked that is outside of the guidelines, the response will be blocked and escalated to a human agent.
- The company uses a separate LLM for this validation step to keep it isolated and unbiased from the rest of the pipeline.
Pricing and Model Management
- The company charges customers based on success, only charging for full automation when a ticket is closed and resolved.
- The models used by the company are changed often, with different steps having different models, and the team tests something new every week.
- The ability to roll out AI agents to 16,000 small and medium-sized business (SMB) brands is attributed to the investment in an evaluation pipeline, which includes data sets for each step and utilizes various models such as GPT-4, a smaller “mini” model, or Claude Sonnet, depending on the specific needs.
- A safety net is in place when deploying AI agents, allowing for testing before they go into production, which helps to mitigate potential risks.
Team and Resources
- The resources required to hit the three-month alpha and six-month general-availability milestones included a team of around 20 engineers working on the project in various capacities.
- The team consisted of engineers working on data sources, a team of five focused on AI orchestration, and a prompt engineer who configured prompts and created data sets.
- Roles were shifted within the team, with some technical personnel taking on new responsibilities, such as the prompt engineer, who came from a customer service background.
- Each member of the exec team was assigned a customer to onboard and report on weekly, which helped to propagate success stories and feedback throughout the team.
- Collaboration between teams, including the platform team, was crucial in ensuring alignment and successful integration of the AI agents.
- The initial alpha phase started with a few engineers and grew to around 20, all working together to achieve the goal.
Pricing Strategies and Customer Onboarding
- Pricing strategies, such as usage-based pricing, can be effective in aligning with the cost structure and resonating with small customers, as it makes sense that they pay more when they use the product more and less when they use it less.
- Gorgias works with big brands and offers packages with separate pricings for different products, such as automation, voice, SMS, and convert, allowing customers to control costs and predict expenses based on historical data or traffic in their support size.
- The company’s approach is usage-based, but with a limit, so the AI agent stops working after a certain point, and customers can predict how much they’ll need based on their data.
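The "usage-based with a limit" pricing described above reduces to a small calculation: bill per automated resolution, and pause the agent once a monthly cap is hit. The prices and cap below are illustrative numbers, not Gorgias’ actual pricing:

```python
# Sketch of usage-based pricing with a hard cap: pay per automated
# resolution, and automation pauses at a monthly limit so spend stays
# predictable. All figures are hypothetical.
from typing import Tuple


def monthly_bill(resolved_tickets: int, price_per_ticket: float,
                 monthly_cap: int) -> Tuple[float, bool]:
    """Return (amount_due, agent_paused)."""
    billable = min(resolved_tickets, monthly_cap)
    paused = resolved_tickets >= monthly_cap
    return billable * price_per_ticket, paused
```

A customer can estimate `monthly_cap` from historical ticket volume, which is how the article says brands predict what they will need.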
- To implement the AI agent, customers need to create guidances and a help center, and onboard the system, which is not just a matter of flipping a switch.
Personalization and Performance
- Gorgias balances personalization and performance by leveraging the similarities in use cases among e-commerce brands, such as returns and order tracking, to deliver a good experience without requiring extensive customer involvement.
- The company uses a technique of having multiple specialized AI agents, each tuned to a specific vertical or question type, such as healthcare, to improve performance and personalization.
- This approach allows for better handling of nuances in different verticals, such as healthcare, and enables the company to deliver more effective solutions.
Channel Optimization and Latency
- The company does not currently support SMS as a channel for automation, but they are releasing a new feature that might allow for partnership on this in the future.
- Different channels require adjustments to the prompt, orchestration, or models to achieve the same level of performance, with chat requiring more sophisticated models due to latency expectations.
- To address latency, the company uses smaller models for chat, such as “mini” instead of “turbo”, which has also reduced costs on the email side.
- The company informs customers that cheaper models are used for chat due to latency needs, but expects this to be a temporary solution as technology improves.
Tools and Data Management
- The company uses the tool PromptLayer for template registry, evaluation, and A/B testing, choosing this vendor for their eagerness to develop and mature the technology.
- To avoid overfitting, the company keeps their data fresh by adding new data and representative samples to the dataset, and includes a trace ID in every response to track and address bugs.
- The process of creating and managing data sets for AI agents involves a lot of manual labor, which can be time-consuming and labor-intensive.
- To scale this process, the team is increasing the number of prompt engineers, but currently, they are doing it manually to ensure affordability.
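The trace-ID practice mentioned above is simple but worth sketching: every response carries an identifier that links it back to a record of the pipeline run that produced it. The structure here is a hypothetical illustration:

```python
# Sketch of attaching a trace ID to every AI response so a reported bug
# can be traced back to the exact run that produced it. The dict-based
# "trace store" stands in for a real logging/observability backend.
import uuid


def respond_with_trace(reply_text: str, trace_store: dict) -> dict:
    trace_id = uuid.uuid4().hex
    trace_store[trace_id] = {"reply": reply_text}  # plus prompts, models, etc.
    return {"reply": reply_text, "trace_id": trace_id}
```

When a customer reports a bad answer, the trace ID in that answer is enough to pull up exactly which prompts and models were involved.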
Customer Playbook and Success Measurement
- The company has created a Playbook that provides a DIY guide for customers to build and understand how to measure their success with AI agents.
- The Playbook has two versions: one for customers and one for internal use, with different levels of detail.
- The key performance indicator (KPI) for measuring success with the AI agent product is automation rate, which is measured by the percentage of tickets closed by the AI agent within 72 hours.
- A ticket is considered closed if the AI agent responds, 72 hours pass, and the customer does not respond or responds with a “thank you” message.
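The closed-ticket rule above translates directly into code: the AI replied, 72 hours passed, and the customer either stayed silent or only said thanks. The "thank you" matching below is deliberately naive (a real system would classify the follow-up with a model):

```python
# Sketch of the automation-rate KPI: a ticket counts as auto-closed if
# the AI replied, 72 hours elapsed, and the customer did not respond or
# responded only with a thank-you. Matching logic is a simplification.
from datetime import datetime, timedelta
from typing import List, Optional

THANKS = {"thank you", "thanks", "thanks!", "thank you!"}


def is_auto_closed(ai_replied_at: datetime, now: datetime,
                   customer_followup: Optional[str]) -> bool:
    if now - ai_replied_at < timedelta(hours=72):
        return False
    if customer_followup is None:
        return True
    return customer_followup.strip().lower() in THANKS


def automation_rate(closed_flags: List[bool]) -> float:
    """Share of tickets auto-closed by the AI agent."""
    return sum(closed_flags) / len(closed_flags) if closed_flags else 0.0
```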
Role of a CTO and Technology Leadership
- As a CTO, having deep knowledge of technology is crucial, especially as the team scales, but it’s also important to be a good manager and balance technical expertise with people management skills.
- A CTO should aim to be both a technical expert and a good manager, with a balance between the two.
- A technology leader should understand the top business needs and communicate them clearly to the team, focusing on the top three things the business requires, and then identify the top technology needs to support those business needs.
- The technology leader should prioritize their involvement in certain areas and be hands-off in others, depending on the stage of the business and its evolving needs, and communicate this to the team.
- The technology leader should provide a playbook on how to use them and when to involve them in decision-making, to ensure the team knows what’s important and what’s not.
Customer Implementation and ROI
- A typical implementation with a customer involves a 30-day onboarding program called “30 in 30,” which aims to achieve 30% automation in 30 days, requiring collaboration and commitment from both the customer and the company.
- The company uses a calculator to measure ROI, which estimates the time saved and the resulting cost savings for the customer, allowing them to plug in their own numbers and see the potential benefits.
- The amount of customer data needed to train the AI agent depends on the specific requirements of the customer, but the company works closely with the customer to determine the necessary data and build the automation together.
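An ROI calculator like the one described lets a brand plug in its own numbers. The formula below is an assumption about how such a calculator might work (automated tickets × handling time × agent cost), not Gorgias’ actual model:

```python
# Minimal sketch of an automation ROI calculator. The formula and all
# inputs are illustrative assumptions the customer would supply.
def monthly_savings(tickets_per_month: int, automation_rate: float,
                    minutes_per_ticket: float,
                    agent_cost_per_hour: float) -> float:
    automated = tickets_per_month * automation_rate
    hours_saved = automated * minutes_per_ticket / 60.0
    return hours_saved * agent_cost_per_hour
```

For example, 10,000 tickets a month at 30% automation, 6 minutes per ticket, and $20/hour works out to $6,000 a month in saved handling time.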
Achieving Higher Automation and Data Requirements
- To achieve 40% automation, it’s essential to reserve time for in-person interactions, especially for high-value customers, and focus on specific use cases, such as returns, to improve automation incrementally.
- The amount of customer data needed to train AI agents depends on the use case, and it’s possible to start with simple data sources like help center articles, which can already provide 10% automation.
- Developing an intuition about how AI agents work is crucial, and it’s recommended to start with small milestones, such as 10% automation, and gradually increase the target.
- Many customer questions can be answered using public information available about the company, making it easier to automate responses.
Open LLMs and Vendor Agnosticism
- If open LLMs become unavailable, workarounds include fine-tuning existing models, which can still achieve good results, and exploring open-source alternatives that are catching up with proprietary models.
- The goal is to be vendor-agnostic and choose the best-performing model, regardless of the vendor, with a focus on accuracy rather than recall.
- The cost of AI models is decreasing, making it more accessible to use high-performance models, and open-source alternatives are becoming more viable.
Co-pilot vs. Full Automation and Error Tracking
- The decision between using co-pilot versus full automation is an art that can be made into a science by using leading indicators from a data perspective.
- To track error rates, a metric called Auto QA is being developed, which uses large language models to analyze customer reactions to responses and evaluate sentiment, satisfaction, resolution, and grammar.
- The Auto QA metric helps identify areas where AI managers should focus their attention, such as tickets that require review, and provides a way to prioritize and debug issues.
- The metric can be used to create graphs and interfaces for support agents to review and correct issues, changing their role from responding to every ticket to analyzing statistics and drilling down into specific problems.
- Support agents will use the Auto QA metric to identify the source of problems, correct them, and update public information, making their job more focused on analysis and correction rather than responding to every ticket.
- The use of large language models for Auto QA is beneficial not only for generating text but also for classifying and evaluating customer responses.
- The development of Auto QA is part of the company’s efforts to improve the efficiency and effectiveness of its AI-powered support system.
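The Auto QA idea can be sketched as a scoring function over the customer’s reaction. In production an LLM would produce these judgments; the keyword stub below stands in so the surrounding bookkeeping is runnable, and the output fields are hypothetical:

```python
# Hedged sketch of "Auto QA": score a customer's reaction to an AI reply
# on sentiment and resolution, and flag tickets for human review. The
# keyword heuristics stand in for an LLM classifier.
def auto_qa(customer_reaction: str) -> dict:
    text = customer_reaction.lower()
    resolved = any(w in text for w in ("thanks", "solved", "perfect"))
    negative = any(w in text for w in ("angry", "useless", "terrible"))
    return {
        "sentiment": "negative" if negative else "positive",
        "resolved": resolved,
        "needs_review": negative or not resolved,  # queue for an AI manager
    }
```

Aggregating `needs_review` flags over all tickets gives the review queue the article describes: agents stop answering every ticket and instead drill into the flagged ones.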

