Payroll compliance is a business where one wrong answer can cost $250,000. Papaya Global runs it across 160 countries, and they kept losing a quiet battle. At 2am, when a client has a real question, like whether they can terminate a worker in Germany, the client does not open Papaya’s knowledge base. They open ChatGPT. It answers in seconds, sounds completely sure of itself, and is sometimes completely wrong.
That is the problem Papaya’s VP of Client Success, Sivanne Fishel, and Head of Product Design, Hagit Ben-Tzur, came to SaaStr AI Annual 2026 to solve. Their session was titled “Your AI agent will get you sued. Here’s how we fixed ours,” and it is a clinic in what it takes to ship an agent in a domain where being wrong is expensive.
Start with the part most founders miss. Papaya’s competition was never another payroll vendor. It was a free chatbot the client already had open in another tab and already trusted enough to ask. Every B2B company selling into a regulated workflow is now in this position. Your customers are running your domain through a general model whether you build anything or not. The only question is whether you can be more trustworthy than that model at 2am, on the exact question that carries the liability.
And the lesson underneath the whole build is the one most teams get backwards. Papaya had a working agent in four weeks. Getting their clients to trust it enough to stop opening ChatGPT took four months. The build was the easy part.
The test that started it
Ben-Tzur started simple. She took a real Brazilian employment contract, a CLT contract, and handed it to Claude. Then she handed the same contract, with the same question, to ChatGPT.
Both looked confident. Both gave completely different answers. When she checked both against the actual law, neither got it fully right.
This is the trap every vertical AI founder walks into: assuming that the model plus your data equals a product. The Brazilian contract proved the opposite. A general model does not know what it does not know about your domain, and it will say the wrong thing with total conviction. The problem was not which model she picked. The problem was that nobody had taught any of the models to think about compliance. So she stopped asking which AI was better and started asking why they were wrong.
Turn every failure into a rule
Every time she found a mistake, she wrote a rule. Rule four: do not guess jurisdiction. Rule eight: do not sound confident if you are unsure. Rule eighteen: do not flag an issue without citing the right law. One at a time, over weeks, until there were 22 rules.
What she was actually building was an eval-driven rules library, and that library is the part competitors cannot shortcut. Each rule is a piece of institutional knowledge made executable, the kind of judgment a senior compliance lawyer carries in their head and a model has no access to. The library compounds. Every new failure Papaya catches makes the product permanently better, and a competitor starting today has zero of those rules. This is what “domain expertise as a moat” looks like in practice. It is not a slide. It is 22 specific corrections, each one earned from a specific way the model got it wrong.
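One way to picture "a failure turned into a rule" is as a small executable check that runs against every draft answer. The sketch below is illustrative only. The `Draft` and `Rule` types, the field names, and the 0.8 confidence cutoff are assumptions made for the example, not Papaya's implementation. Only the three rule numbers and their wording come from the session.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Draft:
    """A hypothetical draft answer produced by the model."""
    jurisdiction: Optional[str]  # e.g. "BR"; None if it was never established
    confidence: float            # self-reported confidence, 0.0 to 1.0
    hedged: bool                 # True if the wording signals uncertainty
    issues: List[dict]           # each issue: {"text": ..., "citation": ... or None}


@dataclass
class Rule:
    number: int
    description: str
    check: Callable[[Draft], bool]  # returns True when the draft passes


# Assumed cutoff: below this, an unhedged answer counts as overconfident.
CONFIDENCE_CUTOFF = 0.8

RULES = [
    Rule(4, "Do not guess jurisdiction",
         lambda d: d.jurisdiction is not None),
    Rule(8, "Do not sound confident if you are unsure",
         lambda d: d.confidence >= CONFIDENCE_CUTOFF or d.hedged),
    # This only checks that a citation is present, not that it is the right law.
    Rule(18, "Do not flag an issue without citing the right law",
         lambda d: all(issue.get("citation") for issue in d.issues)),
]


def violations(draft: Draft) -> List[int]:
    """Return the numbers of every rule the draft breaks."""
    return [rule.number for rule in RULES if not rule.check(draft)]


# A made-up draft that breaks all three rules.
draft = Draft(
    jurisdiction=None,
    confidence=0.4,
    hedged=False,
    issues=[{"text": "Probation period too long", "citation": None}],
)
print(violations(draft))  # [4, 8, 18]
```

The useful property is that each new failure becomes one more entry in `RULES`, so the list doubles as a regression suite: every past mistake is rechecked on every future answer.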
The difference showed up the moment she ran the same Brazilian contract through all three: Claude, ChatGPT, and Papaya’s domain agent.

Same document, same question. The gap between the general models and the domain agent was not intelligence. It was knowing what to look for and what to ignore.
One model was not enough, so they built a second to check the first
Even with 22 rules, the agent still made mistakes. Fewer, but enough to matter when a wrong answer carries a six-figure liability. Rules fix what the model knows. They do not fix how the model behaves, and the most dangerous behavior in compliance is confident wrongness.
So Papaya built a second AI whose only job is to check the first one. The result is a three-stage pipeline that mirrors how a law firm actually works:
- The analyst. Applies all 22 compliance rules, jurisdiction-specific and fact-bound, the way a junior lawyer drafts.
- The reviewer. A separate AI with its own set of checks that catches overconfidence, false uncertainty, and jurisdiction mixing, the way a senior reviews the junior’s work.
- The finalizer. Merges the corrections, structures the output, and ships it, the way a partner signs off.
The architecture is worth naming because it generalizes: generation, adversarial review, then synthesis. A single model grading its own work inherits its own blind spots. A second model with a different job catches the meta-failures the first one cannot see in itself.
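The generation, review, synthesis shape described above can be sketched in a few lines. This is a minimal sketch under stated assumptions: a "model" is just a function from prompt text to response text, the prompt layout is invented, and the three stub functions stand in for real LLM calls. None of the names come from Papaya.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Assumption: any model is a plain function from prompt text to response text.
Model = Callable[[str], str]


@dataclass
class Result:
    draft: str
    corrections: List[str] = field(default_factory=list)
    final: str = ""


def run_pipeline(question: str, analyst: Model, reviewer: Model,
                 finalizer: Model) -> Result:
    # Stage 1: the analyst drafts an answer.
    draft = analyst(question)

    # Stage 2: a separate reviewer critiques the draft, one correction per line.
    review = reviewer(f"QUESTION:\n{question}\n\nDRAFT:\n{draft}")
    corrections = [line.strip() for line in review.splitlines() if line.strip()]

    # Stage 3: the finalizer merges the corrections into the shipped answer.
    final = finalizer(
        f"DRAFT:\n{draft}\n\nCORRECTIONS:\n" + "\n".join(corrections)
    )
    return Result(draft=draft, corrections=corrections, final=final)


# Stub models so the sketch runs without any API access.
def stub_analyst(prompt: str) -> str:
    return "Termination is definitely allowed."


def stub_reviewer(prompt: str) -> str:
    return "Overconfident: jurisdiction not established.\n"


def stub_finalizer(prompt: str) -> str:
    return "Guidance only. Jurisdiction must be confirmed before answering."


result = run_pipeline("Can we terminate this worker?",
                      stub_analyst, stub_reviewer, stub_finalizer)
print(result.corrections)
print(result.final)
```

Keeping the reviewer as a separate callable, with its own prompt and its own checks, is what lets it catch failures the analyst cannot see in its own output.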
Ben-Tzur’s framing for the whole thing: the rules made it accurate, the validation layer made it reliable, and the UX made it good. Three separate problems, and the model only touches the first.
Built with no engineers and no UX designers
Ben-Tzur built Papaya 1 without engineers and without UX designers. For anyone vibe coding in production, this is the part to study.
In the first phase, she ran design exploration in Claude, moved to Claude Code, then iterated between Claude Code, Figma’s MCP, and Papaya’s design system. Once it looked right, she connected it to Lovable for the live prototype and deploy, with Supabase handling authentication, database, and edge functions. In the second phase, Claude’s design tooling replaced Figma entirely, and the loop tightened to three steps: design in Claude, build in Claude Code, deploy through Lovable.
The strategic point is bigger than the tool list. When the cost of the build collapses to near zero, the build stops being a differentiator. Everything that makes the product defensible moves to the parts the tools cannot give you: the methodology, the domain knowledge, and the trust. Her words were blunt. The real work was the compliance methodology, not the code. A payroll compliance company validated the vibe coding thesis on a main stage, and they were not a dev tools vendor trying to sell it.
What Papaya 1 actually does
The product reflects the same discipline. Onboarding is two clicks: pick your role, pick your countries of interest, no forms and no blank questionnaire. The homepage is not an empty chat box either, and that is a deliberate trust decision. An empty box invites the vague, badly framed question that produces the confident wrong answer in the first place. Instead it opens with predefined workflows and prompts built for who you are, so an HR lead focused on Germany sees a different product than someone with another role or country list. Portfolio, countries, and alerts are all tailored. Every user sees a different Papaya 1.
Upload a contract and it runs the three stages, returning a clean split: in the Brazilian contract demo, one concern flagged and thirteen items confirmed compliant, ready to hand to the legal team. Ask it to compare Germany and Brazil probation rules and the same three agents produce a side-by-side built to help you decide, not just a wall of text.
Don’t launch to everyone, and say it’s guidance
On go-to-market, Fishel was direct. Papaya did not launch Papaya 1 to every client at once. They picked a small pool, five to ten clients they trust, the ones who feel like partners and who already come to them with compliance questions. Those clients ask the right questions and give honest feedback instead of being polite, and the product gets better from real questions, not from internal testing. Every response is clearly marked as guidance, not legal advice.
Her advice to anyone deploying a domain-specific agent: do not launch to all customers in one day. The hardest question is not whether the agent works. It works. It can work after one day or after four weeks. The hardest question is whether you trust it enough to put your company’s name on it. That takes far longer to answer than the build.
How Papaya measures trust
Papaya tracks three signals to know whether trust is actually growing, and each one tells them something a satisfaction survey cannot:
- Are users coming back? Returning the next day or next week with another question means the last answer earned a second one. Repeat usage is the first proof that the agent is replacing the 2am ChatGPT habit.
- Are they asking harder questions? Higher-stakes questions over time mean trust is building, and they double as a roadmap. Where clients start pushing is where Papaya knows to deepen the product next.
- Are they using less outside counsel? This is the one that matters most. If clients are still forwarding the agent’s answers to their in-house legal team, trust has not fully formed and Papaya is not yet reducing what they spend on outside counsel. Less outside counsel is the hard ROI proof, not a feeling.
When all three move in the right direction, they expand. Until then, they hold.
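The expand-or-hold decision above reduces to a simple comparison between two periods. The sketch below assumes the three signals have already been aggregated into numbers. The `PeriodStats` fields, the units, and the sample values are all hypothetical, chosen only to show the shape of the check.

```python
from dataclasses import dataclass


@dataclass
class PeriodStats:
    """Hypothetical per-period aggregates for the pilot clients."""
    returning_user_rate: float    # share of users who came back with another question
    high_stakes_share: float      # share of questions tagged as high-stakes
    outside_counsel_hours: float  # outside counsel hours the clients report


def should_expand(previous: PeriodStats, current: PeriodStats) -> bool:
    """Expand only when all three signals moved in the right direction."""
    return (
        current.returning_user_rate > previous.returning_user_rate
        and current.high_stakes_share > previous.high_stakes_share
        and current.outside_counsel_hours < previous.outside_counsel_hours
    )


# Made-up numbers for illustration.
first = PeriodStats(returning_user_rate=0.30, high_stakes_share=0.10,
                    outside_counsel_hours=40.0)
second = PeriodStats(returning_user_rate=0.45, high_stakes_share=0.18,
                     outside_counsel_hours=31.0)
stalled = PeriodStats(returning_user_rate=0.50, high_stakes_share=0.20,
                      outside_counsel_hours=35.0)

print(should_expand(first, second))    # True: all three improved
print(should_expand(second, stalled))  # False: outside counsel hours went up
```

Requiring all three conditions, rather than an average, matches the "until then, they hold" rule: one signal moving the wrong way is enough to pause the rollout.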
Build the guardrails before the features
Papaya’s most important safeguard is mechanical, not a line in a prompt. They built a kill switch. If accuracy drops below a threshold in any single country, they turn that country off until it is fixed. They pull the plug rather than patch as they go.
The principle Fishel pulled out of it: build the guardrails before you build the features. In a high-stakes domain, the failure mode is not a missing feature. It is a confident wrong answer that ships to a customer who acts on it. The kill switch exists because at some point the system will be wrong, and the only question is whether you find out before the customer does or after.
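A per-country kill switch of the kind described can be sketched as a small gate in front of the agent. The class name, the method names, and the 0.95 threshold are assumptions for illustration. The session did not state the real threshold or how accuracy is measured.

```python
from typing import Set

ACCURACY_THRESHOLD = 0.95  # assumed value; the real threshold was not stated


class KillSwitch:
    def __init__(self, threshold: float = ACCURACY_THRESHOLD) -> None:
        self.threshold = threshold
        self.disabled: Set[str] = set()

    def record_eval(self, country: str, accuracy: float) -> None:
        """Turn a country off when its measured accuracy falls below the threshold."""
        if accuracy < self.threshold:
            self.disabled.add(country)

    def restore(self, country: str) -> None:
        """Re-enable a country manually once the underlying problem is fixed."""
        self.disabled.discard(country)

    def is_enabled(self, country: str) -> bool:
        return country not in self.disabled


switch = KillSwitch()
switch.record_eval("DE", 0.97)
switch.record_eval("BR", 0.91)
print(switch.is_enabled("DE"), switch.is_enabled("BR"))  # True False
```

In this sketch a later passing eval does not switch a country back on by itself. Re-enabling goes through `restore`, which mirrors the idea of keeping a market off until someone confirms it is fixed.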
What you can copy and what you can’t
Fishel closed on what actually separates Papaya from anyone holding the same tools. Anyone can sign up for Claude, Lovable, and Supabase tomorrow, and most of the room already had. What cannot be copied is the years of experience across 160 countries, the thousands of contracts and thousands of terminations Papaya has already reviewed. They are the dinosaurs in the industry, and that is the point.
The AI is the engine. The domain knowledge is the fuel. You can copy the engine. You can’t copy the fuel.
What to steal from Papaya’s playbook
If you are building a domain-specific agent in 2026, here is the build order their session lays out:
- Assume your real competitor is a general model your customer already trusts. Build to be more trustworthy than ChatGPT on the one question that carries the liability, not to have more features than the last vendor.
- Turn every failure into a rule. Your eval suite is your moat. A competitor can copy your UI in a weekend and cannot copy 22 corrections you earned over months.
- Add a second model to check the first. Generation, adversarial review, synthesis. A model grading its own work keeps its own blind spots.
- Build the kill switch before the features. Decide in advance what accuracy threshold pulls a market offline, and pull it.
- Launch to five to ten clients who will tell you the truth. Real questions beat internal testing, and honest partners beat polite ones.
- Measure trust, not satisfaction. Repeat usage, harder questions, and less outside counsel are the signals that tell you when to expand.
Papaya had a working agent in four weeks. It took four months to trust it. Plan for the four months, because that is the part that actually wins the account.
