We Added Too Many Guardrails and Broke Our Own Agent, Our AI VP of Finance Found a Setting We’d Missed for 8 Years, and an Agent Is Now the One Renewing Your Software: The Agents #007

Amelia and I just shipped Episode #007 of The Agents. Same setup as always: three humans, 21+ agents, revenue growth that went from -19% to +47% YoY, and every week we get into what’s actually working, what’s breaking, and what you should do about it if you’re running agents in production.

This was the first episode after SaaStr AI Annual 2026, which was the best event we’ve ever done. 10,000 people building and sharing with agents. Building AI agents is the easy part now. Running them in production is where everyone breaks. We broke our own pitch deck app by over-guardrailing it. We accidentally turned our AI VP of Marketing into our AI VP of Finance. And we let one of our agents take over a vendor renewal, which the vendor did not love.

Here are the top 10 learnings from Episode #007.

1. We Added 14 Guardrails … and Strangled Our Own Agent

We run a free VC pitch deck grader at saastr.ai. Upload a deck, it scores you on growth, team, market, the works, and hands back a letter grade. A B+ or higher means you can probably raise. We’ve now run over 4,600 decks through it.

During Annual, while we were busy with 10,000 founders, it started handing out F after F after F. I assumed people were just uploading bad decks. They weren’t. Of the last 305 submissions, 88 failed outright. Of the 216 that completed, 53% got an F. The app was broken, and I didn’t break it. Nobody touched it.

Over months, every time it graded a deck wrong, usually pulling a projection or a TAM number instead of current ARR, I fixed it the same way: add another rule. If it does this, do that. Don’t extract from here. When in doubt, return no data. By the 14th exception, everything was an exception. The prompt got so paranoid it bounced almost any deck that mentioned both a current number and a projection, which is every pitch deck ever written. Ambiguity went to “no data,” which stored a zero, which collapsed every sub-score to zero, which produced an F.
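
That cascade is easy to reproduce. Here is a minimal sketch, not the actual grader: the sub-score names, the 0-100 scale, and the letter cutoffs are all hypothetical. It only shows how storing "no data" as a zero drags a deck to an F, while keeping it as a missing value does not.

```python
from typing import Dict, Optional

# Hypothetical letter cutoffs on a 0-100 scale (not the real grader's).
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def letter(score: float) -> str:
    """Map a numeric score to a letter grade using the cutoffs above."""
    for floor, grade in CUTOFFS:
        if score >= floor:
            return grade
    return "F"


def grade_zero_on_missing(sub_scores: Dict[str, Optional[float]]) -> str:
    """Failure mode: 'no data' is stored as 0 and averaged in like a real score."""
    values = [0.0 if v is None else v for v in sub_scores.values()]
    return letter(sum(values) / len(values))


def grade_skip_missing(sub_scores: Dict[str, Optional[float]]) -> str:
    """Alternative: score only what was extracted; flag the deck if nothing was."""
    known = [v for v in sub_scores.values() if v is not None]
    if not known:
        return "NEEDS_REVIEW"
    return letter(sum(known) / len(known))


# A deck where the over-cautious prompt returned "no data" for everything.
all_ambiguous = {"growth": None, "team": None, "market": None}
# A deck where only one field was ambiguous.
one_ambiguous = {"growth": None, "team": 85.0, "market": 90.0}

print(grade_zero_on_missing(all_ambiguous), grade_skip_missing(all_ambiguous))
print(grade_zero_on_missing(one_ambiguous), grade_skip_missing(one_ambiguous))
```

In this toy version, both decks get an F when missing values become zeros. Treating "no data" as missing gives the second deck a B and sends the first one to a human instead of failing it silently.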

We had to throw out almost all the rules and rebuild from scratch. The meta lesson is the one I didn’t have going in: over-guardrailing is as dangerous as under-guardrailing. Guardrails feel like free safety, but past a threshold they throttle the agent until it does nothing.

Two guardrails are a feature. Fourteen may be technical debt that breaks the product.

2. Same Spec, Different Platform, Different Agent

At Annual, Amelia rebuilt 10K (our AI VP of Marketing) live on Lovable, using the same spec we’d been running for months on Replit. Same data sources, same APIs, same instructions. We now have two of them. We call the Lovable one 10K Prime.

They behave … differently. The spec just says generate the top marketing ideas of the day. Replit’s 10K returns three. Lovable’s 10K Prime returns four. More interesting, the recommendations diverge. 10K Prime on Lovable is more aggressive: it told us to run paid LinkedIn ads targeting GTM leaders and launch a flash sale almost immediately, which the Replit version has never once suggested. The Replit one breaks each idea into motivation, channel, audience, and success criteria, like a B2C performance marketer. The Lovable one just says here’s the plan, go do it.

Two platforms, one spec, two different personalities and two different sets of advice. The spec is not the agent. The platform and the model underneath it shape behavior as much as the instructions you write.

Your spec is a starting point, not a guarantee. The platform gives the agent a personality you didn’t ask for.

3. Our AI VP of Finance Ended Up Inside Our AI VP of Marketing

Coming out of Annual we planned to build a separate AI VP of Finance. The goals were narrow: automate collections, and get real-time visibility into cash. Our finance lead went on vacation during the event, we got no updates for weeks, and Amelia couldn’t take it anymore, so she started building early.

The plan was a standalone finance agent with its own personality and tighter guardrails, since it touches more sensitive data than sponsor contacts. Instead, she built it inside 10K. That sounds wrong on paper. You would never make your VP of Marketing your VP of Finance. But 10K already had the context that matters: Salesforce, so it knows what deals closed. Stripe, so it knows daily ticket sales. Year-over-year data, projections, forecasting, historical financials we’d uploaded as a static sheet. A clean finance agent would have started blind. 10K started with everything.

This is the convergence everyone at Fin / Intercom, Gorgias, and Sierra has been describing. Once you can AI-ify support, sales, marketing, and finance, you don’t want four agents fighting over four copies of the truth. You want one agent with one rich context window. 10K is not really a VP of Marketing anymore. It’s closer to a VP of Revenue.

Context beats specialization. The agent with the most relevant data wins, even if its job title makes no sense.

4. The First Thing Finance Flagged Was a Toggle We’d Missed for 8 Years

The moment we hooked 10K up to Bill.com, it looked at our overdue invoices and asked why we were chasing collections manually when Bill.com has had auto-reminders built in the whole time. Reminders before the due date. Escalation after. One toggle.

We have been on Bill.com for 8 years. A human could have flipped that switch on day one. Nobody did, because nobody thought to ask. The agent asked immediately, because Claude already knows what Bill.com can do, and the second you connect it to your actual data it just says the obvious thing out loud.

That’s the real unlock when you connect an agent to third-party APIs that hold your data. Claude has already absorbed a huge amount of public knowledge about these tools, and the moment you point it at your specific account it surfaces things you could always have done but never noticed. It can feel like hiring the smartest operator you’ve ever worked with. What it’s really doing is pattern matching against everything it has seen, applied to your numbers.

Connect an agent to a tool you already pay for and it will find the features you’ve been leaving on the table for years.

5. Building the Agent Was Easy. Connecting the APIs Was the Work.

Building the finance side was easy. Connecting the tools was the work. Hooking up our stack ranged from trivial to still-not-done:

Brex was native and took 5 minutes.
Stripe was already wired in.
Bill.com was about 10 minutes, just generate an API key in the app.
QuickBooks was painful, because you have to spin up an Intuit developer account, answer security questions, and upload PINs just to get a key, classic legacy-platform friction.
Plaid still isn’t approved, days later, because it’s banking data and you wait on a review that’s probably KYC.

The old line between developers and everyone else is gone. Unless you’re selling something developer-only like Stripe or Twilio, you no longer have a developer audience and a separate business audience. They’re the same people now, all of us building.

Plaid’s agentic workflow assumes you’re charging people through the agent. We just want a read-only agent connected to our banks. The tooling isn’t quite built for that yet. We grade APIs on this exact dimension in our B2B API Report Card, and the ones that win are the ones a non-specialist can connect in minutes.

In the agentic era, your API is your product surface. If a founder can’t connect it in 10 minutes, you’re losing deals you’ll never hear about.

6. Humans On the Loop, Not In the Loop

When we asked 10K whether it’s an AI VP of Marketing, it said no, it’s a great marketing manager, it would never claim the VP title. Humble. Then it said something that matters more than it sounds: to get real autonomy, you don’t want humans in the loop, you want humans on the loop.

In the loop means a person signs off on every step, which is just the old manual process running slower. On the loop means the agent runs, makes the decisions inside its budget, and kicks exceptions up to you. You see the inputs and the outputs. You’re not the bottleneck on every action.

For finance we’ll have to draw that line carefully. Anything that can lose money permanently, like wiring cash to the wrong account, gets over-guardrailed and stays on a short leash. Anything recoverable, like a wrong invoice or a duplicate reminder, can run more freely, because the cost of a mistake is a headache, not a hole in the bank account.
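
One way to picture that line is as a small routing rule. This is a sketch under assumptions, not how 10K is actually configured: the `Action` fields, the `oversight_for` function, and the dollar budget are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Oversight(Enum):
    IN_THE_LOOP = "human approves before the agent acts"
    ON_THE_LOOP = "agent acts, human reviews inputs, outputs, and exceptions"


@dataclass(frozen=True)
class Action:
    name: str
    reversible: bool    # can a mistake be undone after the fact?
    amount_usd: float   # money moved or invoiced by this action


def oversight_for(action: Action, budget_usd: float) -> Oversight:
    """Pick an autonomy level per action instead of one global setting."""
    # Anything that can lose money permanently always waits for sign-off.
    if not action.reversible:
        return Oversight.IN_THE_LOOP
    # Recoverable but outside the agent's budget: kick it up as an exception.
    if action.amount_usd > budget_usd:
        return Oversight.IN_THE_LOOP
    # Recoverable and inside the budget: the agent just runs.
    return Oversight.ON_THE_LOOP


BUDGET_USD = 10_000.0  # hypothetical per-action budget

wire = Action("wire cash to a vendor", reversible=False, amount_usd=2_500.0)
reminder = Action("send a payment reminder", reversible=True, amount_usd=0.0)
big_invoice = Action("issue a large invoice", reversible=True, amount_usd=50_000.0)

for action in (wire, reminder, big_invoice):
    print(f"{action.name}: {oversight_for(action, BUDGET_USD).name}")
```

The point of the sketch is that reversibility gets checked first. A small wire still waits for a human, while a duplicate reminder never does.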

Decide per workflow: in the loop for the irreversible, on the loop for the recoverable. Don’t apply one autonomy setting to everything.

7. Losing Your FDE Is a Bigger Vendor Risk Than Price

We had a great forward deployed engineer at one of our platforms. Proactive, fast, would update the agent config for us before we even asked. Then they went on family leave, and the vendor backfilled them with a junior CSM who’d been in the role under a year. We hit something we couldn’t do ourselves, asked for help, waited three days, and got back: I can’t help you with that, you may have to wait until your engineer returns.

That answer was honest, and it was also a churn event. If we hadn’t escalated to the CEO and gotten it sorted before Annual, our NPS on that tool would have been zero. The product is good. It’s also replaceable. If the person who actually makes the product work for you disappears and the replacement can’t fill the gap, the product stops mattering.

The trap is that everyone is overconfident about their FDE bench right now, the same way companies a generation ago thought their CSMs were better than they were. Hiring someone with the title doesn’t mean they can support a customer. And the title itself is noise: a Salesforce exec ribbed us at Annual that we don’t even have a real FDE, we have an SE. Correct, on paper. It doesn’t matter. The only definition that counts is who can deploy and maintain this for you. Stop arguing about whether they’re an FDE, an SE, or a field engineer.

Lose the one person who makes a vendor work for you, and the renewal is suddenly in play.

8. LLM Portability Just Killed Switching Costs

The flip side of the FDE risk is that leaving is now easy, which changes the whole negotiation. A Databricks co-founder told us at Annual they can do migrations in weeks now because LLMs handle the translation work. We’re living it: Salesforce migrated us off Marketo in weeks. Eighteen months ago HubSpot quoted us a full year for the same kind of move.

So the math on multi-year contracts has flipped. ICONIQ’s 2026 data shows agentic deals skewing heavily to annual or shorter, and the reason is rational. Nobody believes today’s best product will still be the best product in 10 months, because this stuff is months old. CROs at fast-growing companies are stressed precisely because they know every deal is up for review in 8 to 10 months no matter how well it’s going. I’d almost rather do month-to-month, because if I get a great FDE and then lose them, I want out without a penalty.

The practical move: if you don’t love your vendor, call a competitor and ask them to migrate you in a week at no switching cost, using an LLM. More and more of them can. If the new one is better, leave.

Multi-year lock-in was a switching-cost tax. LLM-powered migration is repealing it. Sign annual and make vendors earn the renewal.


9. Cloudflare and Snowflake Just Told You Which Roles Go First

Matthew Prince at Cloudflare ran one of the first mass layoffs at a company that’s crushing its numbers, cutting 20% to get the team he needs for the AI era. The role he called out as not needed: sales ops. A big team modeling how next quarter might land when a well-trained agent does it in real time.

Other CEOs are saying it, just more carefully. Denise, the CMO of Snowflake, made the same point a different way at SaaStr AI 2026: the dashboard is dead. She didn’t mean dashboards are useless. She meant she no longer needs a marketing ops or sales ops layer to produce them. She built her own agent, logs in, sees her numbers, and stops fighting with anyone about whose number is right. When the CMO of Snowflake decides it’s worth her own time to do this herself, that tells you where the measurement-layer roles are headed.

We’ve never had a true RevOps hire and never will, now that 10K and QBee handle it. The people in these roles mostly won’t get dramatic layoffs. They’ll just quietly not be hired, and slowly be replaced by agents nobody announces.

Agents come for the measurement layer first, before the builders or the sellers.

10. Two of Our SaaStr Fund Portfolio Companies Already Blew Their Token Budget. While Ours Costs $257 a Month.

Uber said it blew through its annual token budget early. I assumed that was an old-school company story until two SaaStr Fund portfolio companies told me in board meetings the same week that they’d already burned their full-year token budget. One wants $5M more this year, just in tokens.

Then look at us. 10K and QBee together cost $257 a month to run. Even if QBee takes over all of finance, maybe $300. Fully loaded with our own time, call it three to four thousand a month for three AI VPs. The ROI is so high the cost is a rounding error.

So the market is bifurcating on a single number: revenue per employee. We’re doing roughly $5M in revenue per employee, so tokens are obviously worth it and the budget is effectively endless. A classic B2B company doing $200K in revenue per employee will find $4K a month per person in token spend genuinely stressful. Both Lovable’s Elena and Snowflake’s Denise described endless token budgets, and both added the same two words: for now. The CFOs are coming. The companies that win that conversation will be the ones already seeing the spend show up directly in revenue.
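
The ratio is simple enough to check by hand. This is a back-of-envelope sketch using the figures above, with one assumption added for comparison: it applies the same $4K a month per person to both companies, which is not what we actually spend. The function name is made up for illustration.

```python
def token_share_of_revenue(monthly_tokens_per_employee: float,
                           annual_revenue_per_employee: float) -> float:
    """Annualized token spend per employee as a fraction of revenue per employee."""
    return (monthly_tokens_per_employee * 12) / annual_revenue_per_employee


# Assumption: the same $4K/month per person at both companies, to isolate the ratio.
classic_b2b = token_share_of_revenue(4_000, 200_000)      # $200K revenue per employee
high_leverage = token_share_of_revenue(4_000, 5_000_000)  # $5M revenue per employee

print(f"Classic B2B:   {classic_b2b:.2%} of revenue per employee")
print(f"High leverage: {high_leverage:.2%} of revenue per employee")
```

At $200K in revenue per employee, $48K a year in tokens is 24% of what that person brings in. At $5M it is under 1%. Same spend, completely different CFO conversation.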

Token cost is meaningless in isolation. Judge it against revenue per employee, because that’s the ratio your CFO is about to judge it against.

Five more that didn’t make the top 10

Our event website became our best email marketer, and we never trained it. Annie, the agent running SaaStr AI Annual’s site, started drafting our emails because Amelia was short on time, and after three weeks of doing it together it had learned how SaaStr writes. Post-event it’s writing better warm outbound than our actual AI SDRs, purely because it has every attendee, speaker, sponsor, and session recording in context. Rich context did the work that training was supposed to do.
Build outbound, buy inbound. Our inbound agent on Qualified booked 614 qualified meetings on its own coming out of Annual. We’d never rebuild that, it’s too good out of the box. But we may build our own outbound, where our context beats a generic tool’s.
Our agent is now the one renewing your software. A vendor sent a renewal, and I handed it to 10K, which uses the product more than I do. 10K disagreed with the proposal, wrote a list of API changes it wanted as a condition of renewal, and told them to drop seat-based pricing for one headless API user. The humans on both sides got prickly that an agent was the decision-maker. It was also asking better renewal questions than 99% of humans would.
Most renewals still depend on the customer not knowing what to ask. The uncomfortable truth 10K exposed: a lot of B2B renewal revenue assumes the buyer won’t push on pricing or architecture. Agents ask every question. That model is on borrowed time.
Headless was the number one topic at Annual, and pricing hasn’t caught up. 10K pulled sessions from Atlassian, Salesforce, and Vercel all talking about headless everything, then pointed out the vendor renewing us was still on per-seat pricing. The gap between where buyers are going and where vendor pricing sits is now a churn risk.

What this week actually taught us

Building an agent is now a weekend. But running one in production is a daily job.
The pitch deck app proved you can break a working agent by being too careful.
The finance build proved that context, not specialization, decides what an agent is good at.
The renewal proved the buyer in your pipeline might already be an agent that asks harder questions than any human you’ve sold to.
None of this is set-and-forget, and anyone telling you it is hasn’t run agents past month one.

That’s Episode #007. Same three humans, same 21+ agents, and an agentic finance team that didn’t exist two weeks ago. See you on the next one.