Jim Palmer, Chief AI Officer, Dialpad came to SaaStr AI Annual + AI Summit to share the top AI mistakes he made building AI Agents at scale for $350m+ ARR Dialpad.

Jim is the Chief AI Officer at Dialpad with over a decade of AI experience. He founded TalkIQ, one of the early pioneers in AI and conversational intelligence, which was acquired by Dialpad over seven years ago. Since then, he’s been leading AI investments and building production AI systems at scale.


Top 5 Learnings

1. Data Governance is the Foundation of Everything Data governance isn’t just a checkbox—it’s the perpetual investment that determines whether your AI succeeds or fails. You need to know what data you have, how it flows, what’s opt-in versus opt-out, what’s private versus public, and how you’re handling anonymization and PII filtration. This is non-negotiable whether you’re building in-house or using third-party LLMs.

2. Red Teaming is Your Cheat Code Red teaming—adversarial testing of your AI systems—helps you discover what your AI can’t do (and as a bonus, what it can do). This continuous process reveals attack vectors, identifies where guardrails are needed, finds unanswered questions, and prevents harmful content or hallucinations before they reach customers. It’s not just security theater; it’s how you build a training dataset and improve systematically.

3. Don’t Overestimate Third-Party AI Power Just because a trillion-parameter model exists doesn’t mean you need it. GenAI can solve classifier problems, but a traditional classifier might be equally or more accurate for many use cases—at a fraction of the cost. Understand when to use generative AI and when not to. Plan and manage cost, throughput, and especially risk.

4. Real-Time vs. Deferred vs. Batch Matters More Than You Think The timing requirements of your use cases dramatically affect pricing, customer experience, and scaling. What needs to be real-time? What can be deferred? What can run in batch? Each has a different cost and scaling profile, especially when leveraging third parties. Getting this wrong impacts both your bottom line and your customer experience.

5. Start Small with Training Investments—RAG is Your On-Ramp You don’t need to build a fully pre-trained model on day one. Start with retrieval augmented generation (RAG), which leverages third parties while building your own data assets. Move up the ladder: RAG → fine-tuning for specific use cases → intermediate continued pre-training → full pre-training for high-level generalization. The data you build is yours—data is king, data is gold.


The Data Governance Reality

Here’s what most people miss: data governance isn’t a one-time project. It’s a perpetual investment that evolves constantly. Whether you have an in-house AI team or you’re using third parties (or multiple third parties), you always need a data governance story.

The simplified flow covers critical questions most teams ignore:

  • Do you actually know what data you have?
  • Do you understand the flow of that data?
  • What can you use and what can you not use legally and ethically?
  • What’s your opt-in and opt-out strategy?
  • How are you handling anonymization and PII filtration?

If you’re using third-party LLMs, you need to understand their data processing agreements. What protections do you have in place? What are you potentially giving away?

This matters even more as we move into the Model Context Protocol (MCP) and agent-to-agent (A2A) communication era. When AI systems start talking to other AI systems, understanding what data you’re sharing and what those systems will do with it becomes exponentially more important.

The Red Teaming Advantage

Red teaming is the major cheat code Dialpad has been using to understand AI limitations while simultaneously discovering capabilities. This isn’t just Cold War-style adversarial testing—it’s a systematic approach to building better AI.

The process can be manual or use open-source tools. It can start as simple as a shared spreadsheet. The key is thinking like a CISO: What are my attack vectors? Where are the vulnerabilities?

But red teaming does more than find security holes:

  • It reveals where and why AI fails on specific tasks
  • It helps you determine what guardrails you need (custom, not just downloaded)
  • It identifies unanswered questions that you can mine for training data
  • It catches hallucinations and harmful content before customers see them
  • It improves deflection rates for chatbots and digital experiences

As the HubSpot CEO discussed in another session, unanswered questions are gold. Red teaming helps you harvest and mine that data systematically, feeding it back into your AI experience regardless of whether you’re building in-house or using third parties.

Dialpad does this continuously—not as a one-off exercise. The net win has been significant.

The Cost and Use Case Reality Check

One of the biggest mistakes teams make is not understanding when to use generative AI versus other approaches. GenAI can solve many problems brilliantly, including classification problems. But do you really need a trillion-parameter model to classify something when a traditional classifier could be almost as accurate—or even more accurate in some cases—at a fraction of the cost?

The calculus changes dramatically based on your use case timing requirements:

Real-time: Highest cost, most impact on customer experience, requires careful scaling consideration. This is where you need to be most selective about using expensive models.

Deferred: Middle ground. You can optimize costs while still maintaining reasonable customer experience.

Batch: Lowest cost per operation, no immediate customer impact, but can’t solve real-time problems.

Each has a different cost and scaling profile, especially when leveraging third-party APIs versus in-house infrastructure. Getting this mapping wrong affects both your P&L and your customer satisfaction scores.

The Training Investment Ladder

Here’s the practical path for training investments that works whether you’re a startup or an enterprise:

Level 1: RAG (Retrieval Augmented Generation) Start here. It’s not free, but it’s the best starting point. RAG leverages third-party LLMs while building your own data assets. You’re still using someone else’s model, but the data you’re curating is yours. This helps you manage and maintain data while telling your governance story.

Level 2: Fine-tuning Once you understand your data through RAG and red teaming, fine-tune for specific use cases. This increases accuracy for particular scenarios based on customer feedback and the data you’ve collected.

Level 3: Intermediate Continued Pre-training As you scale, you can do continued pre-training on your domain-specific data. This requires deeper investment in data governance but delivers higher accuracy across more use cases.

Level 4: Full Pre-training The holy grail—fully pre-trained models that achieve high-level generalization. This requires significant upfront investment and cost, but if you’ve climbed the ladder properly, you’ll know whether it’s worth it based on ROI.

The beauty of this approach: even if you never build your own models, the data work you do for RAG sets you up for success. The data is yours. You own it. And if you later decide to fine-tune or pre-train, you have the foundation ready.

The Continuous Improvement Mandate

Here’s what separates successful AI investments from failed ones: continuous learning and measurement.

Don’t just add an observability layer or raw telemetry (though yes, do that too). Continuously test your systems. Bring humans into the loop. Bring other LLMs into the loop for evaluation—don’t be afraid of using LLM-as-a-judge approaches.

Your customers are already telling you where your AI works and where it doesn’t. Are you listening? Are you systematically capturing that feedback? Are you feeding it back into your training data?

And critically: measure ROI. As multiple speakers at SaaStr have emphasized, we need to finally get to that pay-to-value equation. AI for AI’s sake doesn’t cut it. Show the business impact.

The Agent-to-Agent Future

As we move into MCP and agent-to-agent communication, data governance becomes even more critical. When your AI systems start sharing information with other AI systems, you need rock-solid answers to:

  • What information are you sharing?
  • What will those other AI systems do with that data?
  • What are your attack vectors?
  • How do you maintain control and compliance?

The teams that have invested in data governance from day one will have a massive advantage. Those that haven’t will be scrambling to retrofit governance onto systems that were never designed for it.


Top 4 Mistakes Jim Made (and You Can Avoid)

1. Not Investing in Data Governance Early Enough Looking back at the TalkIQ days and early Dialpad AI work, data governance should have been day-one priority, not something bolted on later. The cost of retrofitting governance onto existing systems is exponentially higher than building it in from the start. If Jim could do it over, he’d have the full data flow mapped—what’s available, what’s opt-in/opt-out, PII filtration, anonymization—before writing a single line of AI code.

2. Overestimating What Third-Party Models Could Do Out of the Box In the early days, there was too much faith that third-party LLMs would “just work” for specialized use cases. The reality: you almost always need some level of customization, whether that’s RAG, fine-tuning, or full training. The mistake was not planning for that investment upfront and assuming plug-and-play would be sufficient for production-grade accuracy.

3. Not Implementing Red Teaming from Day One Red teaming should have been part of the development process from the beginning, not something added after initial deployment. Waiting meant discovering problems in production that could have been caught earlier. The continuous red teaming approach Dialpad uses now should have been the standard from the start—it would have saved countless hours of firefighting and customer escalations.

4. Underestimating the Importance of Use Case Timing Classification Not properly categorizing use cases into real-time, deferred, and batch from the architecture phase led to over-engineering some features and under-engineering others. This affected both cost structure and customer experience. A clearer framework upfront for which features truly needed real-time processing versus which could be deferred or batched would have resulted in better resource allocation and more predictable scaling costs.

Related Posts

Pin It on Pinterest

Share This