How to Evaluate AI Vendors Without Getting Bullshitted

You’ve been in the market for an AI solution for about three weeks. You’ve sat through nine demos, received fourteen pitch decks, and you still can’t tell which vendors are real and which ones are running a glorified if-then statement behind a nice dashboard.

This is AI vendor evaluation in 2025, where every software company is an “AI company” and the word “proprietary” has lost all meaning.

Over the years I’ve put together a set of questions and a scoring framework that separates the real from the performative. None of it is complicated. It just requires asking the questions vendors hope you won’t.

Vendor Pitch Bingo

“Proprietary AI” usually means they fine-tuned an open-source model or wrapped an API. Nothing wrong with that, but it’s not proprietary in any meaningful sense. Ask what’s actually proprietary.

“State-of-the-art” means they got good results on a benchmark dataset. Benchmarks and production are different universes. Ask about production performance, not benchmark scores.

“Enterprise-grade” means they have SSO and a pricing tier above $50K/year. It tells you nothing about the actual quality of the AI.

“90% accuracy” is meaningless without context. 90% accuracy on what? Measured how? On what data? In a lab or in production? And what happens in the 10% of cases where it’s wrong? If those errors cost you $50K each, 90% accuracy might be terrible.
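The arithmetic behind that last point is worth running for your own numbers. A back-of-envelope sketch with entirely hypothetical volume and cost figures (plug in your own):

```python
# Expected cost of model errors. All three inputs below are made-up
# illustrations, not figures from any real vendor or deployment.
monthly_volume = 10_000      # decisions the model makes per month
accuracy = 0.90              # vendor-claimed accuracy
cost_per_error = 50_000      # cost of a single wrong decision, in dollars

errors_per_month = monthly_volume * (1 - accuracy)
expected_monthly_cost = errors_per_month * cost_per_error
print(f"{errors_per_month:.0f} errors/month -> ${expected_monthly_cost:,.0f} exposure")
```

At those assumed numbers, "90% accurate" means a thousand wrong calls a month. The exercise matters more than the exact inputs: it forces the conversation from accuracy percentages to error cost.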

“AI-powered” is the most diluted phrase in enterprise software. A rules engine with a machine learning model somewhere in the pipeline is “AI-powered.” So is a lookup table with GPT generating the summary text. The phrase tells you nothing. Ignore it entirely.

If a vendor’s pitch deck is heavy on these phrases and light on specifics, that’s your first signal.

Seven Questions That Expose Weak Vendors

These aren’t trick questions. They’re basic due diligence that any vendor with a real product should be able to answer clearly.

1. “Walk me through the model architecture.”

You don’t need to understand every technical detail. But the vendor should be able to explain, in plain language, what kind of model they’re using, why they chose it, and what its limitations are.

A good answer sounds like: “We use a transformer-based model fine-tuned on domain-specific data for classification, with a rules engine handling edge cases the model isn’t confident about.”

A bad answer sounds like: “Our proprietary AI engine uses advanced machine learning across multiple modalities.” That’s not an answer. That’s a press release.

2. “What happens when the model is wrong?”

Every model is wrong sometimes. The question is whether the vendor has thought about it.

Good vendors have fallback logic, confidence thresholds, and human-in-the-loop workflows for uncertain predictions. They can tell you their error rate in production and what types of errors are most common.

Bad vendors act surprised by the question or say “our accuracy is very high.” That’s not a plan. That’s hope.
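One concrete shape the fallback logic can take is confidence-threshold routing: auto-accept predictions above a cutoff, queue the rest for human review. A minimal sketch — the threshold value and the prediction format are assumptions for illustration, not any vendor's API:

```python
# Route a model prediction based on its confidence score.
# The 0.85 cutoff is a made-up starting point; in practice you would
# tune it against your tolerance for error cost vs. review workload.
CONFIDENCE_THRESHOLD = 0.85

def route(prediction: str, confidence: float) -> str:
    """Return 'auto' for confident predictions, 'human_review' otherwise."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return "auto"
    return "human_review"

for label, conf in [("invoice", 0.97), ("contract", 0.62), ("receipt", 0.88)]:
    print(label, "->", route(label, conf))
```

A vendor with a real plan can describe something like this, plus where the threshold came from and who staffs the review queue.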

3. “How do you handle data drift?”

The real world changes. Customer behavior shifts. Market conditions evolve. The data your model was trained on six months ago may not represent what’s happening today.

Good vendors monitor for drift, have retraining pipelines, and can show you how model performance is tracked over time. They should be able to explain when and how the model gets updated.

Bad vendors don’t mention drift at all, or they say they “continuously improve” without being specific about what that means.
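If you want to sanity-check a vendor's drift story yourself, Population Stability Index (PSI) is one common yardstick: compare the distribution of a key input feature at training time against recent production data. A sketch of the standard formula, not any particular vendor's method:

```python
import math

def psi(expected_counts, actual_counts):
    """Population Stability Index across matching histogram bins.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift,
    > 0.25 significant drift worth investigating."""
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_pct = max(e / e_total, 1e-6)  # floor avoids log(0) on empty bins
        a_pct = max(a / a_total, 1e-6)
        score += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return score

# Hypothetical bin counts: training-time vs. last month's distribution.
print(psi([400, 300, 200, 100], [380, 310, 190, 120]))
```

A vendor with real drift monitoring should be able to name whatever metric they use and show it on a dashboard, not just assert that they "continuously improve."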

4. “What’s your retraining cadence, and who pays for it?”

This is a cost question disguised as a technical question. Models degrade over time. Retraining is a recurring cost. Some vendors include it. Some charge extra. Some don’t retrain at all and hope you don’t notice the degradation.

Get this in writing. How often is the model retrained? On whose data? At whose expense? What triggers a retrain?

5. “Show me production metrics, not the demo.”

Demos are curated. They show the happy path with clean data on a use case the vendor picked because it makes them look good.

Ask for production metrics from existing customers. Average accuracy in real deployment. Latency. Error rates. If the vendor can’t share these (even anonymized), that’s a problem. Either they don’t have production customers, or the numbers aren’t good enough to share. Both are concerning.

6. “Can this run on our infrastructure?”

Some vendors only work as a SaaS product, which means your data goes to their servers. For some companies and use cases, that’s fine. For others, it’s a non-starter due to data residency, compliance, or security requirements.

Understand the deployment model before you get deep into evaluation. If you need on-premises or private cloud deployment and the vendor can’t do it, no amount of impressive demos matters.

7. “What happens when we leave?”

Data portability and lock-in are the questions nobody asks until it’s too late.

If you leave this vendor in two years, what do you take with you? Your data, obviously, but what about the models trained on your data? The integrations built on their platform? The workflows that depend on their APIs?

Some vendors make leaving easy because they’re confident in their product. Others make it hard on purpose. Know which one you’re dealing with before you sign.

Red Flags

These aren’t disqualifying on their own, but if you see three or more, walk away.

They require an NDA before showing a demo. What are they hiding? A demo is a sales tool. If they won’t show you the product without legal protection, the product probably doesn’t do what the pitch deck says it does.

“Our AI does everything.” No it doesn’t. A vendor that claims their single product handles document processing, predictive analytics, natural language understanding, computer vision, and anomaly detection is either lying or doing all of those things poorly.

No reference customers. If a vendor can’t connect you with a single customer who’s been in production for more than six months, you’re the pilot. That might be fine if you’re getting a significant discount, but go in with your eyes open.

Pricing requires “a conversation.” Opaque pricing usually means the vendor sizes you up before quoting. You end up paying based on what they think you can afford rather than what the product costs. Companies with confidence in their product publish their pricing.

They can’t explain their AI in plain language. Complexity is not the same as sophistication. If a vendor can’t explain what their system does without resorting to jargon and buzzwords, either they don’t understand their own product or they’re deliberately obscuring it.

The “AI” is mostly a UI on top of an existing API. This isn’t necessarily bad, but you should know what you’re paying for. If the core AI is OpenAI’s API and the vendor’s value-add is the interface and integration, the pricing should reflect that.

Green Flags

These are what good looks like.

Transparent about limitations. Good vendors tell you what their product doesn’t do well. They’ll say “we’re not great at X type of data” or “this works best when you have at least Y records.” This honesty signals that they understand their own product deeply.

Clear pricing with defined scope. You know what you’re paying, what you’re getting, and what costs extra. No surprises.

Production metrics available. They can show you real numbers from real deployments. Not cherry-picked success stories, but average performance across their customer base.

Paid pilot before full contract. A vendor that offers a paid pilot (not free, paid at a reasonable rate) is confident their product works and wants to prove it before you commit to an annual contract. Free pilots often mean the vendor is desperate. Paid pilots mean they value their own product.

Technical team accessible during sales process. If you can talk to their engineers during evaluation, not just their sales team, that’s a strong signal. It means they’re not afraid of technical scrutiny.

The Scoring Framework

Use this rubric to compare vendors consistently. Rate each dimension 1-5. Total the scores. It won’t make the decision for you, but it makes comparison much easier.

Technology Maturity
  • 1 (Poor): Prototype or beta stage; no production deployments
  • 3 (Adequate): Production product with some customers; still evolving
  • 5 (Strong): Battle-tested in production; clear roadmap; proven architecture

Transparency
  • 1 (Poor): Can’t explain model; hides behind buzzwords
  • 3 (Adequate): Explains approach at high level; some detail available
  • 5 (Strong): Full architecture discussion; publishes methodology; shares limitations openly

Production Readiness
  • 1 (Poor): No monitoring; no drift detection; manual processes
  • 3 (Adequate): Basic monitoring; some automation; metrics available on request
  • 5 (Strong): Full MLOps pipeline; automated retraining; real-time monitoring dashboards

Data Governance
  • 1 (Poor): Unclear data handling; no compliance documentation
  • 3 (Adequate): Basic data policies; SOC 2 in progress; standard encryption
  • 5 (Strong): SOC 2 Type II; clear data residency; your data is yours; easy export

Pricing Clarity
  • 1 (Poor): “Let’s have a conversation”; no published pricing
  • 3 (Adequate): Published tiers but lots of add-ons and fine print
  • 5 (Strong): Clear pricing; defined scope; no surprises; pilot pricing available

Integration Fit
  • 1 (Poor): API only; no docs; requires custom development
  • 3 (Adequate): Standard APIs; documentation exists; some connectors available
  • 5 (Strong): Native integrations with your stack; SDKs; developer support; sandbox environment

Vendor Stability
  • 1 (Poor): Pre-revenue startup; unclear funding situation
  • 3 (Adequate): Funded with customers; growing but early
  • 5 (Strong): Profitable or well-funded; strong customer base; been in business 3+ years

Scoring guide:

  • 30-35: Strong candidate. Move to pilot.
  • 22-29: Promising but has gaps. Dig deeper on low-scoring dimensions.
  • 15-21: Proceed with caution. Significant risk in multiple areas.
  • Below 15: Walk away.

Score each vendor independently, then compare. If your top two vendors are within 3 points of each other, the deciding factor should be integration fit and whether you actually like working with them, because you’ll be working with them for a while.
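The rubric is simple enough to keep in a spreadsheet, but if you're scoring several vendors it can help to encode the thresholds once. A sketch with illustrative scores (the example vendor below is made up):

```python
# The seven rubric dimensions, scored 1-5 each (max total 35).
DIMENSIONS = [
    "Technology Maturity", "Transparency", "Production Readiness",
    "Data Governance", "Pricing Clarity", "Integration Fit", "Vendor Stability",
]

def verdict(scores: dict) -> tuple:
    """Total a vendor's scores and map the total to the scoring guide."""
    total = sum(scores[d] for d in DIMENSIONS)
    if total >= 30:
        return total, "Strong candidate. Move to pilot."
    if total >= 22:
        return total, "Promising but has gaps. Dig deeper on low scores."
    if total >= 15:
        return total, "Proceed with caution."
    return total, "Walk away."

# Hypothetical vendor: solid tech and pricing, weaker governance/stability.
vendor_a = dict(zip(DIMENSIONS, [4, 5, 4, 3, 5, 4, 3]))
print(verdict(vendor_a))
```

Keeping per-dimension scores (rather than just the total) also makes the "within 3 points" comparison easier: you can see exactly where two close vendors differ.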

When the Answer Is “Build It Yourself”

Sometimes you go through this entire evaluation process and none of the vendors are right. Maybe the use case is too specific. Maybe the vendors are all early-stage and you can’t afford to bet on one. Maybe the pricing is out of line with the value.

Building internally is a valid choice if you have the engineering team to support it, but be honest about the total cost. Building a model is 20% of the work. The other 80% is data engineering, monitoring, retraining, edge case handling, and ongoing maintenance. “We’ll build it ourselves” sounds straightforward until month six when the model is drifting and nobody owns it.

Build-vs-buy isn’t just an upfront cost question. It’s a question about who carries the operational burden for the next three years. If AI is going to be a core competitive advantage for your business, building makes sense. If it’s a supporting function, buying is almost always cheaper over three years.
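A rough way to frame the three-year comparison. Every figure below is a placeholder assumption; substitute your own estimates for subscription, integration, build, and operating costs:

```python
# Three-year build-vs-buy comparison. All dollar amounts are invented
# placeholders for illustration only.
YEARS = 3

buy = {
    "annual_subscription": 80_000,
    "integration_one_time": 40_000,
}
build = {
    "initial_build": 250_000,   # the "20%": getting a model working
    "annual_ops": 180_000,      # the "80%": data eng, monitoring, retraining
}

buy_total = buy["integration_one_time"] + buy["annual_subscription"] * YEARS
build_total = build["initial_build"] + build["annual_ops"] * YEARS
print(f"3-year buy: ${buy_total:,}  3-year build: ${build_total:,}")
```

The point of the exercise is the `annual_ops` line: whoever builds carries it every year, which is why building usually only pencils out when the AI is core to the business.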

The Evaluation Process

Here’s the process I’d follow:

  1. Define requirements before talking to vendors. Know what problem you’re solving, what data you have, what success looks like, and what your constraints are (budget, timeline, security, deployment model).

  2. Create a shortlist of 3-5 vendors. More than five is a waste of time. If you can’t narrow to five based on requirements alone, your requirements aren’t specific enough.

  3. Use the seven questions in initial calls. This usually eliminates one or two.

  4. Score remaining vendors using the rubric above.

  5. Run a paid pilot with your top choice. Use real data, real users, real conditions. Two to four weeks is enough to know.

  6. Negotiate the contract with production metrics from the pilot as leverage. If the pilot showed 85% accuracy and the vendor promised 95%, that’s a conversation to have before signing.

Don’t rush this process. A bad AI vendor costs you more than no AI vendor. The subscription fees are the least of it. You lose engineering time on integration, organizational trust when it doesn’t work, and six months you’ll never get back.


I’m Eric Brown. I work with mid-market companies as a fractional CTO and AI strategy consultant in Denver, Colorado. If you’re sorting through vendor pitches and want an outside perspective, let’s talk.

Ready to talk about your technology strategy?

I help mid-market companies make better AI and technology decisions. 30-minute call, no pitch — just an honest conversation about where you are.

Schedule a Call