
AI Accuracy & How We Test

Kwiro is tested against 794 scenarios across 33 categories before every release. Here's how we test, what we test, and how to read our public results.

We Test Like Real Shoppers

Most AI tools claim accuracy. Few publish what they tested or how. Kwiro publishes everything — our test scenarios, our methodology, our pass rates, and the actual AI responses for hundreds of scenarios. See the live results at /proof.

Pass rate by category
887 scenarios tested, 97% overall pass rate, ~0% hallucination on product data

  • Product discovery (40 scenarios): 100%
  • Hallucination (44 scenarios): 100%
  • Multi-turn (60 scenarios): 98%
  • Cross-sell / upsell (40 scenarios): 100%
  • Multilingual, 58 langs (105 scenarios): 99%
  • Personas + objections (60 scenarios): 98%
  • Gift-giving + intent (50 scenarios): 100%
  • Adversarial / safety (45 scenarios): 100%
  • Edge cases: typos, slang (35 scenarios): 97%
  • Memory integration (12 scenarios): 100%
Diagram: Pass-rate breakdown across the 10 biggest test categories. ~0% hallucination on product data — verified across hundreds of adversarial scenarios designed specifically to provoke false claims.

What "Accuracy" Means for a Sales AI

A sales AI can be wrong in two distinct ways:

1. Hallucination

The AI invents something that doesn't exist — a product that's not in your catalog, a price you don't charge, a certification you don't have, a return policy you didn't define. This is the worst kind of wrong because it can cost you a sale or get you in legal trouble.

Kwiro's hallucination rate: ~0% on product data across all 887 scenarios.

We achieve this by restricting the AI to the products retrieved from your catalog for each query. The AI's prompt contains 7 explicit anti-hallucination rules. Every response is also parsed for product mentions, and each mention is verified against the actual catalog before the response is rendered.
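
As a rough illustration of that final verification step, here is a minimal Python sketch. It is not Kwiro's actual implementation: the `[[product:NAME]]` markup, the `Product` type, and the function names are all hypothetical, chosen only to show the idea of checking every product mention against the catalog before a response is shown.

```python
import re
from dataclasses import dataclass

# Hypothetical markup (an assumption for this sketch): the model wraps
# every product it mentions as [[product:NAME]].
MENTION_PATTERN = re.compile(r"\[\[product:(.+?)\]\]")


@dataclass(frozen=True)
class Product:
    sku: str
    name: str


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences still match."""
    return " ".join(name.lower().split())


def unverified_mentions(response: str, catalog: list[Product]) -> list[str]:
    """Return every mentioned product name that is absent from the catalog."""
    known = {normalize(p.name) for p in catalog}
    mentions = MENTION_PATTERN.findall(response)
    return [m for m in mentions if normalize(m) not in known]


def safe_to_render(response: str, catalog: list[Product]) -> bool:
    """A response is rendered only if every product mention is verified."""
    return not unverified_mentions(response, catalog)
```

In a sketch like this, a response that fails `safe_to_render` would be regenerated or replaced with a fallback rather than shown to the shopper. Which of those a real system does is a design choice this page does not specify.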

2. Bad recommendation

The AI recommends the wrong product for the question: the right product exists in the catalog, but the AI picked a different one. This is a softer failure, annoying but recoverable.

Kwiro's pass rate: 98–99% across all scenarios. The remaining 1–2% are typically edge cases, such as ambiguous queries where multiple answers would be defensible.

How the Test Suite Works

We maintain 794 scenarios across 33 categories. Each scenario specifies:

  • A test store with real products (10 stores total, 319 products)
  • A shopper question (in 57 different languages)
  • Assertions about what a correct answer must contain or must not contain
  • For multi-turn scenarios: a sequence of questions with per-turn assertions

The runner starts each scenario in a clean conversation, sends the question (or, for multi-turn scenarios, each question in sequence), captures the AI's response, and grades it against the assertions. A scenario passes only if all of its assertions pass.
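
The scenario structure and grading rule described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Kwiro's runner: the `Turn` and `Scenario` types, the `new_conversation` factory, and the choice of case-insensitive substring matching for assertions are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    question: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


@dataclass
class Scenario:
    store: str        # which test store the scenario runs against
    language: str     # language of the shopper question
    turns: list[Turn]  # one turn for simple scenarios, several for multi-turn


def turn_passes(turn: Turn, response: str) -> bool:
    """Assumed grading rule: case-insensitive substring checks."""
    text = response.lower()
    has_required = all(s.lower() in text for s in turn.must_contain)
    has_forbidden = any(s.lower() in text for s in turn.must_not_contain)
    return has_required and not has_forbidden


def run_scenario(
    scenario: Scenario,
    new_conversation: Callable[[], Callable[[str], str]],
) -> bool:
    """Run one scenario in a fresh conversation; pass only if every turn passes."""
    ask = new_conversation()  # clean conversation with no prior context
    # Ask every turn in order, even after a failure, so each later turn
    # is still graded against the full conversation history.
    results = [turn_passes(turn, ask(turn.question)) for turn in scenario.turns]
    return all(results)
```

Here `new_conversation` stands in for whatever opens a chat session with the AI. It returns a function that sends one question and returns the reply, which keeps the grading logic independent of how the AI is actually called.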

We also score each response on a 0–100 quality scale (relevance, completeness, conciseness, engagement, safety) for trend analysis.

What We Test For

The 33 categories span the range of realistic shopper interactions, including:

  • Product discovery, search, comparison
  • Hallucination resistance (asking about things that don't exist)
  • Multilingual (English variants, Spanish, Bengali, Arabic, Japanese, and 50+ more)
  • Adversarial / safety (rude shoppers, prompt injection attempts, off-topic questions)
  • Edge cases (typos, slang, contradictions, negation, conditional)
  • Customer personas (budget-conscious, luxury, expert, first-timer, etc.)
  • Objection handling (price, trust, hesitation, doubt)
  • Gift-giving with constraints
  • Out-of-stock handling (recover with alternatives)
  • Sale awareness (recognize sales, apply correctly)
  • Tone matching (formal, casual, frustrated, excited)
  • Policy awareness (knows what to say no to)
  • Cross-sell and upsell timing
  • Multi-turn conversations (60 different conversation threads)

See the Proof page for the full list with per-category pass rates.

How Often We Run

Every code change to the AI pipeline triggers a full or category-specific run. We re-run the full suite at least weekly. The Proof page shows the date of the most recent full-suite run.

What We Cover Well

To be specific about scope:

  • 10 distinct verticals. Every scenario runs against 10 test stores: automotive (AutoWorks), sportswear (FitGear Pro), groceries (FreshHarvest), skincare (GlowLab), homewares (HomeHaven), pets (PawPantry), footwear (SoleComfort), jewelry (SparkleCo), electronics (TechZone), and baby & kids (TinySteps). Together they hold 319 realistic products.
  • The patterns that matter across every industry. 64 hallucination-resistance scenarios (44 product, 20 policy) test whether the AI fabricates certifications, return policies, warranties, "doctor recommended" claims, etc. — the universal failure modes that span verticals.
  • Live-fire testing. A real WooCommerce store at FitGear Pro runs against the actual plugin code, real catalog sync, and the real chat widget. Synthetic test scenarios are one signal; the live store is a complementary one.
  • Model-update validation. When the underlying AI model is updated, we re-run the full suite. Pass rate has held steady through 4 updates so far.

Where This Falls Short

To stay honest:

  • Regulated verticals. Medical claims, supplements with health claims, alcohol with age-gating, prescription products. We don't have vertical-specific test suites for these yet. If you sell in one, configure Forbidden Topics tightly until we do.
  • Unanticipated shopper behavior. By definition, we can't pre-test patterns we haven't imagined. New edge cases still surface through your Knowledge Gaps page so you can fix each one once.
  • Future model versions. We can't test the next model update until it ships. We re-run the full suite on the day of each update; in the meantime, our test history shows the pass rate held steady through every prior update.

Where to See It Yourself

Open the Proof page →

Real test data, dated, with example responses you can read in full. No marketing claims — just the numbers.

Updated April 2026
