AI Accuracy & How We Test
Kwiro is tested against 794 scenarios across 33 categories before every release. Here's how we test, what we test, and how to read our public results.
We Test Like Real Shoppers
Most AI tools claim accuracy. Few publish what they tested or how. Kwiro publishes everything — our test scenarios, our methodology, our pass rates, and the actual AI responses for hundreds of scenarios. See the live results at /proof.
What "Accuracy" Means for a Sales AI
A sales AI can be wrong in two distinct ways:
1. Hallucination
The AI invents something that doesn't exist — a product that's not in your catalog, a price you don't charge, a certification you don't have, a return policy you didn't define. This is the worst kind of wrong because it can cost you a sale or get you in legal trouble.
Kwiro's hallucination rate: ~0% on product data across all 794 scenarios.
We achieve this by restricting the AI to products retrieved from your catalog. The AI's prompt includes 7 explicit anti-hallucination rules, and every response is parsed for product mentions, each of which is verified against the actual catalog before the response is rendered.
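The post-response catalog check described above could look roughly like the sketch below. It is an illustration under assumptions, not Kwiro's implementation: the `Product` type, the `[[product:ID]]` mention format, and the `unverified_mentions` helper are all hypothetical, since this page does not say how product mentions are marked up or parsed.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    product_id: str
    name: str


# Hypothetical mention format: the response tags products as [[product:ID]].
MENTION_PATTERN = re.compile(r"\[\[product:([A-Za-z0-9_-]+)\]\]")


def unverified_mentions(response: str, catalog: list[Product]) -> list[str]:
    """Return mentioned product IDs that do not exist in the catalog."""
    known_ids = {p.product_id for p in catalog}
    mentioned = MENTION_PATTERN.findall(response)
    return [pid for pid in mentioned if pid not in known_ids]


# Illustrative data only; these products are made up for the example.
catalog = [Product("sku-101", "Trail Runner"), Product("sku-102", "Court Classic")]
response = "Try the [[product:sku-101]] or the [[product:sku-999]]."
missing = unverified_mentions(response, catalog)
```

In a sketch like this, a non-empty `missing` list would mean the response should be blocked or regenerated rather than shown to the shopper.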
2. Bad recommendation
The AI returns the wrong product for the question: the right product exists, but the AI picked a different one. This is a softer failure, annoying but recoverable.
Kwiro's pass rate: 98–99% across all scenarios. The 1–2% of scenarios that fail are typically edge cases, such as ambiguous queries where multiple answers would be defensible.
How the Test Suite Works
We maintain 794 scenarios across 33 categories. Each scenario specifies:
- A test store with real products (10 stores total, 319 products)
- A shopper question (in 57 different languages)
- Assertions about what a correct answer must contain or must not contain
- For multi-turn scenarios: a sequence of questions with per-turn assertions
The runner starts a fresh conversation for each scenario, sends the question (or sequence of questions), captures the AI's response, and grades it against the assertions. A scenario passes only if all of its assertions pass.
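As a rough sketch of that grading rule, the code below treats each assertion as a case-insensitive substring check. That is an assumption made for illustration; this page does not describe the assertion format, and the `Turn`, `turn_passes`, and `scenario_passes` names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    question: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def turn_passes(turn: Turn, response: str) -> bool:
    """Check one response against one turn's assertions (substring match assumed)."""
    text = response.lower()
    has_required = all(s.lower() in text for s in turn.must_contain)
    has_forbidden = any(s.lower() in text for s in turn.must_not_contain)
    return has_required and not has_forbidden


def scenario_passes(turns: list[Turn], responses: list[str]) -> bool:
    """A scenario passes only if every assertion on every turn passes."""
    if len(turns) != len(responses):
        return False
    return all(turn_passes(t, r) for t, r in zip(turns, responses))


# Illustrative two-turn scenario; the questions and answers are made up.
turns = [
    Turn("Do you sell waterproof running shoes?", must_contain=["Trail Runner"]),
    Turn("Is it doctor recommended?", must_not_contain=["doctor recommended"]),
]
responses = [
    "Yes, the Trail Runner is waterproof.",
    "I don't have any information about medical endorsements for it.",
]
result = scenario_passes(turns, responses)
```

A single-turn scenario is simply the one-element case of the same structure.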
We also score each response on a 0–100 quality scale (relevance, completeness, conciseness, engagement, safety) for trend analysis.
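This page does not say how the five dimensions are combined into one number. Purely as an illustration, the sketch below assumes each dimension is scored 0 to 100 and weighted equally; the `quality_score` function and the equal weighting are assumptions, not the documented method.

```python
DIMENSIONS = ("relevance", "completeness", "conciseness", "engagement", "safety")


def quality_score(sub_scores: dict[str, float]) -> float:
    """Average five 0-100 sub-scores into one 0-100 score (equal weights assumed)."""
    missing = [d for d in DIMENSIONS if d not in sub_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= sub_scores[d] <= 100:
            raise ValueError(f"{d} must be between 0 and 100")
    return sum(sub_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Illustrative sub-scores only, not real results.
score = quality_score(
    {"relevance": 90, "completeness": 80, "conciseness": 70, "engagement": 60, "safety": 100}
)
```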
What We Test For
The 33 categories are meant to cover the interactions a real shopper is likely to have, including:
- Product discovery, search, comparison
- Hallucination resistance (asking about things that don't exist)
- Multilingual (English variants, Spanish, Bengali, Arabic, Japanese, and 50+ more)
- Adversarial / safety (rude shoppers, prompt injection attempts, off-topic questions)
- Edge cases (typos, slang, contradictions, negation, conditional)
- Customer personas (budget-conscious, luxury, expert, first-timer, etc.)
- Objection handling (price, trust, hesitation, doubt)
- Gift-giving with constraints
- Out-of-stock handling (recover with alternatives)
- Sale awareness (recognize sales, apply correctly)
- Tone matching (formal, casual, frustrated, excited)
- Policy awareness (knows what to say no to)
- Cross-sell and upsell timing
- Multi-turn conversations (60 different conversation threads)
See the Proof page for the full list with per-category pass rates.
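For readers who want to interpret those per-category figures, a pass rate is the share of a category's scenarios that passed. A minimal sketch, assuming results arrive as hypothetical `(category, passed)` pairs:

```python
from collections import defaultdict


def pass_rates_by_category(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Map each category to the percentage of its scenarios that passed."""
    totals: dict[str, int] = defaultdict(int)
    passed: dict[str, int] = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        if ok:
            passed[category] += 1
    return {c: 100.0 * passed[c] / totals[c] for c in totals}


# Illustrative results only, not real test data.
results = [
    ("multilingual", True),
    ("multilingual", True),
    ("multilingual", False),
    ("out-of-stock", True),
]
rates = pass_rates_by_category(results)
```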
How Often We Run
Every code change to the AI pipeline triggers a full or category-specific run. We re-run the full suite at least weekly. The Proof page shows the date of the most recent full-suite run.
What We Cover Well
To be specific about scope:
- 10 distinct verticals. Scenarios run against 10 test stores: automotive (AutoWorks), sportswear (FitGear Pro), groceries (FreshHarvest), skincare (GlowLab), homewares (HomeHaven), pets (PawPantry), footwear (SoleComfort), jewelry (SparkleCo), electronics (TechZone), and baby & kids (TinySteps). 319 realistic products in total.
- The patterns that matter across every industry. 64 hallucination-resistance scenarios (44 product, 20 policy) test whether the AI fabricates certifications, return policies, warranties, "doctor recommended" claims, etc. — the universal failure modes that span verticals.
- Live-fire testing. A real WooCommerce store at FitGear Pro runs against the actual plugin code, real catalog sync, real chat widget. Synthetic test scenarios are one signal; the live store is a complementary one.
- Model-update validation. When the underlying AI model is updated, we re-run the full suite. Pass rate has held steady through 4 updates so far.
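One way to check that a pass rate has "held steady" across a model update is to compare per-category pass rates before and after and flag any drop beyond a tolerance. The sketch below is a hypothetical illustration: the `regressed_categories` helper and the 1-point default tolerance are assumptions, not Kwiro's documented criteria.

```python
def regressed_categories(
    before: dict[str, float], after: dict[str, float], tolerance: float = 1.0
) -> list[str]:
    """List categories whose pass rate dropped by more than `tolerance` points."""
    return sorted(
        c for c in before
        if c in after and before[c] - after[c] > tolerance
    )


# Illustrative pass rates only, not real results.
before = {"hallucination": 100.0, "multilingual": 98.5, "tone": 99.0}
after = {"hallucination": 100.0, "multilingual": 95.0, "tone": 98.5}
flagged = regressed_categories(before, after)
```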
Where This Falls Short
To stay honest:
- Regulated verticals. Medical claims, supplements with health claims, alcohol with age-gating, prescription products. We don't have vertical-specific test suites for these yet. If you sell in one, configure Forbidden Topics tightly until we do.
- Unanticipated shopper behavior. By definition, we can't pre-test patterns we haven't imagined. New edge cases still surface through your Knowledge Gaps page, so you can fix each one once.
- Future model versions. We can't test the next model update until it ships. We re-run the full suite on the day of each update; so far, our test history shows the pass rate has held steady through every prior update.
Where to See It Yourself
The Proof page at /proof has real, dated test data, with example responses you can read in full. No marketing claims, just the numbers.