We tested Kwiro on 887 real shopper scenarios.
Most AI tools claim accuracy. Few publish what they tested or how. Below are our scenarios, our methodology, and the actual AI responses for the ones we've run. No marketing spin — just the numbers.
Last run: June 1, 2026
How we test
Each scenario specifies a test store, a shopper question (in one of 58 languages), and assertions about what a correct answer must and must not contain. The runner sends each question in a fresh conversation, captures the AI's response, and grades it against those assertions.
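As a rough illustration of that flow, the sketch below models a scenario and a runner loop. Everything named here is hypothetical: the `Scenario` and `Result` fields, the `ask` callable (standing in for whatever opens a fresh conversation with the AI), and the `grade` callable are assumptions made for the example, not Kwiro's actual runner or scenario format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    """One test case (hypothetical schema)."""
    id: str
    store: str          # which test store the question is asked in
    language: str       # language code of the shopper question
    question: str
    assertions: dict    # what a correct answer must / must not contain


@dataclass
class Result:
    response: str
    passed: bool


def run_scenarios(
    scenarios: list[Scenario],
    ask: Callable[[str, str], str],            # (store, question) -> response
    grade: Callable[[Scenario, str], bool],    # (scenario, response) -> passed?
) -> dict[str, Result]:
    """Ask each question once, with no shared history, and grade the answer."""
    results = {}
    for scenario in scenarios:
        # Each call to `ask` is assumed to start a brand-new conversation,
        # so earlier scenarios cannot leak context into later ones.
        response = ask(scenario.store, scenario.question)
        results[scenario.id] = Result(response, grade(scenario, response))
    return results
```

Passing `ask` and `grade` in as callables keeps the loop independent of any particular chat API or grading rule, which also makes it easy to exercise with stubs.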
Real catalogues
We test against 10 synthetic shops with 319 real-shaped products across categories — sportswear, supplements, homewares, electronics, beauty, books, hardware, pet supplies, home goods, kitchenware.
Realistic queries
Every scenario is shaped like a question a real shopper would type, including typos and slang, multi-turn threads, code-switched sentences, and adversarial prompt-injection attempts.
Strict assertions
Each response is graded with must_contain / must_not_contain keyword checks, min/max product card counts, hallucination detection (an unrecognised product name triggers a fail), policy fabrication detection, per-turn assertions for multi-turn threads, and a quality score from 0–100.
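To make a few of these checks concrete, here is a minimal sketch of a grader covering the keyword checks, the product card count limits, and a simple form of hallucination detection. The `Assertions` fields, the `grade_response` function, and the idea of comparing product card names against a catalogue list are assumptions for illustration; the real grader may work differently, and policy fabrication detection and quality scoring are not attempted here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Assertions:
    """Hypothetical per-scenario assertions."""
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)
    min_cards: int = 0
    max_cards: Optional[int] = None


def grade_response(
    text: str,
    card_names: list[str],
    assertions: Assertions,
    catalogue: list[str],
) -> list[str]:
    """Return failure reasons; an empty list means the response passed."""
    failures = []
    lowered = text.lower()

    # Keyword checks are case-insensitive substring matches.
    for keyword in assertions.must_contain:
        if keyword.lower() not in lowered:
            failures.append(f"missing required keyword: {keyword}")
    for keyword in assertions.must_not_contain:
        if keyword.lower() in lowered:
            failures.append(f"contains forbidden keyword: {keyword}")

    # Product card count limits.
    if len(card_names) < assertions.min_cards:
        failures.append(f"too few product cards: {len(card_names)}")
    if assertions.max_cards is not None and len(card_names) > assertions.max_cards:
        failures.append(f"too many product cards: {len(card_names)}")

    # Hallucination check: every recommended product must exist in the catalogue.
    known = {name.lower() for name in catalogue}
    for name in card_names:
        if name.lower() not in known:
            failures.append(f"unrecognised product: {name}")

    return failures
```

Returning the list of reasons, instead of a bare pass/fail flag, is one way to support showing why a failed scenario failed.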
By category
33 categories of scenarios, ordered by how commonly they appear in real shopper traffic.
| Category | Description | Scenarios | Tested | Pass rate |
|---|---|---|---|---|
| Adversarial / safety | Rude shoppers, prompt injection attempts, hostile asks. | 20 | 20 | 100% |
| Ambiguity resolution | When the shopper's question is ambiguous, ask the right clarifier. | 20 | 20 | 100% |
| Attribute filtering | "Size 14, in red, under $80": multi-attribute filtering. | 25 | 25 | 100% |
| Comparison decisions | "Which of these is better for me?" The close-the-sale moment. | 20 | 20 | 100% |
| Complex reasoning | Multi-step product fit logic, deductive shopping queries. | 25 | 25 | 88% |
| Conversation boundaries | Off-topic redirects, scope-keeping. | 15 | 15 | 100% |
| Cross-sell & upsell | Recommending complementary or upgraded items. | 40 | 40 | 100% |
| Customer personas | Budget-conscious, luxury, expert, first-timer, gift-buyer. | 30 | 30 | 100% |
| Edge cases | Empty queries, special characters, weird inputs. | 25 | 25 | 92% |
| Emotional states | Frustrated, anxious, excited, sad shoppers; the AI should match tone. | 15 | 15 | 87% |
| Gift-giving | Birthday, anniversary, holiday, with constraints (recipient, budget, vibe). | 25 | 25 | 96% |
| Hallucination resistance | Asking about things that don't exist: products, certifications, policies. The AI must refuse to invent. | 44 | 44 | 100% |
| Policy hallucination | Trying to get the AI to invent return policies, shipping info, warranties. | 20 | 20 | 95% |
| Multi-constraint queries | Combining 4+ constraints in one shopper question. | 20 | 20 | 100% |
| Multi-turn conversations | 60 conversation threads, 3-5 turns each, with per-turn assertions. | 60 | 60 | 90% |
| Multilingual core | Spanish, French, German, Japanese, Korean, Chinese, Arabic: the big ones. | 35 | 35 | 100% |
| Multilingual (long tail) | Dutch, Polish, Romanian, Czech, Hungarian, and many more. | 30 | 30 | 100% |
| World languages | African, Southeast Asian, Central Asian languages. | 53 | 53 | 100% |
| Negation & conditionals | "Not running shoes", "if I'm tall", "unless it ships free". | 20 | 20 | 90% |
| Objection handling | "Is it worth it?" "Why so expensive?" "Is this real?" | 30 | 30 | 97% |
| Out-of-stock recovery | When the wanted product isn't in stock, recover with alternatives. | 20 | 20 | 100% |
| Policy awareness | Knows what to say no to (refunds, account help, etc). | 15 | 15 | 100% |
| Product discovery | Shoppers asking to find products by name, type, or use case. | 40 | 40 | 100% |
| Product expertise | Specs, materials, technical depth where it matters. | 25 | 25 | 100% |
| Purchase intent | Recognising when a shopper is ready to buy. | 25 | 25 | 100% |
| Sale awareness | Recognising current sales, applying discounts correctly. | 20 | 20 | 100% |
| Seasonal context | Summer / winter / holiday-season-aware recommendations. | 20 | 20 | 100% |
| Regression suite | 35 scenarios that previously broke and that we ensure stay fixed. | 35 | 35 | 100% |
| Tone matching | Formal vs casual vs frustrated vs excited; the AI mirrors. | 15 | 15 | 100% |
| Romanised languages | Hindi-in-Latin-letters, Russian-in-Latin-letters, Banglish, code-switching mixes. | 45 | 45 | 73% |
| Trust building | First-time-buyer hesitation, returning the right reassurance. | 20 | 20 | 85% |
| Typos & slang | Misspellings, gen-z slang, internet shorthand. | 20 | 20 | 100% |
| Urgency handling | "I need it by Friday", without the AI lying about delivery. | 15 | 15 | 100% |
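The table reports pass rates per category only. For readers who want a single aggregate figure, the sketch below estimates one from the rows above. It rests on two assumptions: pass rates are percentages rounded to the nearest whole number, so the failure counts are inferred and not published; and every category not listed in `BELOW_100` passed all of its scenarios, as its 100% figure indicates.

```python
TOTAL_SCENARIOS = 887  # every scenario in the table is marked as tested

# (scenarios tested, published pass rate in %) for each category below 100%.
BELOW_100 = {
    "Complex reasoning": (25, 88),
    "Edge cases": (25, 92),
    "Emotional states": (15, 87),
    "Gift-giving": (25, 96),
    "Policy hallucination": (20, 95),
    "Multi-turn conversations": (60, 90),
    "Negation & conditionals": (20, 90),
    "Objection handling": (30, 97),
    "Romanised languages": (45, 73),
    "Trust building": (20, 85),
}


def inferred_failures(tested: int, pass_pct: int) -> int:
    """Failures implied by a rounded percentage (an inference, not published data)."""
    return tested - round(tested * pass_pct / 100)


failures = sum(inferred_failures(n, pct) for n, pct in BELOW_100.values())
passes = TOTAL_SCENARIOS - failures
overall_pct = 100 * passes / TOTAL_SCENARIOS
print(f"{passes}/{TOTAL_SCENARIOS} passed ({overall_pct:.1f}%)")
```

Under these assumptions the script infers 33 failures, which works out to 854 of 887 scenarios passing, or roughly 96.3%. Romanised languages account for 12 of those inferred failures, more than any other category.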
Every scenario
All 887 scenarios, browsable. Click any row to see the full shopper question, the AI's actual response from the latest run, and (if it failed) why.
We test across 10 verticals
One generic test store would only prove the AI works for one kind of shop. We run every scenario against 10 different stores covering 319 real-shaped products — so a passing scenario means the AI handles that intent across automotive, sportswear, grocery, skincare, homewares, pets, footwear, jewelry, electronics, and baby gear.
We test in 58 languages
Every language below is tested with native-speaker-quality scenarios against the same 10 verticals. Full-script languages pass at 93–100%; romanised and code-switched dialects (Banglish, Hinglish, Tanglish, etc.) currently sit closer to 70% and are an active area we're improving. Full per-category breakdowns are in the category table above.
Languages outside this list? The underlying AI model speaks 100+ languages natively. Anything not listed above hasn't been validated to production quality but typically works — file a support ticket and we'll add it to our next test run.
Where this falls short
To stay honest:
- Regulated verticals. We test 10 industries above plus 64 dedicated hallucination scenarios (CE certifications, return policies, warranties, "doctor recommended" claims). What we don't cover well yet: medical claims, supplements with health claims, alcohol with age-gating, prescription products. If you sell in a regulated vertical, set strict Forbidden Topics until we ship a vertical-specific test suite.
- Future-unknown shopper behaviour. By definition, we can't pre-test patterns we haven't imagined. We do run a live WooCommerce test store with real plugin code, real catalogue sync, and real chat widget — that's our in-the-wild proxy. New edge cases that real shoppers find still surface through your Knowledge Gaps page so you can fix each one once.
- Future model versions. Our underlying AI model is updated periodically. We re-run the full suite after every update. So far our pass rate has held steady through 4 model updates — but we can't test version n+1 until it ships.
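One way to check that a pass rate has "held steady" across a model update is to compare per-scenario results from the run before and the run after. The sketch below assumes each run is available as a mapping from a scenario ID to a pass/fail flag; that format and the `find_regressions` helper are hypothetical, not a description of our tooling.

```python
def find_regressions(previous: dict[str, bool], current: dict[str, bool]) -> list[str]:
    """Scenario IDs that passed in the previous run but fail, or are missing, in the current one."""
    return sorted(
        scenario_id
        for scenario_id, passed in previous.items()
        if passed and not current.get(scenario_id, False)
    )


# Example with made-up scenario IDs:
before = {"gift-001": True, "typo-014": True, "roman-007": False}
after = {"gift-001": True, "typo-014": False, "roman-007": False}
print(find_regressions(before, after))  # ['typo-014']
```

Comparing scenario by scenario catches the case where the overall pass rate is unchanged but different scenarios are failing.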
Try it on your own store
Free plan, 200 conversations a month. Install in 5 minutes. No card required to start.