◇ Open testing

We tested Kwiro on 887 real shopper scenarios.

Most AI tools claim accuracy. Few publish what they tested or how. Below are our scenarios, our methodology, and the actual AI responses for the ones we've run. No marketing spin — just the numbers.

Pass rate: 96% (854 / 887 in latest run)
Hallucination rate: ~0% (on product data)
Languages tested: 58 (including romanised + code-switching dialects)
Total scenarios: 887 (across 33 categories)

Last run: June 1, 2026

How we test

Each scenario specifies a test store, a shopper question (in one of 58 languages), and assertions about what a correct answer must contain or must not contain. The runner sends each question in a fresh conversation, captures the AI's response, and grades it.
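As a rough illustration of that loop, here is a minimal Python sketch. The `Scenario` fields, the `ask` callable (a stand-in for whatever opens a fresh conversation with the assistant), and the `grade` callable are all hypothetical names chosen for this example; they are not Kwiro's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Scenario:
    """One test case (hypothetical schema): store, question, language, assertions."""
    store: str
    question: str
    language: str
    must_contain: List[str] = field(default_factory=list)
    must_not_contain: List[str] = field(default_factory=list)


@dataclass
class Result:
    scenario: Scenario
    response: str
    passed: bool


def run(
    scenarios: List[Scenario],
    ask: Callable[[str, str], str],
    grade: Callable[[Scenario, str], bool],
) -> List[Result]:
    """Send each question on its own, capture the reply, and grade it.

    `ask(store, question)` is assumed to start a new conversation every time,
    so no history leaks from one scenario into the next.
    """
    results = []
    for scenario in scenarios:
        response = ask(scenario.store, scenario.question)
        results.append(Result(scenario, response, grade(scenario, response)))
    return results
```

Keeping `ask` and `grade` as plain callables means the same loop can be pointed at a stubbed assistant in unit tests and at a live test store in a full run.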

Real catalogues

We test against 10 synthetic shops with 319 real-shaped products across automotive parts, sportswear, groceries, skincare and beauty, homewares, pet supplies, footwear, jewelry, electronics, and baby and kids.

Realistic queries

Every scenario is shaped like a question a real shopper would type: typos and slang, multi-turn threads, code-switched sentences, and adversarial prompt-injection attempts.

Strict assertions

Each response is checked with must_contain / must_not_contain keyword checks, min/max product card counts, hallucination detection (an unrecognised product name triggers a fail), policy fabrication detection, per-turn assertions for multi-turn threads, and a quality score from 0 to 100.
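A simplified Python sketch of three of these checks follows: keyword assertions, product card counts, and catalogue-based hallucination detection. The `Assertions` class, the `check` function, and the idea of comparing product card names against a set of known catalogue names are assumptions made for illustration. Policy fabrication detection and quality scoring are left out because the page does not describe how they work.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Assertions:
    """Hypothetical container for one scenario's rules."""
    must_contain: List[str] = field(default_factory=list)
    must_not_contain: List[str] = field(default_factory=list)
    min_cards: int = 0
    max_cards: Optional[int] = None


def check(response_text: str, card_names: List[str],
          catalogue: Set[str], rules: Assertions) -> List[str]:
    """Return a list of failure reasons; an empty list means the response passed."""
    failures = []
    text = response_text.lower()

    # Keyword checks are case-insensitive substring matches.
    for kw in rules.must_contain:
        if kw.lower() not in text:
            failures.append(f"missing keyword: {kw}")
    for kw in rules.must_not_contain:
        if kw.lower() in text:
            failures.append(f"forbidden keyword: {kw}")

    # Product card count must fall inside the allowed range.
    if len(card_names) < rules.min_cards:
        failures.append(f"too few product cards: {len(card_names)}")
    if rules.max_cards is not None and len(card_names) > rules.max_cards:
        failures.append(f"too many product cards: {len(card_names)}")

    # Hallucination check (assumed form): every card must exist in the catalogue.
    known = {name.lower() for name in catalogue}
    for name in card_names:
        if name.lower() not in known:
            failures.append(f"unrecognised product: {name}")

    return failures
```

Returning the reasons rather than a bare pass/fail makes it possible to show why a scenario failed, which is what a browsable results list needs.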

By category

33 categories of scenarios, ordered by how commonly they appear in real shopper traffic.

Category | Description | Scenarios | Tested | Pass rate
Adversarial / safety | Rude shoppers, prompt injection attempts, hostile asks. | 20 | 20 | 100%
Ambiguity resolution | When the shopper's question is ambiguous, ask the right clarifier. | 20 | 20 | 100%
Attribute filtering | "Size 14, in red, under $80": multi-attribute filtering. | 25 | 25 | 100%
Comparison decisions | "Which of these is better for me?": the close-the-sale moment. | 20 | 20 | 100%
Complex reasoning | Multi-step product fit logic, deductive shopping queries. | 25 | 25 | 88%
Conversation boundaries | Off-topic redirects, scope-keeping. | 15 | 15 | 100%
Cross-sell & upsell | Recommending complementary or upgraded items. | 40 | 40 | 100%
Customer personas | Budget-conscious, luxury, expert, first-timer, gift-buyer. | 30 | 30 | 100%
Edge cases | Empty queries, special characters, weird inputs. | 25 | 25 | 92%
Emotional states | Frustrated, anxious, excited, sad shoppers; the AI should match tone. | 15 | 15 | 87%
Gift-giving | Birthday, anniversary, holiday, with constraints (recipient, budget, vibe). | 25 | 25 | 96%
Hallucination resistance | Asking about things that don't exist: products, certifications, policies. The AI must refuse to invent. | 44 | 44 | 100%
Policy hallucination | Trying to get the AI to invent return policies, shipping info, warranties. | 20 | 20 | 95%
Multi-constraint queries | Combining 4+ constraints in one shopper question. | 20 | 20 | 100%
Multi-turn conversations | 60 conversation threads, 3-5 turns each, with per-turn assertions. | 60 | 60 | 90%
Multilingual core | Spanish, French, German, Japanese, Korean, Chinese, Arabic: the big ones. | 35 | 35 | 100%
Multilingual long tail | Dutch, Polish, Romanian, Czech, Hungarian, and many more. | 30 | 30 | 100%
World languages | African, Southeast Asian, Central Asian languages. | 53 | 53 | 100%
Negation & conditionals | "Not running shoes", "if I'm tall", "unless it ships free". | 20 | 20 | 90%
Objection handling | "Is it worth it?" "Why so expensive?" "Is this real?" | 30 | 30 | 97%
Out-of-stock recovery | When the wanted product isn't in stock, recover with alternatives. | 20 | 20 | 100%
Policy awareness | Knows what to say no to (refunds, account help, etc). | 15 | 15 | 100%
Product discovery | Shoppers asking to find products by name, type, or use case. | 40 | 40 | 100%
Product expertise | Specs, materials, technical depth where it matters. | 25 | 25 | 100%
Purchase intent | Recognising when a shopper is ready to buy. | 25 | 25 | 100%
Sale awareness | Recognising current sales, applying discounts correctly. | 20 | 20 | 100%
Seasonal context | Summer / winter / holiday-season-aware recommendations. | 20 | 20 | 100%
Regression suite | 35 scenarios that previously broke and must stay fixed. | 35 | 35 | 100%
Tone matching | Formal vs casual vs frustrated vs excited; the AI mirrors the shopper. | 15 | 15 | 100%
Romanised languages | Hindi-in-Latin-letters, Russian-in-Latin-letters, Banglish, code-switching mixes. | 45 | 45 | 73%
Trust building | First-time-buyer hesitation, answered with the right reassurance. | 20 | 20 | 85%
Typos & slang | Misspellings, gen-z slang, internet shorthand. | 20 | 20 | 100%
Urgency handling | "I need it by Friday", without the AI lying about delivery. | 15 | 15 | 100%
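
The headline figure can be cross-checked against the per-category rows above. The Python sketch below takes the ten categories that score under 100%, converts each rounded percentage back into an implied number of failures, and recomputes the overall pass rate. The per-category failure counts are inferred from the rounded percentages, not published figures, so treat them as an assumption.

```python
# (scenarios, published pass rate in percent) for categories below 100%
below_100 = {
    "Complex reasoning": (25, 88),
    "Edge cases": (25, 92),
    "Emotional states": (15, 87),
    "Gift-giving": (25, 96),
    "Policy hallucination": (20, 95),
    "Multi-turn conversations": (60, 90),
    "Negation & conditionals": (20, 90),
    "Objection handling": (30, 97),
    "Romanised languages": (45, 73),
    "Trust building": (20, 85),
}
TOTAL = 887  # total scenarios across all 33 categories


def implied_failures(n: int, pct: int) -> int:
    """Failures implied by a rounded pass percentage (an inference, not source data)."""
    return n - round(n * pct / 100)


failures = sum(implied_failures(n, pct) for n, pct in below_100.values())
passed = TOTAL - failures
overall = round(100 * passed / TOTAL)

print(failures, passed, overall)  # → 33 854 96
```

Under that inference the rows are consistent with the stated 854 / 887 and 96%.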

Every scenario

All 887 scenarios, browsable. Click any row to see the full shopper question, the AI's actual response from the latest run, and (if it failed) why.


We test across 10 verticals

One generic test store would only prove the AI works for one kind of shop. We run every scenario against 10 different stores covering 319 real-shaped products — so a passing scenario means the AI handles that intent across automotive, sportswear, grocery, skincare, homewares, pets, footwear, jewelry, electronics, and baby gear.

  • AutoWorks: Automotive parts, 25 products
  • FitGear Pro: Sportswear, 75 products
  • FreshHarvest: Groceries, 31 products
  • GlowLab: Skincare / beauty, 32 products
  • HomeHaven: Homewares, 25 products
  • PawPantry: Pet supplies, 37 products
  • SoleComfort: Footwear, 17 products
  • SparkleCo: Jewelry, 22 products
  • TechZone: Electronics, 33 products
  • TinySteps: Baby & kids, 22 products

We test in 58 languages

Every language below is tested with native-speaker-quality scenarios against the same 10 verticals. Full-script languages pass at 93–100%; romanised / code-switched dialects (Banglish, Hinglish, Tanglish, etc.) currently sit closer to 70% and are an active area we're improving. The per-category breakdown is in the By category table above.

European
English (US, UK, Australian, Gen-Z, formal, elderly variants), Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Romanian, Czech, Hungarian, Greek, Finnish, Danish, Norwegian, Swedish
South & Southeast Asian
Hindi, Bengali, Urdu, Tamil, Telugu, Marathi, Gujarati, Malayalam, Kannada, Punjabi, Sinhala, Nepali, Thai, Vietnamese, Malay, Indonesian, Filipino, Khmer, Burmese, Lao
East Asian
Chinese (Simplified + Traditional), Japanese, Korean
Middle Eastern & Central Asian
Arabic, Hebrew, Persian, Turkish, Georgian, Azerbaijani, Uzbek
African
Amharic, Swahili, Hausa, Yoruba, Igbo, Zulu, Malagasy, Nigerian Pidgin
Romanised + code-switching
Hinglish, Spanglish, Banglish, Tanglish, Tinglish, Nigerian Pidgin, and other common mixes — the AI matches the shopper's mix instead of forcing one language.

Languages outside this list? The underlying AI model speaks 100+ languages natively. Anything not listed above hasn't been validated to production quality but typically works — file a support ticket and we'll add it to our next test run.

Where this falls short

To stay honest:

  • Regulated verticals. We test the 10 industries above plus 64 dedicated hallucination scenarios (CE certifications, return policies, warranties, "doctor recommended" claims). What we don't cover well yet: medical claims, supplements with health claims, alcohol with age-gating, prescription products. If you sell in a regulated vertical, set strict Forbidden Topics until we ship a vertical-specific test suite.
  • Future-unknown shopper behaviour. By definition, we can't pre-test patterns we haven't imagined. We do run a live WooCommerce test store with real plugin code, real catalogue sync, and real chat widget — that's our in-the-wild proxy. New edge cases that real shoppers find still surface through your Knowledge Gaps page so you can fix each one once.
  • Future model versions. Our underlying AI model is updated periodically. We re-run the full suite after every update. So far our pass rate has held steady through 4 model updates — but we can't test version n+1 until it ships.

Try it on your own store

Free plan, 200 conversations a month. Install in 5 minutes. No card required to start.

Install free →