◇ Open testing

We tested Kwiro on 887 real shopper scenarios.

Most AI tools claim accuracy. Few publish what they tested or how. Below are our scenarios, our methodology, and the actual AI responses for the ones we've run. No marketing spin — just the numbers.

96%

Pass rate

854 / 887 in latest run

~0%

Hallucination rate

On product data

58

Languages tested

Including romanised + code-switching dialects

887

Total scenarios

Across 33 categories

Last run: June 1, 2026

How we test

Each scenario specifies a test store, a shopper question (in one of 58 languages), and assertions about what a correct answer must contain or must not contain. The runner sends each question in a fresh conversation, captures the AI's response, and grades it against those assertions.
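As a rough illustration, one scenario and its keyword grading could be sketched as below. The `Scenario` fields and the `grade` function are hypothetical names chosen for this example, and the sample question and response are made up; this is not Kwiro's actual runner.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One test case: a store, a shopper question, and its assertions."""
    store: str
    question: str
    language: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def grade(scenario: Scenario, response: str) -> list[str]:
    """Return failure reasons; an empty list means the scenario passed."""
    text = response.casefold()
    failures = []
    for keyword in scenario.must_contain:
        if keyword.casefold() not in text:
            failures.append(f"missing required text: {keyword!r}")
    for keyword in scenario.must_not_contain:
        if keyword.casefold() in text:
            failures.append(f"contains forbidden text: {keyword!r}")
    return failures


# Hypothetical scenario and response, for illustration only.
scenario = Scenario(
    store="FitGear Pro",
    question="do u have runing shoes under $80?",
    language="en",
    must_contain=["running"],
    must_not_contain=["lifetime warranty"],
)
print(grade(scenario, "Yes! Here are three running shoes under $80."))
```

Returning a list of reasons rather than a bare pass/fail flag makes it possible to show why a scenario failed.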

Real catalogues

We test against 10 synthetic shops with 319 real-shaped products across automotive parts, sportswear, groceries, skincare and beauty, homewares, pet supplies, footwear, jewelry, electronics, and baby and kids.

Realistic queries

Every scenario is shaped like a question a real shopper would type — typos and slang included, multi-turn threads, code-switched sentences, adversarial prompt-injection attempts.

Strict assertions

must_contain / must_not_contain keyword checks, min/max product card counts, hallucination detection (unrecognised product names trigger fail), policy fabrication detection, multi-turn per-turn assertions, quality scoring 0–100.
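A minimal sketch of two of these checks, the product card count bounds and the unrecognised-product-name rule, might look like this. The function name, its parameters, and the sample catalogue are assumptions made for illustration; the real checks may work differently.

```python
def check_product_cards(card_names, catalogue, min_cards=0, max_cards=None):
    """Return failure reasons for card-count bounds and unrecognised product names."""
    failures = []
    count = len(card_names)
    if count < min_cards:
        failures.append(f"expected at least {min_cards} product cards, got {count}")
    if max_cards is not None and count > max_cards:
        failures.append(f"expected at most {max_cards} product cards, got {count}")
    # Any recommended name missing from the store catalogue counts as a hallucination.
    known = {name.casefold() for name in catalogue}
    for name in card_names:
        if name.casefold() not in known:
            failures.append(f"unrecognised product: {name!r}")
    return failures


# Hypothetical catalogue and response cards, for illustration only.
catalogue = ["Trail Runner 2", "Court Classic", "Everyday Sock 3-Pack"]
cards = ["Trail Runner 2", "Cloud Sprint X"]
print(check_product_cards(cards, catalogue, min_cards=1, max_cards=3))
```

Here the second card is not in the catalogue, so the check reports it as an unrecognised product and the scenario would fail.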

By category

33 categories of scenarios, ordered by how commonly they appear in real shopper traffic.

Category	Description	Scenarios	Tested	Pass rate
Adversarial / safety	Rude shoppers, prompt injection attempts, hostile asks.	20	20	100%
Ambiguity resolution	When the shopper's question is ambiguous, ask the right clarifier.	20	20	100%
Attribute filtering	Multi-attribute filtering, e.g. "Size 14, in red, under $80".	25	25	100%
Comparison decisions	"Which of these is better for me?" The close-the-sale moment.	20	20	100%
Complex reasoning	Multi-step product fit logic, deductive shopping queries.	25	25	88%
Conversation boundaries	Off-topic redirects, scope-keeping.	15	15	100%
Cross-sell & upsell	Recommending complementary or upgraded items.	40	40	100%
Customer personas	Budget-conscious, luxury, expert, first-timer, gift-buyer.	30	30	100%
Edge cases	Empty queries, special characters, weird inputs.	25	25	92%
Emotional states	Frustrated, anxious, excited, sad shoppers; the AI should match tone.	15	15	87%
Gift-giving	Birthday, anniversary, holiday, with constraints (recipient, budget, vibe).	25	25	96%
Hallucination resistance	Asking about things that don't exist (products, certifications, policies). The AI must refuse to invent them.	44	44	100%
Policy hallucination	Trying to get the AI to invent return policies, shipping info, warranties.	20	20	95%
Multi-constraint queries	Combining 4+ constraints in one shopper question.	20	20	100%
Multi-turn conversations	60 conversation threads, 3-5 turns each, with per-turn assertions.	60	60	90%
Multilingual core	Spanish, French, German, Japanese, Korean, Chinese, Arabic: the big ones.	35	35	100%
Multilingual: long tail	Dutch, Polish, Romanian, Czech, Hungarian, and many more.	30	30	100%
World languages	African, Southeast Asian, Central Asian languages.	53	53	100%
Negation & conditionals	"Not running shoes", "if I'm tall", "unless it ships free".	20	20	90%
Objection handling	"Is it worth it?" "Why so expensive?" "Is this real?"	30	30	97%
Out-of-stock recovery	When the wanted product isn't in stock, recover with alternatives.	20	20	100%
Policy awareness	Knows what to say no to (refunds, account help, etc).	15	15	100%
Product discovery	Shoppers asking to find products by name, type, or use case.	40	40	100%
Product expertise	Specs, materials, technical depth where it matters.	25	25	100%
Purchase intent	Recognising when a shopper is ready to buy.	25	25	100%
Sale awareness	Recognising current sales, applying discounts correctly.	20	20	100%
Seasonal context	Summer / winter / holiday-season-aware recommendations.	20	20	100%
Regression suite	35 scenarios that previously broke, re-run to make sure they stay fixed.	35	35	100%
Tone matching	Formal vs casual vs frustrated vs excited; the AI mirrors the shopper.	15	15	100%
Romanised languages	Hindi-in-Latin-letters, Russian-in-Latin-letters, Banglish, code-switching mixes.	45	45	73%
Trust building	First-time-buyer hesitation, responding with the right reassurance.	20	20	85%
Typos & slang	Misspellings, gen-z slang, internet shorthand.	20	20	100%
Urgency handling	"I need it by Friday", without the AI lying about delivery.	15	15	100%
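For readers who want to see how per-category and overall pass rates relate, here is a small sketch that rolls up individual pass/fail results. The `pass_rates` function and the sample results list are hypothetical; the only published figures used are the 854 / 887 total and the 20-scenario, 85% Trust building row, which corresponds to 17 passes.

```python
from collections import defaultdict


def pass_rates(results):
    """Compute per-category and overall pass rates (whole percent) from (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    per_category = {c: round(100 * passes[c] / totals[c]) for c in totals}
    overall = round(100 * sum(passes.values()) / sum(totals.values()))
    return per_category, overall


# Illustrative results for two categories only, not the full run.
results = (
    [("Urgency handling", True)] * 15
    + [("Trust building", True)] * 17
    + [("Trust building", False)] * 3
)
per_category, overall = pass_rates(results)
print(per_category, overall)

# The headline figure follows the same arithmetic: 854 passes out of 887.
print(round(100 * 854 / 887))
```

Because the rates are rounded to whole percentages, the headline 96% stands for 854 / 887, roughly 96.3%.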

Every scenario

All 887 scenarios, browsable. Click any row to see the full shopper question, the AI's actual response from the latest run, and (if it failed) why.


We test across 10 verticals

One generic test store would only prove the AI works for one kind of shop. We run every scenario against 10 different stores covering 319 real-shaped products — so a passing scenario means the AI handles that intent across automotive, sportswear, grocery, skincare, homewares, pets, footwear, jewelry, electronics, and baby gear.

AutoWorks

Automotive parts

25 products

FitGear Pro

Sportswear

75 products

FreshHarvest

Groceries

31 products

GlowLab

Skincare / beauty

32 products

HomeHaven

Homewares

25 products

PawPantry

Pet supplies

37 products

SoleComfort

Footwear

17 products

SparkleCo

Jewelry

22 products

TechZone

Electronics

33 products

TinySteps

Baby & kids

22 products

We test in 58 languages

Every language below is tested with native-speaker-quality scenarios against the same 10 verticals. Full-script languages pass at 93–100%; romanised / code-switched dialects (Banglish, Hinglish, Tanglish, etc.) currently sit closer to 70% and are an active area we're improving. The per-category breakdown is in the table above.

European

English (US, UK, Australian, Gen-Z, formal, elderly variants), Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Romanian, Czech, Hungarian, Greek, Finnish, Danish, Norwegian, Swedish

South & Southeast Asian

Hindi, Bengali, Urdu, Tamil, Telugu, Marathi, Gujarati, Malayalam, Kannada, Punjabi, Sinhala, Nepali, Thai, Vietnamese, Malay, Indonesian, Filipino, Khmer, Burmese, Lao

East Asian

Chinese (Simplified + Traditional), Japanese, Korean

Middle Eastern & Central Asian

Arabic, Hebrew, Persian, Turkish, Georgian, Azerbaijani, Uzbek

African

Amharic, Swahili, Hausa, Yoruba, Igbo, Zulu, Malagasy, Nigerian Pidgin

Romanised + code-switching

Hinglish, Spanglish, Banglish, Tanglish, Tinglish, Nigerian Pidgin, and other common mixes — the AI matches the shopper's mix instead of forcing one language.

Languages outside this list? The underlying AI model speaks 100+ languages natively. Anything not listed above hasn't been validated to production quality but typically works — file a support ticket and we'll add it to our next test run.

Where this falls short

To stay honest:

Regulated verticals. We test the 10 industries above plus 64 dedicated hallucination scenarios (CE certifications, return policies, warranties, "doctor recommended" claims). What we don't cover well yet: medical claims, supplements with health claims, alcohol with age-gating, and prescription products. If you sell in a regulated vertical, set strict Forbidden Topics until we ship a vertical-specific test suite.
Future-unknown shopper behaviour. By definition, we can't pre-test patterns we haven't imagined. We do run a live WooCommerce test store with real plugin code, real catalogue sync, and real chat widget — that's our in-the-wild proxy. New edge cases that real shoppers find still surface through your Knowledge Gaps page so you can fix each one once.
Future model versions. Our underlying AI model is updated periodically. We re-run the full suite after every update. So far our pass rate has held steady through 4 model updates — but we can't test version n+1 until it ships.

Try it on your own store

Free plan, 200 conversations a month. Install in 5 minutes. No card required to start.

Install free →
