What Enterprises Really Ask Before Trusting AI with Consumer Research
“If we save time and budget but the direction is wrong, we’ll only drift further off course.”
“Luxury purchases sometimes have no logic — even consumers themselves can’t explain what triggered the buy. How does AI handle this unpredictable humanity?”
“On social media, ten people review a product — eight probably never bought it.”
“Can this duplicate and replace traditional market research?”
“I don’t care about consistency. What I care about is accuracy — whether what it infers actually matches what this person really thinks.”
“AI tends to be overly rational — it lacks the irrational, unconventional reactions of real users.”
AI consumer simulation builds models of real consumers from behavioral data, then runs research with those models as you would with real respondents. The models are called synthetic consumers. The field is new, growing fast, and has no shared standard for measuring its own accuracy. Nobody has established how far these digital constructs are from the real people they claim to represent in a live commercial scenario.
We build synthetic consumers. Over the past year, we sat across the table from consumer research teams at some of the world's largest companies — FMCG, luxury, food & beverage, automotive, internet, beauty — as they evaluated whether this technology could play a role in understanding their customers. We recorded the technical questions from 36 of these conversations.
After removing commercial and pricing discussions, we were left with over 150 distinct technical concerns. They cluster into four questions that every enterprise works through before deciding whether to trust AI-generated consumer insights enough to act on them. Taken together they draw the map of what “trust” actually requires in this space.
What are enterprises really asking?
Data sourcing
“Did you find the right people to model?”
Simulation fidelity
“How accurately does it reconstruct a real consumer?”
Differentiation
“Why can't I just use ChatGPT?”
Security & access
“Can this get through our front door?”
01. Data sourcing — “Did you find the right people?”
When enterprises learn that synthetic consumers are built from online data, the reaction comes quickly. The skepticism isn't about AI — it's about online data as a source for understanding real consumers.
The concern surfaces in three layers.
The first is noise. A battery company's R&D team described trying to research their own product category on social media — and finding nothing but a wall of sponsored content:
"There's a ton of advertising in there… inherent bias. How do you validate that? How do you prove the insight you end up with is actually correct?"
Behind the noise is the question of who is actually speaking. A fintech research team put it plainly: influencer accounts are a business unto themselves, with manufactured personas and scripted content.
"The persona itself is probably manufactured — it's not real."
A model trained on this material may only be learning a performance.
The second layer is structural absence. In many categories, the people generating social media data are categorically different from the people being studied:
"The actual users of adult diapers are elderly people in their sixties and seventies who basically don't use social media. The people discussing this category online are their children. Their children's views and the elderly person's own feelings may be completely different."
B2B faces the same gap — procurement directors and factory floor managers don't share purchase decision logic on social platforms. Luxury has a different version: in high-consideration categories, a large share of online reviewers never purchased the product. The data surface exists; the signal underneath it is someone else's behavior.
The third layer is about depth. When a team asks how many real accounts a single synthetic consumer is built from, or what the minimum information required for a credible one looks like, they're asking a question the field hasn't answered:
"Harry Potter took seven books to build one character. You have 300,000 personas — what's each one's 'seven books'?"
The underlying question is foundational: what is the minimum viable representation of a person, and what does the richest case look like? Without an answer, buyers have no way to know whether what they're paying for is genuine insight or sophisticated hallucination.
02. Simulation fidelity — “How accurately does the AI reconstruct a real consumer?”
2.1 Individual reconstruction — “How far is it from the real person?”
Over half the concerns we recorded land on “fidelity.” But the word conceals significant variation — teams asking whether synthetic consumers are “realistic enough” turn out to be asking three fundamentally different questions.
The insights lead at an FMCG group drew the line immediately. He'd tested multiple AI tools and found the industry's go-to quality check — asking the same question repeatedly and seeing if the answer holds — entirely beside the point:
"I don't care about consistency — whether it gives the same answer twice. I care about accuracy — whether what it infers matches what this person actually thinks."
Consistency is a technical metric with established benchmarks. What he's asking for — whether the model can reconstruct how a real person actually makes decisions — is what the field calls fidelity. It's a business metric, and the industry has no agreed-upon way to measure it.
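The gap between the two metrics can be made concrete with a small sketch. It is only an illustration under stated assumptions: the function names, the exact-match scoring, and the toy answers are all hypothetical, and a real fidelity measure would need far more than answer matching.

```python
from collections import Counter


def consistency(repeated_answers):
    """Share of repeated runs that match the most common answer (1.0 = always identical)."""
    counts = Counter(repeated_answers)
    return counts.most_common(1)[0][1] / len(repeated_answers)


def accuracy(model_answers, real_answers):
    """Share of questions where the model's answer matches the real person's answer."""
    matches = sum(m == r for m, r in zip(model_answers, real_answers))
    return matches / len(real_answers)


# Hypothetical data: the same question put to one synthetic consumer five times.
repeated_runs = ["brand_a"] * 5

# Hypothetical data: four questions answered by the model and by the real person it models.
model_answers = ["brand_a", "yes", "weekly", "price"]
real_answers = ["brand_b", "yes", "monthly", "habit"]

consistency_score = consistency(repeated_runs)         # 1.0: perfectly repeatable
accuracy_score = accuracy(model_answers, real_answers)  # 0.25: wrong three times out of four
```

In this toy case the synthetic consumer is perfectly consistent and still wrong on three of four questions, which is the distinction the insights lead was drawing.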
A luxury research director in Switzerland approached from a different angle. His team relies on depth interviews to surface the contradictions and irrational impulses that actually drive purchase decisions. His concern is that AI smooths away precisely the messiness that makes insight useful:
"AI tends to converge results toward a rational, explainable mean. But real consumer decision distributions aren't normal."
Preferences and attitudes leave traces and are tractable to simulate. But an internet platform's research lead found that once you cross category boundaries, the picture shifts entirely:
"One person's decision logic for buying milk versus buying a tech device might be completely different."
This isn't the familiar attitude-versus-behavior gap. It's that a single person runs fundamentally different decision-making processes depending on the category — and whether simulation technology can follow someone across those switches remains unresolved.
Some teams pushed the questioning beyond individual accuracy into temporal and group-level concerns. An edtech company ran several rounds of testing and found the same synthetic consumer producing different answers over time:
"You ask the same question at a different time, you get a different answer — is that simulating how people change, or is the model itself unstable?"
In group settings, a different worry emerged. When multiple synthetic consumers were assembled into a focus group, the uniformity of their responses was striking:
"Every persona gave the same answer to the same question — that's not what happens when you're sitting across from real people."
If every synthetic consumer is permanently stable and highly consistent, what a researcher holds isn't a population — it's a polished mean. The variance within groups, the drift of individuals over time — the things traditional research builds expensive longitudinal panels to capture — are precisely what AI is most likely to flatten out.
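One rough way to check for a "polished mean" is to compare how spread out the answers are in a synthetic panel and in a human panel asked the same question. The sketch below uses Shannon entropy as the spread measure; that choice, the function name, and the toy panels are assumptions made for illustration, not a description of any established protocol.

```python
import math
from collections import Counter


def response_entropy(answers):
    """Shannon entropy in bits of the answer distribution; 0.0 means everyone agreed."""
    total = len(answers)
    counts = Counter(answers)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical panels answering the same single-choice question.
human_panel = ["a", "b", "a", "c"]
synthetic_panel = ["a", "a", "a", "a"]

human_spread = response_entropy(human_panel)          # 1.5 bits
synthetic_spread = response_entropy(synthetic_panel)  # 0.0 bits: the panel collapsed to one answer
```

A synthetic panel whose spread sits far below the human panel's on the same question would suggest that within-group variance is being flattened, though a single question is far too little evidence on its own.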
2.2 Capability boundaries — “Which research scenarios work, which don't?”
The second line of questioning skips “how close” and goes straight to testing limits against specific use cases.
A food company tried using synthetic consumers to screen new flavors and hit a wall immediately: the model has no body.
"For something like food and beverage — where you need to actually taste it to have an opinion — on what basis is AI making sensory judgments?"
An internet company's UX team ran into the same barrier. They need the immediate responses of users swiping, tapping, hesitating, abandoning — and the synthetic consumer has never interacted with their product:
"We need feedback from users who've actually used the product. It's never touched our interface — what's its opinion based on?"
It's not just the outcome that matters — the interaction process itself carries information:
"Most of what we're testing isn't a static screen — it flows."
New product teams face an even more extreme version. When the product hasn't launched yet, the synthetic consumer has no experience to draw on — only analogical reasoning:
"The product isn't live yet and no general-purpose LLM has ever seen this product form. Can you really get useful feedback just by describing the concept in a prompt?"
A beauty brand hit yet another boundary when testing packaging visuals. The AI's choices appeared to follow its own aesthetic pattern rather than reflecting how real consumers evaluate design:
"AI seems to favor high-contrast colors when selecting packaging — we suspect that's the model's own visual bias, not how real consumers actually choose."
| Scenario | Experiential cognition required |
|---|---|
| 01 Food & beverage | Taste and sensory perception |
| 02 UX & interface testing | Swipe, tap, and real-time interaction |
| 03 Unlaunched products | Analogical reasoning beyond training data |
| 04 Packaging & visual design | Aesthetic judgment and visual preference |
These scenarios share a common structure: they require first-hand experiential cognition — the body's sensory response, the fingertips navigating an interface, the genuine reaction to encountering something for the first time. Simulation technology cannot provide this in principle. No one has mapped where these boundaries lie; each company is discovering them through trial and error.
2.3 Knowledge reliability — “Can you trust what it says?”
The third line of questioning moves past “how close” and “where does it work” to a more fundamental risk: whether the information a synthetic consumer outputs is itself reliable.
LLM hallucination — output that sounds coherent and logically consistent but is entirely fabricated — is a known problem. It's easy to catch in everyday contexts, but once professional domains are involved, the stakes rise sharply:
"People inside the industry can spot the obvious errors right away."
The operative phrase is “inside the industry.” In one food R&D evaluation, a synthetic consumer's description of a testing method raised immediate alarm:
"The AI told us consumers could use test strips to detect it — we checked, and that's simply not possible with current technology."
A similar case surfaced in a discussion of baking formulations, where the synthetic consumer confused the application contexts of a specific ingredient:
"AOP butter goes in bread and pastry, not cake. If a real person made that mistake in an interview, we'd throw out their data."
In traditional research, human respondents getting facts wrong isn't unusual — you filter them out and continue. But when the same error comes from a synthetic consumer, nobody's first instinct is to “filter out and retry.” The instinct is to question whether the entire system is trustworthy. The same factual mistake: from a human, it's sampling noise; from a machine, it's systemic risk. That asymmetry in tolerance is itself worth examining.
At this point in our conversations, the dynamic shifted. Enterprises stopped only probing for weaknesses and started offering paths forward — could domain knowledge be injected into the model?
"Could we do some optimization on our end… help it understand rigid packaging better?"
A healthcare platform went further, arguing that the depth of industry-specific training data requires collaboration from within the sector:
"Pharma-level corpus depth — probably only a platform like ours could provide that kind of partnership. You'd need full-text comprehension of their academic positions."
Others proposed connecting the model to the enterprise's existing proprietary data:
"I have an ingredients database… how do I get the model to actually use that in research?"
Of the three lines of questioning in this section, this is the only one where enterprises shift from challenger to co-builder. The first two probe the ceiling of what simulation can do; this one begins discussing how to raise it.
03. Differentiation from general AI — “Why can't I just do this with ChatGPT?”
This question tends to arrive at a specific moment: after a team has become interested and is now thinking about how to justify the investment internally.
"I've been trying to give ChatGPT memory and context to simulate consumers. Every platform is doing something similar. What's the essential difference?"
The underlying situation is the same across all of these: a team has already prompted a general-purpose LLM to role-play as a consumer segment, gotten plausible-sounding results, and now needs to understand whether a purpose-built system produces meaningfully different outputs — different enough to justify procurement, integration, and organizational change.
There's also the question of defensibility. An industry partner raised concern about large consulting firms building similar tools:
"If any company with LLM capability can do the same thing, what's the actual technical barrier?"
At its core, this category is a procurement justification question, not a technical one. When a team asks “what's the essential difference?”, they're asking for language they can use internally — with procurement, with IT, with a CFO who already has a ChatGPT subscription. A compelling answer isn't primarily technical. It's about the standards of evidence the tool is held to, and whether those standards are visible and auditable.
04. Data security & enterprise access — “Can this even get through our front door?”
One global company's IT department blocked access to the product website on the first attempt — the company's security policy screens all AI tools by default. The evaluation ended in its first five minutes. This is not an unusual story.
The concerns fall into three categories: data flow (does input leave the organization's control?), data sovereignty (where is data stored and processed, and does that create jurisdictional compliance issues?), and training contamination (does user input feed back into external model development?). On that last point, the line is clear:
"The consumer insights we've accumulated are our core IP. Once uploaded to the platform, they can't leak out, and they won't serve as input for training external models — that's a red line."
Private deployment is the request behind many of these conversations — and where the gap between what enterprises need and what AI tools can currently offer is widest.
What the pattern reveals
Traditional research built trust through disclosed process: a methodology section you could read and evaluate — how the sample was constructed, how participants were recruited, what the margin of error was, where error sources were named. AI simulation doesn't have the equivalent of a methodology section. The four categories above are the enterprise buyer's attempt to construct one, in the absence of the field providing it.
What each category is actually asking for:
| Gate | Validity required | Core verification |
|---|---|---|
| 01 Data sourcing | Population Validity | Confidence that the right people were the basis for the model. If the foundation population is misaligned or filled with noise, everything downstream is hallucination. |
| 02 Simulation fidelity | Construct Validity | Confidence that the model captures what it claims to capture: irrational impulses, preference shifts, decision-making detours that averages would erase. |
| 03 Differentiation | Comparative Validity | Evidence that purpose-built simulation produces outputs materially different from general-purpose prompting: auditable, repeatable, and traceable. |
| 04 Security & access | Enterprise Operability | The governance infrastructure (sandboxed deployment, zero training-set contamination) that makes deployment possible at all. |
These are not product-specific requirements. They are what the field needs to establish before AI consumer simulation functions as a standard research tool rather than an experimental one.
What comes next
We chose to publish these questions openly — rather than answering them one meeting at a time — because they belong to the entire field. We're working through them ourselves using train/test validation against real consumer data, blind-test protocols, and consistency scoring calibrated to human self-agreement baselines.
We'll follow this map with a framework for how sampling, construction, and simulation connect as phases — where the quality of each determines the reliability of the next.
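As a hedged illustration of what "calibrated to human self-agreement baselines" could mean in practice, the sketch below divides a model's agreement with a person's held-out answers by that person's own test-retest agreement. This is one possible reading, not a description of our production method; the function names, the exact-match scoring, and the two-wave toy data are all assumptions.

```python
def agreement(answers_a, answers_b):
    """Share of positions where two answer lists match exactly."""
    return sum(a == b for a, b in zip(answers_a, answers_b)) / len(answers_a)


def calibrated_score(model_answers, human_wave1, human_wave2):
    """Model-to-human agreement relative to the human's own test-retest agreement.

    human_wave2 plays the role of held-out test data; a score near 1.0 would mean
    the model agrees with the person about as often as the person agrees with themselves.
    """
    ceiling = agreement(human_wave1, human_wave2)
    if ceiling == 0:
        raise ValueError("human self-agreement is zero; the score is undefined")
    return agreement(model_answers, human_wave2) / ceiling


# Hypothetical data: one person answers the same five questions in two waves.
wave1 = ["a", "b", "c", "d", "e"]
wave2 = ["a", "b", "c", "d", "x"]  # self-agreement: 4 of 5

# Hypothetical model answers, scored against the held-out second wave.
model = ["a", "b", "z", "y", "x"]  # agreement with wave2: 3 of 5

score = calibrated_score(model, wave1, wave2)  # approximately 0.75
```

The point of the normalization is that a real person does not answer identically twice, so raw agreement of 60% reads differently when the human ceiling is 80% than when it is 100%.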