INSIGHTS

Prompt Volume Data Is Stolen And Useless

Marketing teams should avoid solutions built on prompt volume data. It's attractive at first glance, but unethical, misleading, and useless.

Keller Maloney

Unusual - Founder

Apr 19, 2026

The "real prompt data" that AEO and GEO vendors sell as a window into what buyers are actually asking ChatGPT is, in many cases, scraped from a small set of users who unwittingly installed spyware Chrome extensions that recorded their ChatGPT conversations and sold them on the black market. That is the clearest conclusion from security research published this month.

Beyond the ethical and legal concerns of this finding, it also reveals something more important: prompt volume datasets are sufficiently flawed to render them useless for AI brand tracking.

The fact that it is stolen means you don't want it. The dataset captures the conversations of people who installed shady Chrome VPN extensions; how confident are you that that group represents your buyers?

In almost every case, the answer is straightforward: it doesn't capture your buyers' conversations with AI. It captures some other buyers' conversations with AI. Teams who rely on data that doesn't represent their own buyers will draw faulty conclusions.

The takeaway is that marketers evaluating AI brand perception tools should avoid solutions built on prompt volume data. The data is fundamentally biased, and building strategy based on biased data will do nothing in the best case and will do harm in the worst case.

Secret prompt harvesting

Koi Security found that Urban VPN, a Chrome extension with over 6 million users and a "Featured" badge in both the Chrome and Edge stores, has been secretly harvesting every AI conversation its users have had since a July 2025 silent auto-update. Across Urban VPN and its sister extensions from the same publisher, over 8 million users are affected.

The harvested data includes prompts, model responses, and timestamps across major AI models. The extension intercepts this traffic and ships it to a data broker called BiScience, which packages and resells it as "marketing analytics." AI search analytics tools buy the data from BiScience and sell it to their customers.

Users never consented. The store listings claimed user data was not sold to third parties. It was.

Why they need to steal the data

OpenAI doesn't release prompt data. Neither does Anthropic, Google, or any other major AI provider. There is no legitimate, at-scale source of real user prompts for purchase.

The data exists only because someone took it without users' knowledge.

Why the data is useless

There are four major flaws with prompt volume data that render it useless for strategy.

The sample is biased. The dataset isn't what people ask ChatGPT. It's what people who installed a shady VPN extension ask ChatGPT. The sophistication of that group almost certainly doesn't match that of an enterprise buyer.

The data is too narrow. These datasets count specific prompts, not entire conversations. This is not how buyers actually interact with AI. The average ChatGPT conversation is eight messages long. Buyers start broad, add constraints like budget, integrations, and team size, and only accept a recommendation once the model has absorbed their full context. A dataset of one-off prompts can't reconstruct the conversation around it.

You can't see the user's context. AI models personalize their responses to the user asking the question. If 1,000 people ask an AI model, "What car should I buy?", the model will respond differently to each person (it might recommend an SUV to a family with kids, a sports car to a single person). Even with a perfect list of buyer prompts, you cannot see what ChatGPT is telling those buyers about you, because the answer changes every time the question is asked.

Prompt tracking is fragile. I ran an experiment on 100 prompts asking about CRMs. I ran each prompt, then rewrote each one with a single meaningless synonym swap ("best CRM" to "top CRM", "scalable" to "high-volume"), keeping everything else identical. The AI responses between the two sets were wildly unstable: a single brand's "share of voice" moved by as much as 17% from the original prompts to the synonym-swapped ones, 33% of the vendors in a typical answer changed identity between versions, and only 16 of 100 prompt pairs produced identical vendor sets. Harvested prompts capture the exact phrasings some group of users happened to type, and your buyers will phrase things differently.
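The fragility metrics above can be sketched in a few lines. This is a minimal illustration, not the experiment's actual code or data: the vendor lists are made up, and the two metrics shown (share-of-voice shift and identical-vendor-set count) are computed on that toy data.

```python
def share_of_voice(brand, responses):
    """Fraction of responses that mention the brand."""
    return sum(brand in vendors for vendors in responses) / len(responses)

# Each inner list holds the vendors an AI model named for one prompt.
# These lists are hypothetical, for illustration only.
original = [["Salesforce", "HubSpot", "Zoho"],
            ["HubSpot", "Pipedrive", "Zoho"],
            ["Salesforce", "HubSpot", "Pipedrive"]]

# The same prompts after a meaningless synonym swap ("best" -> "top").
swapped = [["Salesforce", "Zoho", "Freshsales"],
           ["HubSpot", "Pipedrive", "Zoho"],
           ["HubSpot", "Zoho", "Freshsales"]]

# Share-of-voice shift for one brand between the two prompt sets.
delta = (share_of_voice("Salesforce", swapped)
         - share_of_voice("Salesforce", original))

# How many prompt pairs returned the identical vendor set?
identical_pairs = sum(set(a) == set(b) for a, b in zip(original, swapped))

print(f"share-of-voice shift: {delta:+.0%}")      # -33% on this toy data
print(f"identical vendor sets: {identical_pairs} of {len(original)}")
```

On real data, the same two numbers (per-brand share-of-voice delta and the count of stable vendor sets) are enough to quantify how much a trivial rewording perturbs the answers.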

The implication

Marketing teams should avoid AI brand tracking tools that are built on prompt volume data. It's attractive at first glance, but the fact that it is stolen is precisely why it's biased: it only captures the people it was stolen from.

Building strategy based on biased data will do nothing in the best case and might backfire in the worst case.