
The language on some private Facebook groups and Telegram channels becomes strangely technical late at night. Sellers promote “verified labeling accounts,” picture archives, and occasionally even whole datasets that were gathered years ago and then forgotten. The posts may appear to be ordinary internet spam to someone who is not involved in the artificial intelligence sector.
However, they hint at something more peculiar within the AI economy: a quiet marketplace for training data. Much as factories once ran on coal, the current AI boom runs on information. Voice assistants, large language models, and image generators need examples at massive scale: billions of images, snippets of conversation, and pieces of text. Without those examples, the systems simply cannot learn.
| Key Information | Details |
|---|---|
| Industry | Artificial Intelligence Training Data Market |
| Estimated Market Value | ~$2.5 billion currently |
| Forecast Growth | Up to ~$30 billion within a decade |
| Key Players | OpenAI, Google, Meta, Microsoft, Amazon |
| Data Types | Images, text, videos, chat logs, voice samples |
| Data Brokers | Defined.ai, Shutterstock, and independent suppliers |
| Data Sources | Archived websites, media libraries, contractor datasets |
| Ethical Concerns | Copyright, consent, personal data exposure |
| Key Activity | Buying, licensing, and reselling datasets for AI model training |
| Reference | https://www.reuters.com/technology |
And an unanticipated secondary economy has emerged as a result of that demand. According to market researchers, the global market for AI training data is currently valued at about $2.5 billion and could reach $30 billion in the next ten years. The majority of that trade takes place through legal licensing agreements between content creators and tech firms. Publishing companies, media archives, and stock image libraries have found that old data can suddenly regain value.
However, not all datasets change hands through official contracts. Some drift into murkier channels. This shadow market has a recognizable starting point: it usually begins when gig workers involved in AI training, the people who label photos or rate chatbot responses, start discussing their accounts in private.
The work itself is peculiar if you have never seen it. Around the world, contractors sit in internet cafés or bedrooms and click through thousands of photos, annotating objects, grading sentences, and editing chatbot responses. One worker in Nairobi once described it as “teaching a machine to notice things humans already understand.”
It can pay surprisingly well. Some specialized tasks pay more than $100 per hour, which makes access to those accounts valuable. Soon enough, a secondary trade appears. Verified accounts for platforms connected to AI training companies begin circulating online, offered to buyers who either failed the screening tests or live in regions where projects aren’t available. Some vendors demand payment up front. Others want a portion of future earnings.
Investigators have uncovered more than a hundred social media groups openly advertising these accounts. In many cases, the companies involved ban the practice and try to shut down listings. But new groups appear almost immediately, moving conversations to encrypted messaging apps where moderation becomes harder.
Watch these groups for a while and a recurring pattern emerges: buyers worry about being scammed, sellers worry about being caught, and everyone insists their access is legitimate.
However, the account trade is just one aspect of the situation. The vast collections of text, images, and videos that teach AI systems how the world functions are the center of the larger economy.
These days, old internet archives are especially interesting. Consider Photobucket, a site that held tens of billions of images in the early days of social media. Those pictures, digital artifacts from an earlier internet, were largely ignored for years. Tech companies now view them differently: as instructional material.
Executives at several data companies have quietly described licensing deals priced from a few cents to several dollars per image, depending on each image’s quality and uniqueness. Rare datasets, voice recordings, and videos can fetch even higher prices.
Entire content libraries are being repurposed in certain instances. It seems as though the neglected corners of the internet are suddenly valuable again. But the process raises awkward questions. AI systems sometimes replicate parts of their training data: an identifiable watermark, a line of copyrighted text, or even an image that closely resembles a real person.
That possibility makes dataset sourcing a delicate business, and regulators and attorneys have begun to take notice.
A growing number of copyright lawsuits have already been filed against AI developers who are alleged to have trained systems using content that was illegally scraped from websites. As a result, businesses have begun negotiating licensing agreements with online forums, publishers, and image libraries. It’s like a gold rush.
These days, data brokers offer photographers, podcasters, filmmakers, and medical researchers contracts to license their content for AI training. In some cases, a portion of the profits flows back to the creators.
In others, the origin of the data is harder to trace. The speed of the AI race itself may help explain the shadow market’s existence. Every year, tech companies build bigger models, and each one needs more training examples. The pressure to gather data quickly is enormous.
And gray markets frequently emerge when industries move this quickly. Watching it all unfold, you get the strange sense that the raw material of artificial intelligence isn’t code or processors. It’s human experience itself: photographs, conversations, articles, recordings, collected across decades of internet activity.
Now those fragments are being gathered again, sorted, labeled, and fed into machines learning to mimic human behavior. Some of it happens through billion-dollar contracts signed in quiet boardrooms. Some of it happens in encrypted chat groups where anonymous sellers promise access to datasets nobody else can find. Both worlds, in their own way, are helping build the same thing.
