Scott Galloway
Here he is on scraping data:
Reddit Sues Anthropic Over AI Training Data
Reddit has filed a lawsuit against Anthropic, accusing the AI firm of illegally scraping its site over 100,000 times since mid-2024 to train its AI models. Unlike OpenAI and Google, which signed formal licensing deals with Reddit in 2024, Anthropic reportedly refused to enter an agreement.
Reddit’s 20-year warehouse of user-generated content is a gold mine for AI training: It’s authentic, organized by topic, ranked by a community-driven voting system, and growing fast.
As of January 2025, Reddit has 1.1 billion monthly unique users, a nearly 50% jump since 2022. In the second half of 2024, users posted nearly 6 billion pieces of content — a 12% increase from the first half of the year.
Data licensing has become a lucrative business for Reddit, generating $130 million in 2024 — roughly 10% of its total revenue. The value of Reddit’s data will likely rise as the supply of high-quality language data continues to shrink.
Large language models like ChatGPT and Gemini have already consumed Wikipedia, nearly every published book, and much of the open internet. Each new model demands more data than the last, pushing the industry to chase a shrinking pool of fresh, human-generated content.