Baidu Forbids Google and Bing from Scraping Content to Train AI
Chinese internet search giant Baidu recently updated its Baike service to stop Google and Microsoft Bing from scraping its content. This change aims to protect Baidu’s valuable data amid rising demand for large datasets used in training AI models.
Baidu’s Robots.txt Update Blocks Googlebot and Bingbot
On August 8, Baidu changed the robots.txt file on its Wikipedia-like Baike platform. This update blocks Googlebot and Bingbot crawlers from accessing Baidu Baike’s central repository. Baidu Baike hosts around 30 million entries, a vast resource previously accessible to Google and Bing.
Before this update, only some Baike subdomains were blocked, while the main repository remained open to these search engines. The new robots.txt rules now shut off access to the core content as well.
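The article does not reproduce the updated file, but a block of this kind is usually written by naming each crawler's user agent and disallowing the whole site. The following is a minimal sketch of such directives, not Baidu's actual file:

    # Block Google's crawler from the entire site
    User-agent: Googlebot
    Disallow: /

    # Block Microsoft Bing's crawler from the entire site
    User-agent: Bingbot
    Disallow: /

Crawlers that honour the robots exclusion protocol stop fetching new pages, but the directives are advisory and do not remove material already cached or indexed, which is consistent with Baike entries still appearing in search results for now.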
Context: Growing Demand for AI Training Data
This move by Baidu reflects a global trend. Companies increasingly restrict web scraping to protect their content as AI development relies heavily on large datasets. For example, in July, Reddit blocked many search engines except Google from indexing its posts.
Unlike Reddit, which has a financial agreement with Google for data access to train AI, Baidu has chosen to block Google and Bing entirely. Meanwhile, Microsoft reportedly considered restricting internet search data access for rival AI services within the past year.
Baidu Baike Content Still Appears in Search Results
Despite the new restrictions, Baidu Baike content still appears in Bing and Google search results. The South China Morning Post found that many Baike entries remain indexed. This likely happens because search engines use cached versions of the content from before the block.
At the same time, the Chinese-language Wikipedia, which has about 1.43 million entries, remains open to search engine crawlers.
Partnerships Between AI Developers and Content Publishers
The data protection moves come as AI developers seek high-quality content for training. OpenAI recently signed agreements with Time magazine and the Financial Times, gaining access to extensive archives.
Such partnerships highlight the growing value of carefully curated datasets. Baidu’s restriction signals that valuable content owners want control over how their data is accessed and used.
The Future of Data Access in the AI Era
Online platforms are changing how they manage content access. Many now limit or monetise data sharing to protect their assets and benefit financially.
As AI grows, more companies will likely review their data-sharing policies. This could lead to further changes in how search engines index and access online information.