More Sites Blocking LLM Crawling — Could That Backfire on GEO?

Hostinger Analysis Shows Divergence Between AI Training and Assistant Crawlers

Hostinger recently released an analysis indicating that while businesses are increasingly blocking AI systems used to train large language models (LLMs), they continue to permit AI assistants to access and summarize their websites. The study, which examined 66.7 billion bot interactions across five million websites, found that AI assistant crawlers—such as those employed by ChatGPT—are reaching more websites even as companies restrict other forms of AI access.

About the Hostinger Analysis
Hostinger is a web hosting provider that also offers a no-code, AI agent-driven platform for building online businesses. The company analyzed anonymized website logs to evaluate how verified crawlers access sites at scale, enabling a comparison of trends in how search engines and AI systems retrieve online content.
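Hostinger has not published its exact methodology, but the coverage figures that follow can be read as the share of distinct sites a given crawler reached during each measurement window. A minimal sketch of that kind of calculation, assuming a parsed log of (site, user-agent) pairs and using illustrative crawler tokens, might look like this:

    from collections import defaultdict

    # Illustrative user-agent substrings; identifying "verified" crawlers for real
    # would also require checking each vendor's published IP ranges or reverse DNS.
    CRAWLER_TOKENS = {
        "gptbot": "OpenAI training (GPTBot)",
        "oai-searchbot": "OpenAI assistant/search (OAI-SearchBot)",
        "applebot": "Applebot",
        "meta-externalagent": "Meta training (Meta-ExternalAgent)",
    }

    def coverage(log_rows):
        """Share of distinct sites each crawler reached in one window."""
        sites_per_crawler = defaultdict(set)
        all_sites = set()
        for site_id, user_agent in log_rows:
            all_sites.add(site_id)
            ua = user_agent.lower()
            for token, label in CRAWLER_TOKENS.items():
                if token in ua:
                    sites_per_crawler[label].add(site_id)
        return {label: len(sites) / len(all_sites)
                for label, sites in sites_per_crawler.items()}

    rows = [
        ("site-a", "Mozilla/5.0 (compatible; GPTBot/1.2)"),
        ("site-b", "Mozilla/5.0 (compatible; GPTBot/1.2)"),
        ("site-c", "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"),
    ]
    print(coverage(rows))
    # GPTBot reached 2 of 3 sites; OAI-SearchBot reached 1 of 3.

Hostinger's actual figures come from verified crawlers across roughly five million sites and 66.7 billion bot interactions; this sketch is only meant to make the "share of sites reached" metric concrete.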

The findings reveal that AI assistant crawlers expanded their reach over a five-month period, with data collected during three six-day windows in June, August, and November 2025:

  • OpenAI’s SearchBot increased coverage from 52% to 68% of sites.

  • Applebot, which indexes content for Apple’s search features, doubled its reach from 17% to 34%.

  • Traditional search engine crawlers remained largely constant during the same period.

These trends suggest that AI assistants are adding a complementary layer to information discovery rather than replacing conventional search engines.

Blocking AI Training Crawlers
Simultaneously, businesses are significantly restricting access for AI training crawlers:

  • OpenAI’s GPTBot fell from accessing 84% of websites in August to 12% by November.

  • Meta’s ExternalAgent decreased coverage from 60% to 41% of websites.

These crawlers collect data over time to improve AI models and update their parametric knowledge. Many organizations are limiting access either to control how their data is used or because of concerns about copyright infringement.
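In practice, this split is usually expressed in robots.txt, where each crawler or control is addressed by its published user-agent token. A minimal sketch, assuming the tokens the vendors document at the time of writing (GPTBot, Meta-ExternalAgent, and Applebot-Extended for training use; OAI-SearchBot and Applebot for assistant and search features), might look like this:

    # Disallow crawlers and tokens used for model training
    User-agent: GPTBot
    Disallow: /

    User-agent: Meta-ExternalAgent
    Disallow: /

    User-agent: Applebot-Extended
    Disallow: /

    # Allow assistant and search crawlers so content stays citable in AI answers
    User-agent: OAI-SearchBot
    Allow: /

    User-agent: Applebot
    Allow: /

    # All other crawlers, including traditional search engines, remain allowed
    User-agent: *
    Allow: /

These directives are advisory: they only work to the extent that each crawler honors robots.txt, and the tokens themselves can change, so each vendor's current documentation is the authoritative reference.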

Understanding Parametric Knowledge
Parametric knowledge, also referred to as parametric memory, consists of information embedded directly into a model during training. The term “parametric” reflects that knowledge is stored in the model’s parameters (weights). This knowledge serves as a long-term memory of entities such as people, companies, and products.

When an LLM receives a query, it may recognize an entity and retrieve the associated learned information. By blocking a training bot, a business prevents the model from learning directly about its brand, potentially limiting visibility within AI-driven systems. Conversely, allowing AI training bots to crawl a website provides an opportunity to shape the model’s understanding, including branding, products, and services. Informational sites may particularly benefit from being cited in AI-generated answers.

The Paradox: Opting Out of Parametric Knowledge
Hostinger’s analysis highlights a notable paradox:

“Companies are aggressively blocking AI training bots—the systems that scrape content to build AI models. OpenAI’s GPTBot dropped from 84% to 12% of websites in three months. However, AI assistant crawlers, the technology that ChatGPT, Apple, etc. use to answer customer questions, are expanding rapidly. OpenAI’s SearchBot grew from 52% to 68% of sites; Applebot doubled to 34%.”

Blocking AI training bots effectively keeps a site's first-party content out of LLMs' parametric knowledge, leaving models to rely on whatever third-party information they encounter instead. This approach limits the organization's ability to communicate its own story directly to AI systems.

Industry Perspectives on Blocking LLMs
A recent Reddit discussion illustrates the rationale for restricting LLM access to protect intellectual property:

“I want to make sure my site is continued to be indexed in Google Search, but do not want Gemini, ChatGPT, or others to scrape and use my content.”

Participants noted that unique content, such as niche instructional material, is at risk of being reproduced by AI systems, reducing traffic to the original site. While blocking may be justified for sites publishing highly specialized content, it is less clearly advantageous for sites with non-unique material, such as product-review or e-commerce platforms. In such cases, allowing AI systems to incorporate their content into parametric memory can improve AI visibility and influence.
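For the specific goal in that thread, keeping Google Search indexing while opting out of AI training, the relevant detail is that Google uses a separate robots.txt token, Google-Extended, to control whether content may be used for its AI models; blocking it does not affect Googlebot or Search indexing. A minimal robots.txt sketch of that configuration, again assuming the tokens documented at the time of writing, might be:

    # Keep conventional Google Search crawling and indexing
    User-agent: Googlebot
    Allow: /

    # Opt out of content being used for Google's AI models (does not affect Search)
    User-agent: Google-Extended
    Disallow: /

    # Opt out of OpenAI model training
    User-agent: GPTBot
    Disallow: /

As with any robots.txt rule, this controls only crawlers that choose to respect it.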

Brand Messaging Risks in the Age of LLMs

As AI assistants increasingly provide direct answers to user queries, businesses face the possibility that users may obtain information without visiting their websites. This shift can reduce direct traffic and limit exposure to key content, such as pricing, product context, and brand messaging. When organizations block LLMs from acquiring knowledge about their offerings, they effectively rely on traditional search crawlers and indexes to convey this information—an approach that may only partially compensate for the loss of direct engagement.

The growing adoption of AI assistants has implications beyond marketing, extending to revenue forecasting and analytics. When AI systems summarize products, offers, and recommendations, companies that restrict LLM access lose control over how pricing, value propositions, and other key information are presented. Advertising visibility diminishes earlier in the customer decision process, and e-commerce attribution becomes more complex, as purchases may occur in response to AI-generated summaries rather than direct site visits.

Hostinger notes that some organizations are becoming increasingly selective about the content made available to AI, particularly AI assistants.

Tomas Rasymas, Head of AI at Hostinger, observed:

“With AI assistants increasingly answering questions directly, the web is shifting from a click-driven model to an agent-mediated one. The real risk for businesses isn’t AI access itself, but losing control over how pricing, positioning, and value are presented when decisions are made.”

Key Takeaway

Blocking LLMs from using website content for training should not necessarily be the default strategy. While concerns about AI learning from proprietary content are understandable, organizations may benefit from a measured approach that weighs the visibility gained from being represented in AI systems' parametric knowledge against the risk of their content being summarized or reproduced without a visit.