Hostinger released an analysis showing that businesses are blocking AI systems used to train large language models while allowing AI assistants to continue to read and summarize more websites. The company examined 66.7 billion bot interactions across 5 million websites and found that AI assistant crawlers used by tools such as ChatGPT now reach more sites even as companies restrict other forms of AI access.

Hostinger Analysis

Hostinger is a web host and also a no-code, AI agent-driven platform for building online businesses. The company said it analyzed anonymized website logs to measure how verified crawlers access sites at scale, allowing it to compare changes in how search engines and AI systems retrieve online content.

The published analysis shows that AI assistant crawlers expanded their reach across websites over a five-month period. Data was collected during three six-day windows in June, August, and November 2025.

OpenAI’s SearchBot increased coverage from 52 percent to 68 percent of sites, while Applebot (which indexes content to power Apple’s search features) doubled from 17 percent to 34 percent. During the same period, coverage by traditional search crawlers remained essentially constant. The data indicates that AI assistants are adding a new layer to how information reaches users rather than replacing search engines outright.

At the same time, the data shows that companies sharply reduced access for AI training crawlers. OpenAI’s GPTBot went from having access to 84 percent of websites in August to 12 percent by November. Meta’s ExternalAgent dropped from 60 percent to 41 percent coverage. These crawlers collect data over time to improve AI models and update their Parametric Knowledge, but many businesses are blocking them, either to limit how their data is used or out of concern over copyright infringement.
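To put the reported coverage figures on a common footing, the short Python sketch below computes the percentage-point change and the relative change for each crawler. The numbers are the ones quoted from Hostinger's analysis; the helper name `coverage_change` and the dictionary layout are illustrative assumptions, not part of Hostinger's methodology.

```python
# Coverage figures (percent of sites reached) as quoted in this article:
# (earlier measurement, later measurement).
reported = {
    "GPTBot": (84, 12),
    "Meta ExternalAgent": (60, 41),
    "OpenAI SearchBot": (52, 68),
    "Applebot": (17, 34),
}


def coverage_change(before, after):
    """Return (percentage-point change, relative change in percent)."""
    points = after - before
    relative = round(100 * (after - before) / before, 1)
    return points, relative


changes = {bot: coverage_change(b, a) for bot, (b, a) in reported.items()}

for bot, (points, relative) in changes.items():
    print(f"{bot}: {points:+d} points, {relative:+.1f}% relative")
```

Read this way, GPTBot lost 72 percentage points of coverage, a much larger shift than the 16 points gained by OpenAI's SearchBot.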

Parametric Knowledge

Parametric Knowledge, also known as Parametric Memory, is the information that is “hard-coded” into the model during training. It is called “parametric” because the knowledge is stored in the model’s parameters (the weights). Parametric Knowledge is long-term memory about entities, for example, people, things, and companies.

When a person asks an LLM a question, the LLM may recognize an entity like a business and then retrieve the associated vectors (facts) that it learned during training. So, when a business blocks a training bot from its website, it is keeping the LLM from learning about the business from its own content, which might not be the best thing for an organization that’s concerned about AI visibility.

Allowing an AI training bot to crawl a company website gives that company some control over what the LLM knows about it: what the company does, its branding, whatever is on the About Us page, and the products or services it offers. An informational site may also benefit from being cited in answers.

Businesses Are Opting Out Of Parametric Knowledge

Hostinger’s analysis shows that businesses are “aggressively” blocking AI training crawlers. Hostinger’s research doesn’t mention this, but the effect of blocking AI training bots is that businesses are essentially opting out of an LLM’s parametric knowledge. The LLM is prevented from learning directly from first-party content during training, which removes the site’s ability to tell its own story and forces the LLM to rely on third-party data or knowledge graphs.

Hostinger’s research shows:

“Based on tracking 66.7 billion bot interactions across 5 million websites, Hostinger uncovered a significant paradox:

Companies are aggressively blocking AI training bots, the systems that scrape content to build AI models. OpenAI’s GPTBot dropped from 84% to 12% of websites in three months.

However, AI assistant crawlers, the technology that ChatGPT, Apple, etc. use to answer customer questions, are expanding rapidly. OpenAI’s SearchBot grew from 52% to 68% of sites; Applebot doubled to 34%.”

A recent post on Reddit shows how blocking LLM access to content has become normalized and is understood as a way to protect intellectual property (IP).

The post starts with an initial question asking how to block AIs:

“I want to make sure my site is continued to be indexed in Google Search, but do not want Gemini, ChatGPT, or others to scrape and use my content.

What’s the best way to do this?”

Screenshot Of A Reddit Conversation

Later in that thread, someone asked whether the goal of blocking LLMs was to protect intellectual property, and the original poster confirmed that it was.

The person who started the discussion responded:

“We publish unique content that doesn’t really exist elsewhere. LLMs often learn about things in this tiny niche from us. So we need Google traffic but not LLMs.”

That may be a valid reason. A site that publishes unique instructional information about a software product, information that does not exist elsewhere, may want to block LLMs from training on its content. Otherwise the LLM will be able to answer those questions itself, removing the need to visit the site.

But for other sites with less unique content, like a product review and comparison site or an ecommerce site, it might not be the best strategy to block LLMs from adding information about those sites into their parametric memory.
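For sites that do decide to separate search crawlers from AI training crawlers, the usual mechanism is a robots.txt policy. The sketch below is a hypothetical example, not a recommendation: it uses Python's standard `urllib.robotparser` to check which crawlers a sample policy would admit. The user-agent tokens (`GPTBot`, `Google-Extended`, `meta-externalagent`, `OAI-SearchBot`) and the `example.com` URL are assumptions for illustration and should be confirmed against each vendor's current documentation. Note also that robots.txt is advisory, so it only affects crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: disallow AI training crawlers, allow everything else.
# The user-agent tokens are assumptions; verify them in each vendor's docs.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: *
Allow: /
"""


def allowed_bots(robots_txt, user_agents, url="https://example.com/guide"):
    """Map each user agent to True/False for whether the policy lets it fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, url) for ua in user_agents}


result = allowed_bots(
    ROBOTS_TXT,
    ["Googlebot", "OAI-SearchBot", "Applebot",
     "GPTBot", "Google-Extended", "meta-externalagent"],
)

for ua, ok in result.items():
    print(f"{ua}: {'allowed' if ok else 'blocked'}")
```

Under this sample policy, search and assistant crawlers fall through to the catch-all rule and remain allowed, while the three training-related tokens are blocked. Whether that trade-off makes sense depends on how unique the site's content is.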

Brand Messaging Is Lost To LLMs

As AI assistants answer questions directly, users may receive information without needing to visit a website. This can reduce direct traffic and limit the reach of a business’s pricing details, product context, and brand messaging. It’s possible that the customer journey ends inside the AI interface. Businesses that block LLMs from acquiring knowledge about their companies and offerings are then essentially relying on the search crawler and search index to fill that gap (and maybe that works?).

The increasing use of AI assistants affects marketing and extends into revenue forecasting. When AI systems summarize offers and recommendations, companies that block LLMs have less control over how pricing and value appear. Advertising efforts lose visibility earlier in the decision process, and ecommerce attribution becomes harder when purchases follow AI-generated answers rather than direct site visits.

According to Hostinger, some organizations are becoming more selective about which content is available to AI, especially AI assistants.

Tomas Rasymas, Head of AI at Hostinger, commented:

“With AI assistants increasingly answering questions directly, the web is shifting from a click-driven model to an agent-mediated one. The real risk for businesses isn’t AI access itself, but losing control over how pricing, positioning, and value are presented when decisions are made.”

Takeaway

Blocking LLMs from using website data for training should not be treated as the default position, even though many people feel real anger and annoyance at the idea of an LLM training on their content. It may be more useful to take a considered approach that weighs the benefits against the disadvantages, and to ask whether those disadvantages are real or only perceived.

Featured Image by Shutterstock/Lightspring



By Rose Milev

