Control AI Crawlers: Stop Data Scraping and Get Cited
AEO Fundamentals

Arielle Phoenix
Mar 11, 2026 · 7 min read

TL;DR

  • AI companies use two types of bots: search bots for citations and training bots for data collection.
  • Use your `robots.txt` file to allow search bots (like OAI-SearchBot) and block training bots (like GPTBot).
  • The `llms.txt` file is an emerging way to set usage policies, but it is not yet widely enforced.
  • The only guaranteed way to block a bot is with a `Disallow` rule in `robots.txt`.
  • You want AI crawlers to visit your site for citations, but you must control which ones have access.

You want your content cited by ChatGPT and Perplexity. You do not want your proprietary data, customer lists, or unique content feeding their next model update without your permission. This is the central conflict for every site owner in 2026. The good news is you can have one without the other. You can get the benefit of AI visibility while setting clear boundaries on data usage.

The solution is not a total block. A blanket ban on all AI crawlers removes you from the new search ecosystem entirely, killing your chances of getting valuable citations. The key is selective access. You need to understand how to let AI crawl your site without scraping your data, and it starts with knowing which bots to welcome and which to turn away at the door.

How to Let AI Crawl Your Site Without Scraping Your Data

The most effective way to manage AI access is through your website's robots.txt file. This simple text file gives instructions to web crawlers. You can use it to specifically block bots that collect data for model training while allowing bots that retrieve information for live search answers.

This distinction is the most important concept to grasp. AI companies use different bots for different jobs.

  1. Search & Retrieval Bots: These bots power the answers you see in ChatGPT, Perplexity, and Gemini. They visit your site to find specific information to answer a user's prompt. Allowing them is critical for getting citations. Examples include OAI-SearchBot and Perplexity-User.
  2. Training & Data Collection Bots: These bots scrape the web on a massive scale to gather data for training future AI models. Blocking these bots prevents your content from being ingested into their training sets. Examples include GPTBot and Google-Extended.

Your goal is to allow the first group and block the second. This gives you the best of both worlds: your content is visible for citations, but protected from wholesale scraping.

The Two Types of AI Crawlers You Must Know

Not all bots are created equal. Some are essential for visibility in AI answers, while others are purely for data collection. Below is a breakdown of the most common AI crawlers you will see in your server logs.

| Vendor | Crawler User-Agent | Purpose | Allow or Block? |
| --- | --- | --- | --- |
| OpenAI | OAI-SearchBot | Powers real-time search results in ChatGPT. | Allow (essential for citations) |
| OpenAI | ChatGPT-User | Visits your site when a user specifically asks it to. | Allow |
| OpenAI | GPTBot | General web crawling for model training data. | Block (to prevent data scraping) |
| Perplexity | Perplexity-User | Real-time browsing to find and cite sources for answers. | Allow (essential for citations) |
| Perplexity | PerplexityBot | General search indexing for Perplexity's own database. | Allow |
| Anthropic | Claude-SearchBot | Powers real-time search results in Claude. | Allow (essential for citations) |
| Anthropic | ClaudeBot | General web crawling for model training data. | Block (to prevent data scraping) |
| Google | Google-Extended | Data collection for Gemini and Vertex AI training; does not affect Google Search ranking. | Block (to prevent data scraping) |
| Apple | Applebot-Extended | Data collection for Apple Intelligence AI training. | Block (to prevent data scraping) |

Managing these bots individually can be a full-time job. A dedicated plugin can automate the process by keeping an updated list of AI crawlers and managing your robots.txt file for you. The AI Crawler Log in AEO God Mode helps you see exactly which bots are visiting your site and what pages they access.

AEO God Mode — Free WordPress Plugin Get your site cited by ChatGPT, Perplexity, and Google AI Overviews. Install in under 5 minutes.
Download Free

Step 1: Master Your robots.txt for AI Traffic

Your robots.txt file is the first and most important tool for controlling bot access. It is a plain text file located at the root of your domain (e.g., yourdomain.com/robots.txt). The rules are simple.

To block a training bot, you add a `Disallow: /` rule for its specific user agent. This tells the bot it is not allowed to crawl any page on your site.

Here is an example robots.txt configuration that blocks common training bots while allowing search bots:

# Block AI Training Bots
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

# Explicitly allow AI search bots (they are allowed by default,
# but explicit rules make your intent unambiguous)
User-agent: OAI-SearchBot
Allow: /

User-agent: Perplexity-User
Allow: /

In this setup, you explicitly block the bots you don't want. Any bot not listed is allowed by default, including OAI-SearchBot, Perplexity-User, and traditional search crawlers like Googlebot. For a deeper dive into the different OpenAI bots, see our complete guide to GPTBot.
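Before deploying rules like these, it helps to verify they behave the way you expect. As a quick sanity check, Python's standard-library `urllib.robotparser` can evaluate a robots.txt ruleset offline (the `yourdomain.com` URLs below are placeholders):

```python
from urllib.robotparser import RobotFileParser

# The blocking rules from the example above, as raw robots.txt lines.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Training bots with a Disallow: / rule may not fetch any page.
print(parser.can_fetch("GPTBot", "https://yourdomain.com/pricing"))       # False
print(parser.can_fetch("ClaudeBot", "https://yourdomain.com/blog/post"))  # False

# Bots with no matching rule fall through to the default and are allowed.
print(parser.can_fetch("OAI-SearchBot", "https://yourdomain.com/blog/post"))  # True
```

Running this confirms the key behavior described above: anything you do not explicitly disallow, including OAI-SearchBot and Googlebot, stays allowed.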

Pro Tip
Check your server logs or use a crawler logging tool regularly. This is the only way to verify which bots are actually visiting your site and whether they are respecting your `robots.txt` rules. Don’t just set the rules and forget them; verify they are working.
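If you want to do this check by hand, a few lines of Python are enough to tally AI crawler visits from raw access logs. This is a minimal sketch: the function name, the sample log lines, and the combined-log format are illustrative assumptions, and real logs should be matched on the user-agent field rather than the whole line if you need precision.

```python
from collections import Counter

# Known AI crawler tokens to look for (from the table above).
AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "PerplexityBot",
    "Perplexity-User", "ClaudeBot", "Claude-SearchBot",
    "Google-Extended", "Applebot-Extended",
]

def count_ai_bot_hits(log_lines):
    """Tally hits per AI crawler across raw access-log lines."""
    hits = Counter()
    for line in log_lines:
        for bot in AI_BOTS:
            if bot in line:
                hits[bot] += 1
    return hits

# Two fabricated lines in Apache/Nginx combined log format:
sample = [
    '1.2.3.4 - - [11/Mar/2026:10:00:00 +0000] "GET /pricing HTTP/1.1" 200 512 "-" "GPTBot/1.0"',
    '5.6.7.8 - - [11/Mar/2026:10:01:00 +0000] "GET /blog HTTP/1.1" 200 2048 "-" "OAI-SearchBot/1.0"',
]
print(count_ai_bot_hits(sample))  # one hit each for GPTBot and OAI-SearchBot
```

If a bot you blocked in `robots.txt` keeps showing up in these tallies, that bot is ignoring your rules and you may need to block it at the server or firewall level instead.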

Step 2: Use llms.txt to Set Usage Policies

While robots.txt is for access control, a newer file called llms.txt is emerging to set usage policies. It's designed to tell AI systems how they are permitted to use your content after they crawl it. This file is part of an open community proposal and lives at the root of your domain, just like robots.txt.

The goal of llms.txt is to provide more detailed instructions than a simple allow/block rule: for example, whether your content may be used for training, whether citations require attribution, and what topics your site covers.

Here is a simple example of an llms.txt file:

# General Policies
User-agent: *
Usage-policy: Training: disallow, Citations: allow with attribution
Attribution-format: "Source: [URL]"

# Specific instructions for certain bots
User-agent: Claude-SearchBot
Content-focus: Technology, WordPress plugins

This file tells all bots they can cite your content with attribution but cannot use it for training. It's a more detailed approach than the simple on/off switch of robots.txt. You can learn more about the differences in our guide on llms.txt vs. robots.txt.
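To make the structure of that file concrete, here is a hypothetical parser sketch. The directive style above is this article's illustration rather than a ratified format, so `parse_llms_txt` and its output shape are assumptions, not a standard API:

```python
def parse_llms_txt(text):
    """Parse the directive-style llms.txt sketch above into
    {user_agent: {field: value}}. Illustrative only; llms.txt
    is an emerging proposal, not an enforced standard."""
    policies = {}
    current = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blanks
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, value = (part.strip() for part in line.split(":", 1))
        if key.lower() == "user-agent":
            current = value
            policies.setdefault(current, {})
        elif current is not None:
            policies[current][key] = value
    return policies

example = """\
User-agent: *
Usage-policy: Training: disallow, Citations: allow with attribution
Attribution-format: "Source: [URL]"

User-agent: Claude-SearchBot
Content-focus: Technology, WordPress plugins
"""
print(parse_llms_txt(example)["*"]["Usage-policy"])
# Training: disallow, Citations: allow with attribution
```

The point of the sketch is the data model: per-user-agent policy blocks, read top to bottom, with a wildcard `*` block as the default.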

The Data Problem: Is llms.txt Actually Enforced?

This is the critical question. As of early 2026, the answer is mostly no. The llms.txt file is a proposal, not an enforced standard. Audits show that major AI crawlers do not consistently request or obey this file. Its value is as forward-looking infrastructure.

The only method with technical enforcement is robots.txt. Reputable companies like OpenAI and Google follow robots.txt directives to avoid legal issues. An llms.txt file is a good signal to send, but you cannot rely on it to protect your data today. You must use robots.txt to enforce your access rules. For a data-driven look at this, see our analysis on whether llms.txt is worth implementing in 2026.

Pros
  • Allows beneficial AI bots for citations and visibility
  • Protects proprietary content from model training
  • Gives you control over your intellectual property
  • Helps you focus crawl budget on high-value pages
  • Can improve your chances of getting cited in AI answers
Cons
  • Requires ongoing monitoring of new AI user agents
  • A misconfigured `robots.txt` can block all AI traffic
  • `llms.txt` is not yet a widely enforced standard
  • Some smaller, less reputable AI bots may ignore rules

Getting Cited Is The Goal

Ultimately, you are managing bot access for a reason: you want to get AI to mention your business. Blocking all bots is a mistake. Allowing all bots is risky. The correct strategy is selective access, using robots.txt as your gatekeeper.

You allow the bots that put your brand into AI-generated answers and block the ones that just take your data. This balanced approach lets you participate in the new world of AI search safely and effectively.

Frequently Asked Questions

What is the difference between GPTBot and OAI-SearchBot?

GPTBot is OpenAI's data collection crawler used for training models like GPT-5. OAI-SearchBot is their search crawler that finds real-time information to answer questions in ChatGPT. You should block GPTBot and allow OAI-SearchBot.

Will blocking AI training bots hurt my chances of getting cited?

No. Citations in AI answers come from search and retrieval bots (like OAI-SearchBot), not training bots. Blocking training bots like GPTBot or Google-Extended has no negative impact on your visibility in AI search results.

How can I see which AI bots are visiting my website?

You can check your raw server access logs, but these are often difficult to parse. A better way is to use a WordPress plugin like AEO God Mode, which includes a dedicated AI Crawler Log to show you every visit from every known AI bot.

Can a plugin manage these robots.txt rules for me?

Yes. The AEO God Mode plugin includes an AI Crawler Allowlist that automatically manages your robots.txt rules. You can toggle 18 different AI crawlers on or off, and the plugin writes the correct rules for you, ensuring you block training bots while allowing search bots.

Is llms.txt an official standard?

No. As of early 2026, llms.txt is an emerging community proposal, not a ratified or enforced standard. While it's a good idea to implement it for the future, you must still rely on robots.txt for guaranteed access control today.
Written by
Arielle Phoenix
AI SEO at AEO God Mode

Helping you get ahead of the curve.