Defending Against OpenAI's Crawler


What is robots.txt?

The robots.txt file is a plain-text file placed at the root of a website’s server to tell web robots (also known as web crawlers or spiders) how to interact with the site’s content. It is used to indicate which parts of the website should, and should not, be crawled and indexed by search engines.
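As an illustration, a minimal robots.txt might look like this (the paths are made up for the example): each User-agent block names a crawler, or * for all of them, and each Disallow line lists a path that crawler is asked not to fetch.

User-agent: *
Disallow: /drafts/
Disallow: /admin/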

Recently, some web nerds have reported, based on their traffic logs, that GPTBot, OpenAI’s crawler, has been crawling their websites. Why that is a problem is less a technical question than an ethical and privacy one.
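If you want to check your own logs, here is a minimal sketch in Python, assuming the common nginx/Apache combined log format; the log path is a placeholder and you should adjust it for your server. It simply counts which pages GPTBot has requested, based on the user-agent string.

from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # placeholder path, adjust for your setup

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        if "GPTBot" not in line:  # combined format puts the user-agent in the last quoted field
            continue
        try:
            # The request is the first quoted field, e.g. "GET /post.html HTTP/1.1"
            requested_path = line.split('"')[1].split()[1]
        except IndexError:
            continue
        hits[requested_path] += 1

for path, count in hits.most_common(10):
    print(f"{count:6d}  {path}")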

Problem

OpenAI and Google use data available on the web, without asking anyone for consent, to train their AI models, then monetize that work by putting their services like GPT & Bard behind a paywall.

Secondly, after sucking up the content from your website, the chatbots built on these models answer people’s queries without linking back to your website, without telling users how reliable the output is, and without saying where it came from in the first place.

Action Part

Add this to your website’s robots.txt file to discourage GPTBot from crawling your content:

User-agent: GPTBot
Disallow: /

Ideally, this prevents the GPTBot user-agent from crawling or indexing any content under the root “/” directory or its subdirectories. I say ideally because there is no guarantee that the crawler will respect the rule, but it is better to have a defence in place than not. Here is my robots.txt file.
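If you want to sanity-check that the file you deployed really disallows GPTBot, Python’s standard library ships a robots.txt parser. A minimal sketch, with example.com standing in for your own domain:

from urllib.robotparser import RobotFileParser

# example.com is a placeholder; point this at your own site
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()  # fetch and parse the live robots.txt

# False means the file asks GPTBot not to fetch this URL
print(robots.can_fetch("GPTBot", "https://example.com/some-post/"))

Of course, this only tells you what the file says; whether the crawler actually honours it is another matter.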

Documentation

Reply via mail
