I’m sure everyone is sick of hearing about how ChatGPT is going to change the world - I know I am. And I’m also sure everyone feels that at least some of the hype is overblown - I know I do. But regardless of what is true and what is hype, there’s no denying that generative AI models are changing the world, and not just in terms of the way we work. They’re also opening up all kinds of new ethical and legal dilemmas about the value of human-created work and what is considered fair game to be used by large tech corporations. Even if you hate everything AI-related and aren’t a STEM person, you need to understand the technology and frameworks powering these models so you can accurately judge what these companies are doing. That understanding lets you make decisions about your own work and about if/how you choose to take political or legal action on this topic.
My generative AI-themed topic this month was inspired by two articles - this article by Harvard Law Today that discusses the New York Times’ lawsuit against OpenAI for copyright infringement, and this study from the Columbia Journalism Review that examines the issues around generative AI models citing paywalled news stories. I recommend reading both in full, but the TL;DR of the Harvard Law Today piece is that the lawsuit the Times is bringing against OpenAI hinges on what counts as “copyright infringement” - is it fair use to copy the paper’s entire archive to use for AI training? After all, there would be nothing stopping you from going to the library, scanning every book there, and using it to train your model (other than the fact that doing so is totally unreasonable). There are also more complicated legal questions about whether the model itself is a derivative work.
The Columbia article is more of a scientific study focusing on the way generative AI models cite newspaper articles. Basically, the researchers copy an exact quote from a newspaper article and ask the chatbot to identify the source. In theory, if the chatbot hadn’t seen that article, it should decline to answer, and if it had seen the article, it should successfully return the article’s URL as well as the traditional citation information, like the authors, title, and so on. As I assume is clear because I’m writing about this, that isn’t what happens. The majority of the bots hallucinate, or return answers that aren’t true. Even more interestingly, in some cases the bots return correct answers for information they shouldn’t have access to. But how do we know that a bot shouldn’t have access to that information?
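To make that kind of citation test concrete, here’s a minimal sketch in Python. The `ask_chatbot` function is a hypothetical placeholder for whichever chatbot API you want to test (the study covered several), and the quote, expected URL, and string-matching logic are my own illustrative assumptions, not the researchers’ actual code.

```python
# Sketch of a citation test like the one described above.
# ask_chatbot() is a hypothetical placeholder for a real chatbot API call;
# the quote and expected URL below are made-up examples.

def ask_chatbot(prompt: str) -> str:
    """Placeholder: send `prompt` to a chatbot and return its text reply."""
    raise NotImplementedError("wire this up to the chatbot you want to test")

def citation_test(quote: str, expected_url: str) -> str:
    prompt = (
        "The following is an exact quote from a news article. "
        "Identify the article: give the headline, author(s), publisher, "
        "publication date, and URL. If you do not know, say so.\n\n"
        f'"{quote}"'
    )
    reply = ask_chatbot(prompt)

    if expected_url in reply:
        return "correct citation"        # returned the right article URL
    if "do not know" in reply.lower() or "cannot" in reply.lower():
        return "declined"                # the honest answer if it hasn't seen the article
    return "possible hallucination"      # confident-sounding but unverified citation

# Example call (hypothetical data):
# citation_test(
#     quote="...exact sentence lifted from a paywalled article...",
#     expected_url="https://www.example-news.com/2024/03/some-story.html",
# )
```

The interesting cases are the last two branches: a bot that declines is behaving honestly, while a bot that confidently returns a wrong or unverifiable citation is hallucinating.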
In order to answer that question, you need to understand how the training for these models works. The tech company creates something called a web crawler (sometimes also called a web scraper). The crawler combs through the Internet for information it thinks will be relevant for model training, then “scrapes,” or copies, that data. The data is then processed and fed to the model for training. However, in theory, the crawler isn’t allowed to just take anything it wants from the entire Internet. In 1994, a man named Martijn Koster developed the Robots Exclusion Protocol to keep automated crawlers from overwhelming websites with requests. It is simply a file called robots.txt, placed in a standardized location in a website’s directory, that lists which bots are banned from accessing which parts of the site. Theoretically, every responsible bot should check whether that file exists and whether its name appears on the list, and if it does, it shouldn’t use the information on the site. However, there are a few problems with this, the biggest being that the standard is completely voluntary and there’s no way to enforce it. In addition, a company can easily circumvent the spirit of the standard by simply creating a new scraper each time its current one is added to the robots.txt. Plus, the spirit of the Robots Exclusion Protocol conflicts with the spirit of other projects such as the Internet Archive, which explicitly states that it ignores robots.txt in its attempt to preserve all data on the Internet. This gives tech companies another loophole - they can skip crawling websites like the New York Times that explicitly ban them in their robots.txt and just scrape that same information from the Internet Archive instead.
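To show what that voluntary check looks like in practice, here’s a short sketch using Python’s standard-library robots.txt parser. The bot name and URLs are hypothetical placeholders; a well-behaved crawler would run something like this before fetching each page.

```python
# Sketch of the robots.txt check a well-behaved crawler is supposed to perform.
# "ExampleNewsBot" and the URLs are hypothetical placeholders.
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleNewsBot"

def allowed_to_fetch(page_url: str, robots_url: str) -> bool:
    """Return True if robots.txt permits USER_AGENT to fetch page_url."""
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # downloads and parses the site's robots.txt file
    return parser.can_fetch(USER_AGENT, page_url)

if __name__ == "__main__":
    page = "https://www.example.com/2024/03/some-article.html"
    robots = "https://www.example.com/robots.txt"
    if allowed_to_fetch(page, robots):
        print("robots.txt allows this fetch - proceed politely")
    else:
        print("robots.txt disallows this fetch - a responsible bot stops here")
```

Note that nothing here is enforced: a crawler is free to skip this check entirely or ignore the result, which is exactly the problem described above.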
So how are these two articles connected? I’m not a lawyer, and I’m not going to pretend to be an expert on the legal merits of the case. However, as someone with a tech background who has grown up with the digital world, I think it’s obvious that we need new rules and regulations around how tech companies can access our information. There’s a huge difference between me writing a book, which is then bought by a library, which is then checked out by someone who takes copious notes on it and creates their own work based on my research, and a multi-billion dollar corporation that pays thousands of engineers to build a bot that takes millions of terabytes of data and uses it for profit. I’m not saying that we have to ban all tech companies from accessing all data, but the original creator should have some say in the matter. This brings me to my point - I believe we could reach a happy medium by creating laws based on the Robots Exclusion Protocol. One loophole could be closed by adding language to the standard that bans all generative AI crawlers. Companies and people could choose to add it to their websites if they so desire, and an agency could be founded to investigate claims of corporate misconduct. Penalties could be imposed on companies found breaking the law, and standard legal documents could be created that let website owners determine whether and how tech companies can use their data. If a tech company really wants to access a particular website, it would have to make a deal with the company that controls it, similar to how OpenAI is currently crafting deals with several news organizations. A special carve-out would have to be considered for archival programs. I’d favor something strict, where AI companies aren’t allowed any access to archived data at all, or where the archival programs are bound by the same rules as the other tech companies. Like any new territory, I’m sure this would take a lot of time to iron out and get working correctly. However, we need to accept that the tech companies aren’t going away, and we can’t let them have unfettered access to everything they want.
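For illustration, here’s roughly what blocking AI crawlers in robots.txt already looks like today, and what the blanket rule proposed above might look like. GPTBot is the user agent OpenAI publicly documents for its crawler; the second group is a hypothetical token of my own invention, not part of any current standard.

```
# Block OpenAI's documented crawler (a rule site owners can already use today)
User-agent: GPTBot
Disallow: /

# Hypothetical blanket rule of the kind proposed above - NOT part of any
# current standard, shown only to illustrate the idea
User-agent: AI-Training-Crawlers
Disallow: /
```

The technical side of this is trivial; the hard part, as discussed above, is giving rules like these actual legal weight.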