By now, you’ve probably heard about ChatGPT, the AI chatbot developed by OpenAI that lets you have human-like conversations with a ‘genie’ that has an answer to virtually any question. Perhaps you are among the one million people who tried the chatbot within its first week of release in early December. If that’s the case, then you are fully aware of just how frighteningly good ChatGPT can be.
Despite its often nonfactual output and the nonchalance with which it delivers what is essentially conversational bullshit, ChatGPT can be really powerful. It can write and debug code, pass high school-level tests, do homework, explain concepts and tutor, give smart answers to difficult questions on niche topics, and more.
This has made a lot of people nervous. Academia could be flooded with AI-generated plagiarism, and bad actors could use similar systems to churn out fake social media posts and comments convincing enough to manipulate millions of people. Did a human or an AI write that? Is it even possible to tell?
Those are hard questions with no definitive answers, but some researchers at OpenAI seem confident that they can spot AI-generated content. Scott Aaronson, a computer science professor at the University of Texas at Austin and a guest researcher at OpenAI, recently gave a lecture in which he revealed how his team is planning to flag AI-generated text content using “statistical watermarking”.
The tug of war over AI-generated content
Unlike an AI-generated image, or information-rich media like audio, where secret bits of data can hide an invisible watermark, digital text is essentially just a string of characters on a screen. That makes detecting AI-generated text incredibly challenging — but not necessarily impossible. The trick Aaronson and colleagues plan to implement involves making the model’s word choices pseudorandom: random enough to serve as a hidden signature, yet natural enough to preserve readability and not tip people off.
“GPT can solve high-school-level math problems that are given to it in English. It can reason you through the steps of the answer. It’s starting to be able to do nontrivial math competition problems. It’s on track to master basically the whole high school curriculum, maybe followed soon by the whole undergraduate curriculum,” Aaronson said during a lecture hosted by the Effective Altruist club at UT Austin around a month ago.
“If you turned in GPT’s essays, I think they’d get at least a B in most courses. Not that I endorse any of you doing that! But yes, we are about to enter a world where students everywhere will at least be sorely tempted to use text models to write their term papers. That’s just a tiny example of the societal issues that these things are going to raise.”
ChatGPT is a giant neural network built on a so-called transformer model, trained on a large fraction of the wealth of human knowledge available on the open web, reportedly up to 2021. The training essentially consists of playing the same game over and over again: predict which word comes next in this string of text. That’s it. The funny thing is that after trillions of rounds of this game, the AI becomes so good at plausibly predicting the next word that it can fool people into thinking it’s human.
Each of these predicted words in a string of text is called a token. At every step, the model assigns a score to each candidate token; GPT then picks one — the highest-scoring candidate, or a sample weighted by those scores — appends it to the text, and repeats the process to produce the next token.
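To make that loop concrete, here is a minimal, hypothetical sketch of token-by-token generation in Python. The `model` callable and the shape of its output are illustrative assumptions, not OpenAI’s actual interface.

```python
# Minimal sketch of next-token generation (illustrative assumptions, not OpenAI's code).
import random

def generate(model, prompt_tokens, n_new_tokens):
    """Repeatedly ask the model to score candidate next tokens and append one."""
    tokens = list(prompt_tokens)
    for _ in range(n_new_tokens):
        # Assumption: model(tokens) returns a dict mapping each candidate token
        # to its score (probability) for the next position.
        probs = model(tokens)
        candidates, weights = zip(*probs.items())
        # Sample in proportion to the scores; greedy decoding would instead
        # take max(probs, key=probs.get).
        next_token = random.choices(candidates, weights=weights, k=1)[0]
        tokens.append(next_token)
    return tokens
```

The sampling step is what matters for what follows: ordinary generation injects a little randomness into each token choice, and that is exactly the slack a watermark can exploit.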
The proposed detection scheme replaces that sampling step with deliberate pseudorandomness derived from a secret cryptographic key, which still makes the output both unique and varied. If you hold the key, you can tell with a high degree of confidence whether an essay, article, or answer to a question on Quora was generated by the AI or written by a real person.
Using this keyed pseudorandom function, any string of text, regardless of length, could be analyzed to check whether it tends to maximize the function’s output. In other words, the method scores the pieces of a text to see whether they match what the watermarked AI would have generated.
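Here is a rough sketch of what keyed, watermarked sampling might look like, loosely based on Aaronson’s public description. The secret key, the hashing scheme, the n-gram length, and the function names are all illustrative assumptions; OpenAI’s actual prototype has not been published.

```python
# Hedged sketch of keyed watermarked sampling (illustrative, not OpenAI's implementation).
import hashlib

SECRET_KEY = b"held-server-side"  # hypothetical stand-in for OpenAI's secret key

def keyed_score(prev_tokens, candidate, key=SECRET_KEY, n=4):
    """Pseudorandom number in (0, 1) derived from the key and the n-gram ending
    with `candidate`. Without the key, these values look like pure noise."""
    context = "|".join(list(prev_tokens[-(n - 1):]) + [candidate]).encode()
    digest = hashlib.sha256(key + context).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2 ** 64 + 2)

def watermarked_choice(probs, prev_tokens):
    """Replace the random sampling step: pick the candidate maximizing r ** (1/p).
    The output is still distributed like ordinary sampling, but the chosen tokens
    systematically land on high keyed scores -- the hidden signature."""
    return max(
        probs,
        key=lambda tok: keyed_score(prev_tokens, tok) ** (1.0 / max(probs[tok], 1e-12)),
    )
```

The r ** (1/p) trick is one way to bias token selection toward high keyed scores without changing the overall output distribution, which would be consistent with Aaronson’s claim that the watermark doesn’t hurt text quality.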
Will it work, though? First of all, the key would be owned by OpenAI and hosted server-side. This means that other services that generate AI text would not be able to access it and would have to implement their own watermarks. Perhaps in time, and with collaboration, this won’t actually be that much of an issue.
But setting this technicality aside, it’s not clear how well such a watermark would hold up in practice. Sure, if you copy/paste an essay generated by ChatGPT word for word, you might get caught easily. But what if you edit it for just a couple of minutes, swapping the odd word for a synonym or rewriting a sentence here and there? Replacing a few words likely won’t be enough to destroy the statistical watermark: because the signal is averaged over the whole text, even lightly modified AI-generated output could still be flagged. Heavily rewritten AI-generated content, however, is probably impossible to detect.
“It seems to work pretty well—empirically, a few hundred tokens seem to be enough to get a reasonable signal that yes, this text came from GPT. In principle, you could even take a long text and isolate which parts probably came from GPT and which parts probably didn’t,” Aaronson said.
“Now, this can all be defeated with enough effort. For example, if you used another AI to paraphrase GPT’s output—well okay, we’re not going to be able to detect that. On the other hand, if you just insert or delete a few words here and there, or rearrange the order of some sentences, the watermarking signal will still be there. Because it depends only on a sum over n-grams, it’s robust against those sorts of interventions.”
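Aaronson’s remark that the signal “depends only on a sum over n-grams” suggests a detector along the lines of the sketch below, which continues the hypothetical code above (it reuses `keyed_score` and `SECRET_KEY`). The fixed threshold is an illustrative assumption; a real detector would presumably use a calibrated statistical test.

```python
def looks_watermarked(tokens, key=SECRET_KEY, n=4, threshold=0.55):
    """Average the keyed score over every n-gram in the suspect text. Human
    writing should average around 0.5; watermarked text should skew higher.
    Because each n-gram is scored independently, inserting, deleting, or
    reordering a few words only perturbs a handful of the averaged terms."""
    scores = [
        keyed_score(tokens[:i], tokens[i], key=key, n=n)  # from the sampling sketch
        for i in range(n - 1, len(tokens))
    ]
    return sum(scores) / len(scores) > threshold
```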
As in many other instances, AI technology seems to be at least a few steps ahead of our attempts to rein it in. Aaronson remains optimistic, though. He believes the watermarking method his team is working on performs well and doesn’t degrade the quality of the generated text.
“The hope is that this can be rolled out with future GPT releases. We’d love to do something similar for DALL-E—that is, watermarking images, not at the pixel level (where it’s too easy to remove the watermark) but at the ‘conceptual’ level, the level of the so-called CLIP representation that’s prior to the image. But we don’t know if that’s going to work yet,” the researcher said.