Learning to Forget: Generative Models and the Shadow of the Past
Is there value in making models forget?
Can generative AI models forget? A driving concern of the major model owners, one to which troves of resources have been dedicated, is steering models away from producing “unsafe content”. Achieving a model that remains forthcoming about a wide variety of topics but handles delicate ones with care, and downright refuses to engage in illegal or dangerous activities, is a careful balancing act.
The problem with large language models is the implication of “large”: you’ve got to dump a huge amount of data into their training. Not just the complete works of Shakespeare, but everything else too. What such models learn, however, is deeply ingrained, if not irreversible. It would be great if you could just pop open the model’s lid and go into it like you would a traditional structured relational database and say, “okay, so here’s what I want you to forget, delete row 5 in column D.” The things generative models learn aren’t stored in neat receptacles like that. Rather, once a model has learned something, it becomes woven into an associative web and can’t be so easily plucked out. Scrubbing such models without completely retraining them, while potentially possible, is also conceivably risky.
Instead, ML engineers often have to enforce rules downstream of the model training phase to teach the bot what not to do with what it has learned. ChatGPT probably could tell you how to build a nuclear bomb; it’s just being prevented from telling you. These ethics rules, however, do not carry the absolute mandate of a programmer’s instructions in controlling a program’s behavior. Ways exist to get around them, which is why prompt injection attacks—wording something funny to trick the bot into doing what it shouldn’t—have added a whole new chapter in the spellbook of the hacker’s dark arts.
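The idea of a rule enforced downstream of the model can be sketched in a few lines. This is a deliberately toy illustration, not any vendor’s actual safety system: `model_generate` is a hypothetical stand-in for a real LLM call, and the blocklist is an assumption made up for the example.

```python
# Toy sketch of a downstream guardrail: the rule lives outside the model.
# The model still "knows" the answer; a wrapper refuses to surface it.

BLOCKED_TOPICS = ("nuclear bomb", "explosive synthesis")  # illustrative only
REFUSAL = "Sorry, I can't help with that."

def model_generate(prompt: str) -> str:
    # Hypothetical stand-in for the underlying model, which would answer anything.
    return f"Here is everything I know about {prompt}..."

def guarded_generate(prompt: str) -> str:
    # Enforce the rule after training: check the request, then decide
    # whether to pass it through to the model at all.
    lowered = prompt.lower()
    if any(topic in lowered for topic in BLOCKED_TOPICS):
        return REFUSAL
    return model_generate(prompt)

print(guarded_generate("the history of jazz"))
print(guarded_generate("how to build a nuclear bomb"))
```

The fragility is visible right in the sketch: rephrase the request so it dodges the pattern the rule checks for and the filter waves it through, which is the essence of a prompt injection attack.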