“Self-fulfilling Misalignment Data Might Be Poisoning Our AI Models” By TurnTrout LessWrong (Curated & Popular) podcast

“Self-fulfilling misalignment data might be poisoning our AI models” by TurnTrout

About a year ago, 1:51


This is a link post. Your AI's training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, then the model is more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechanisms, then frontier labs should act quickly to break the self-fulfilling prophecy.
Research I want to see
Each of the following experiments assumes positive signals from the previous ones:

Create a dataset and use it to measure existing models
Compare mitigations at a small scale
An industry lab running large-scale mitigations

Let us avoid the dark irony of creating evil AI because some folks worried that AI would be evil. If self-fulfilling misalignment has a strong [...]
The original text contained 1 image which was described by AI.
---
First published:
March 2nd, 2025
Source:
https://www.lesswrong.com/posts/QkEyry3Mqo8umbhoK/self-fulfilling-misalignment-data-might-be-poisoning-our-ai
---
Narrated by TYPE III AUDIO.
---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
