Collin Burns On Discovering Latent Knowledge In Language Models Without Supervision

Duration: 2:34:39
 
Content provided by Michaël Trazzi. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by Michaël Trazzi or their podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described here: https://it.player.fm/legal.

Collin Burns is a second-year ML PhD student at Berkeley, working with Jacob Steinhardt on making language models honest, interpretable, and aligned. In 2015 he broke the Rubik's Cube world record, and he's now back with "Discovering Latent Knowledge in Language Models Without Supervision", a paper on how diverse knowledge represented in large language models can be recovered without any supervision.
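For readers curious about the core idea before listening, here is a minimal sketch (not the paper's code; names, shapes, and the random stand-in data are illustrative) of the kind of confident-and-consistent probe discussed in the episode: given hidden states for each statement phrased as true and as false, a small probe is trained, with no labels at all, so that the two probabilities are consistent (they sum to one) and confident (not both stuck at 0.5).

import torch
import torch.nn as nn

# Hypothetical stand-ins: in practice these would be hidden states extracted
# from a language model for each statement phrased as "true" (x_pos) and
# phrased as "false" (x_neg).
hidden_dim, n_examples = 768, 256
x_pos = torch.randn(n_examples, hidden_dim)
x_neg = torch.randn(n_examples, hidden_dim)

# A linear probe mapping a hidden state to a probability of "true".
probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(1000):
    p_pos, p_neg = probe(x_pos), probe(x_neg)
    # Consistency: a statement and its negation should get probabilities summing to 1.
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: penalize the degenerate answer p_pos = p_neg = 0.5.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()
    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

At test time, the probe's output on a statement's hidden states (or an average over the contrast pair) serves as an unsupervised guess at whether the model represents the statement as true; see the linked paper for the actual method and evaluation.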

Transcript: https://theinsideview.ai/collin

Paper: https://arxiv.org/abs/2212.03827

Lesswrong post: https://bit.ly/3kbyZML

Host: https://twitter.com/MichaelTrazzi

Collin: https://twitter.com/collinburns4

OUTLINE

(00:22) Intro

(01:33) Breaking The Rubik's Cube World Record

(03:03) A Permutation That Happens Maybe 2% Of The Time

(05:01) How Collin Became Convinced Of AI Alignment

(07:55) Was Minerva Just Low-Hanging Fruit On MATH From Scaling?

(12:47) IMO Gold Medal By 2026? How To Update From AI Progress

(17:03) Plausibly Automating AI Research In The Next Five Years

(24:23) Making LLMs Say The Truth

(28:11) Lying Is Already Incentivized, As We Have Seen With Diplomacy

(32:29) Mind Reading On 'Brain Scans' Through Logical Consistency

(35:18) Misalignment, Or Why One Does Not Simply Prompt A Model Into Being Truthful

(38:43) Classifying Hidden States, Maybe Using Truth Features Represented Linearly

(44:48) Building A Dataset For Using Logical Consistency

(50:16) Building A Confident And Consistent Classifier That Outputs Probabilities

(53:25) Discovering Representations Of The Truth From Just Being Confident And Consistent

(57:18) Making Models Truthful As A Sufficient Condition For Alignment

(59:02) Classification From Hidden States Outperforms Zero-Shot Prompting Accuracy

(01:02:27) Recovering Latent Knowledge From Hidden States Is Robust To Incorrect Answers In Few-Shot Prompts

(01:09:04) Would A Superhuman GPT-N Predict Future News Articles?

(01:13:09) Asking Models To Optimize Money Without Breaking The Law

(01:20:31) Training Competitive Models From Human Feedback That We Can Evaluate

(01:27:26) Alignment Problems On Current Models Are Already Hard

(01:29:19) We Should Have More People Working On New Agendas From First Principles

(01:37:16) Towards Grounded Theoretical Work And Empirical Work Targeting Future Systems

(01:41:52) There Is No True Unsupervised: Autoregressive Models Depend On What A Human Would Say

(01:46:04) Simulating Aligned Systems And Recovering The Persona Of A Language Model

(01:51:38) The Truth Is Somewhere Inside The Model, Differentiating Between Truth And Persona Bit By Bit Through Constraints

(02:01:08) A Misaligned Model Would Have Activations Correlated With Lying

(02:05:16) Exploiting Similar Structure To Logical Consistency With Unaligned Models

(02:07:07) Aiming For Honesty, Not Truthfulness

(02:11:15) Limitations Of Collin's Paper

(02:14:12) The Paper Does Not Show The Complete Final Robust Method For This Problem

(02:17:26) Humans Will Be 50/50 On Superhuman Questions

(02:23:40) Asking Yourself "Why Am I Optimistic" and How Collin Approaches Research

(02:29:16) Message To The ML And Cubing Audience

