Vai offline con l'app Player FM !
AF - Does robustness improve with scale? by ChengCheng
Fetch error
Hmmm there seems to be a problem fetching this series right now. Last successful fetch was on September 26, 2024 16:04 ()
What now? This series will be checked again in the next hour. If you believe it should be working, please verify the publisher's feed link below is valid and includes actual episode links. You can contact support to request the feed be immediately fetched.
Manage episode 430778358 series 2997284
Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to what extent can scale help solve robustness? In this post, we explore this question in the classification setting: predicting the binary label of a text input.
We find that scale alone does little to improve model robustness, but that larger models benefit more from defenses such as adversarial training than do smaller models.
We study models in the classification setting as there is a clear notion of "correct behavior": does the model output the right label? We can then naturally define robustness as the proportion of the attacked dataset that the model correctly classifies. We evaluate models on tasks such as spam detection and movie sentiment classification.
We adapt pretrained foundation models for classification by replacing the generative model's unembedding layer with a randomly initialized classification head, and then fine-tune the models on each task.
We focus on adversarial-suffix style attacks: appending an adversarially chosen prompt to a benign prompt in an attempt to cause the model to misclassify the input, e.g., classify a spam email as not-spam. We consider two attacks: the state-of-the-art Greedy Coordinate Gradient method (Zou et al., 2023), and a baseline random token attack. This simple threat model has the advantage of being unlikely to change the semantics of the input.
For example, a spam email is still spam even if a handful of tokens are appended to it. Of course, attackers are not limited to such a simple threat model: studying more open-ended threat models (such as rephrasing the prompt, or replacing words with synonyms) and corresponding attack methods (such as LLM generated adversarial prompts) is an important direction that we hope to pursue soon in future work.
For more information, see our blog post or paper.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
2447 episodi
Fetch error
Hmmm there seems to be a problem fetching this series right now. Last successful fetch was on September 26, 2024 16:04 ()
What now? This series will be checked again in the next hour. If you believe it should be working, please verify the publisher's feed link below is valid and includes actual episode links. You can contact support to request the feed be immediately fetched.
Manage episode 430778358 series 2997284
Adversarial vulnerabilities have long been an issue in various ML systems. Large language models (LLMs) are no exception, suffering from issues such as jailbreaks: adversarial prompts that bypass model safeguards. At the same time, scale has led to remarkable advances in the capabilities of LLMs, leading us to ask: to what extent can scale help solve robustness? In this post, we explore this question in the classification setting: predicting the binary label of a text input.
We find that scale alone does little to improve model robustness, but that larger models benefit more from defenses such as adversarial training than do smaller models.
We study models in the classification setting as there is a clear notion of "correct behavior": does the model output the right label? We can then naturally define robustness as the proportion of the attacked dataset that the model correctly classifies. We evaluate models on tasks such as spam detection and movie sentiment classification.
We adapt pretrained foundation models for classification by replacing the generative model's unembedding layer with a randomly initialized classification head, and then fine-tune the models on each task.
We focus on adversarial-suffix style attacks: appending an adversarially chosen prompt to a benign prompt in an attempt to cause the model to misclassify the input, e.g., classify a spam email as not-spam. We consider two attacks: the state-of-the-art Greedy Coordinate Gradient method (Zou et al., 2023), and a baseline random token attack. This simple threat model has the advantage of being unlikely to change the semantics of the input.
For example, a spam email is still spam even if a handful of tokens are appended to it. Of course, attackers are not limited to such a simple threat model: studying more open-ended threat models (such as rephrasing the prompt, or replacing words with synonyms) and corresponding attack methods (such as LLM generated adversarial prompts) is an important direction that we hope to pursue soon in future work.
For more information, see our blog post or paper.
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
2447 episodi
All episodes
×Benvenuto su Player FM!
Player FM ricerca sul web podcast di alta qualità che tu possa goderti adesso. È la migliore app di podcast e funziona su Android, iPhone e web. Registrati per sincronizzare le iscrizioni su tutti i tuoi dispositivi.