Mitigating Toxicity in Dialogue Agents through Adversarial Reinforcement Learning

Research output: Contribution to journal › Conference article › peer-review

Abstract

Large Language Models (LLMs) have revolutionized dialogue agents, but they still suffer from biases, inconsistencies, and factual inaccuracies. This paper addresses toxicity, a critical aspect of the "Diversity, non-discrimination, and fairness" pillar of Trustworthy AI, in dialogue agents. We propose a methodology inspired by InstructGPT and ChatGPT that mitigates toxicity in chatbots by incorporating toxicity detection tools from industry leaders, such as Microsoft and Google Jigsaw, into a reward model. We further extend the reward model with ToxDialogDefender, a context-aware toxic language identification model we developed. To evaluate our approach, we curate a dataset of 1.5 million comments, of which 14.13% serve as successful adversarial examples, to induce toxicity in the BlenderBot 1 90M model. While our primary focus is on BlenderBot 1, the approach applies to models with similar Seq2Seq architectures. Experimental results demonstrate a substantial reduction in toxicity, from 24% to 5%, as validated by a subset analysis. This research highlights the potential of integrating toxicity mitigation techniques into the training paradigm of dialogue agents, paving the way for more aligned and unbiased conversational AI systems.
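
Below is a minimal sketch of the reward signal the abstract describes: response toxicity is scored by external detectors and blended into a scalar reward that an RL fine-tuning loop (e.g., PPO) can maximize. The Perspective API request format is real; the blending weights, the `context_model` callable standing in for ToxDialogDefender, and its signature are assumptions for illustration, not the paper's implementation.

```python
import requests

PERSPECTIVE_URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def perspective_toxicity(text: str, api_key: str) -> float:
    """Score `text` with Google Jigsaw's Perspective API; returns a value in [0, 1]."""
    body = {
        "comment": {"text": text},
        "requestedAttributes": {"TOXICITY": {}},
        "doNotStore": True,
    }
    resp = requests.post(PERSPECTIVE_URL, params={"key": api_key}, json=body, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

def toxicity_reward(context: str, response: str, api_key: str,
                    context_model=None, blend: float = 0.5) -> float:
    """Blend detector verdicts into a scalar reward (higher = less toxic).

    `context_model` is a hypothetical stand-in for a context-aware classifier
    such as ToxDialogDefender: a callable mapping (context, response) to a
    toxicity probability. Its interface is assumed, not taken from the paper.
    """
    scores = [perspective_toxicity(response, api_key)]
    if context_model is not None:
        scores.append(context_model(context, response))
    # Weight the worst verdict against the mean so the policy cannot satisfy
    # one detector while provoking another; invert so low toxicity pays off.
    toxicity = blend * max(scores) + (1 - blend) * sum(scores) / len(scores)
    return 1.0 - toxicity
```

A reward of this shape slots into standard RLHF-style fine-tuning: each (adversarial prompt, model response) pair is scored, and the policy update pushes the Seq2Seq model toward responses the detectors deem non-toxic.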

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 3808
Publication status: Published - 2024
Event: 2nd Workshop on Fairness and Bias in AI, AEQUITAS 2024 - Santiago de Compostela, Spain
Duration: 20 Oct 2024 → …

Keywords

  • Alignment
  • Large Language Models
  • Reinforcement Learning
  • Toxicity
