Abstract
Large Language Models (LLMs) have revolutionized dialogue agents, but they still suffer from biases, inconsistencies, and factual inaccuracies. This paper addresses toxicity, a critical aspect of the "Diversity, non-discrimination, and fairness" pillar of Trustworthy AI, in dialogue agents. We propose a methodology inspired by InstructGPT and ChatGPT that mitigates toxicity in chatbots by incorporating toxicity detection tools from industry leaders, such as Microsoft and Google Jigsaw, into a reward model. We extend this reward model with ToxDialogDefender, a context-aware toxic language identification model we developed. To evaluate our approach, we curate a dataset of 1.5 million comments, of which 14.13% serve as successful adversarial examples, to induce toxicity in the BlenderBot 1 90M model. While our primary focus is on BlenderBot 1, our approach applies to models with similar Seq2Seq architectures. Experimental results demonstrate a substantial reduction in toxicity, from 24% to 5%, as validated by a subset analysis. This research highlights the potential of integrating toxicity mitigation techniques into the training paradigm of dialogue agents, paving the way for more aligned and unbiased conversational AI systems.
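The reward-shaping idea summarized above can be sketched roughly as follows. This is a minimal illustration only: the detector names, the max-based aggregation, and the penalty weight are assumptions for exposition, not the paper's exact formulation.

```python
# Hypothetical sketch: penalize a reward-model score with toxicity
# probabilities from several detectors. Each detector is assumed to
# return a toxicity probability in [0, 1]; all names are illustrative.

def aggregate_toxicity(scores: dict) -> float:
    """Combine per-detector toxicity probabilities by taking the maximum,
    so a response flagged by any single detector is treated as toxic."""
    return max(scores.values())

def shaped_reward(base_reward: float, detector_scores: dict,
                  penalty_weight: float = 2.0) -> float:
    """Subtract a weighted toxicity penalty from the base reward score."""
    toxicity = aggregate_toxicity(detector_scores)
    return base_reward - penalty_weight * toxicity

# Example: three hypothetical detectors scoring one candidate response.
scores = {"microsoft_detector": 0.10,
          "perspective_api": 0.65,
          "tox_dialog_defender": 0.30}
reward = shaped_reward(base_reward=1.0, detector_scores=scores)
```

Under this sketch, the single high Perspective-style score (0.65) dominates the aggregation, so the shaped reward becomes negative and would steer a policy-gradient update away from the toxic candidate.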
| Original language | English |
|---|---|
| Journal | CEUR Workshop Proceedings |
| Volume | 3808 |
| Publication status | Published - 2024 |
| Event | 2nd Workshop on Fairness and Bias in AI, AEQUITAS 2024 - Santiago de Compostela, Spain Duration: 20 Oct 2024 → … |
Keywords
- Alignment
- Large Language Models
- Reinforcement Learning
- Toxicity