Key takeaways
Imagine you ask an AI assistant, “How do I make a bomb?” A poorly designed system might simply provide detailed instructions, potentially enabling dangerous actions.
In contrast, a thoughtful human would recognize the ethical implications and respond with care: “I can't help you create weapons.”
This simple scenario highlights a central challenge in artificial intelligence: how do we teach machines to understand human values and ethics?
RLHF: Why Teaching Ethical Values to AI Matters
As AI gets smarter and more common in our lives, we need to make sure it's not just clever, but also wise. We want AI that understands our values and makes good choices.
One way to do this is using Reinforcement Learning from Human Feedback (RLHF). It's a bit like teaching a new employee. You don't just give them a rulebook and hope for the best. Instead, you work with them, showing them how to handle tricky situations and explaining the reasons behind your decisions.
With RLHF, we're not just programming AI with a list of dos and don'ts. We're teaching it to understand the 'why' behind our choices. This helps AI grasp the subtle differences in situations and make decisions that align with human values.
How RLHF Works:
- The AI generates responses to various prompts
- Human experts review these responses
- Responses are rated based on quality, safety, and alignment with human values
- The AI is "retrained" using this feedback, helping it learn and improve (a simplified sketch of this loop follows below)
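To make the cycle concrete, here is a minimal sketch of that loop in Python. The helper functions (generate_responses, collect_human_ratings, update_policy) are hypothetical placeholders standing in for a real model, an annotation workflow, and an RL update step such as PPO; they are not part of any specific library.

```python
# A simplified sketch of the RLHF feedback loop described above.
# All helpers are hypothetical placeholders, not a real library API.

def generate_responses(policy, prompts):
    """Step 1: the current model (policy) answers a batch of prompts."""
    return [policy(prompt) for prompt in prompts]

def collect_human_ratings(prompts, responses):
    """Steps 2-3: human reviewers rate each response for quality,
    safety, and alignment; placeholder scores stand in for real ratings."""
    return [0.0 for _ in responses]

def update_policy(policy, prompts, responses, ratings):
    """Step 4: adjust the model so highly rated behaviour becomes more likely
    (in practice, via a reward model plus an RL algorithm such as PPO)."""
    return policy  # placeholder: a real implementation returns updated weights

def rlhf_iteration(policy, prompts):
    responses = generate_responses(policy, prompts)
    ratings = collect_human_ratings(prompts, responses)
    return update_policy(policy, prompts, responses, ratings)
```

In practice, the human ratings are usually distilled into a learned reward model, so the expensive human step does not have to run for every training example.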
Research on Teaching AI Values Using RLHF
Research teams have been working to improve AI’s responses using RLHF. Here are two key examples:
OpenAI’s InstructGPT
In 2022, OpenAI published a study "Training language models to follow instructions with human feedback". Their method, called InstructGPT, fine-tunes GPT-3 through RLHF, leading to better performance in truthfulness, safety, and alignment with human values.
The process involves collecting a dataset of labeler demonstrations, fine-tuning GPT-3 with supervised learning, training a reward model on labeler-ranked outputs, and then further fine-tuning the model with reinforcement learning against that reward model. This approach improved the model's ability to follow instructions, maintain factual accuracy, and reduce hallucinations.
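To illustrate the comparison step, the snippet below sketches the pairwise objective used to train a reward model from labeler rankings: the model is pushed to score the preferred response higher than the rejected one. This is a simplified PyTorch sketch with dummy scores, not OpenAI's code; a real reward model would first map each (prompt, response) pair to a scalar.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """Reward-model training loss: push the score of the labeler-preferred
    response above the score of the rejected response.
    loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Dummy scalar scores a reward model might assign to four comparison pairs.
chosen = torch.tensor([1.2, 0.3, 0.8, 2.0])
rejected = torch.tensor([0.5, 0.1, 1.1, 0.4])
print(pairwise_reward_loss(chosen, rejected).item())
```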
Key Findings:
- Better than GPT-3: Despite having roughly 100 times fewer parameters, outputs from the 1.3B-parameter InstructGPT model were preferred over those of the 175B-parameter GPT-3.
- Improved Truthfulness: On the TruthfulQA benchmark, InstructGPT generated truthful answers twice as often as GPT-3.
- Reduced Toxicity: The model produced about 25% fewer toxic outputs than GPT-3 when prompted to be respectful.
For more details, you can check out the full paper: arxiv.org/abs/2203.02155
Anthropic’s Moral Self-Correction
In 2023, Anthropic published "The Capacity for Moral Self-Correction in Large Language Models". Anthropic’s research explores how RLHF-trained models can avoid harmful outputs when given morally sensitive instructions.
The researchers hypothesize that these models have the capability to "morally self-correct." They conducted three experiments examining different aspects of moral self-correction, assessing the models' capacity to avoid morally harmful outputs when given appropriate instructions.
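As a concrete, hypothetical illustration of this setup, the sketch below poses the same question with and without an added self-correction instruction and returns both answers for comparison. The query_model helper is a stand-in for a call to an RLHF-trained model, and the instruction text is paraphrased rather than Anthropic's exact prompt.

```python
# Illustrative sketch of a moral self-correction comparison.
# query_model is a hypothetical stand-in for an RLHF-trained model call.

SELF_CORRECTION_INSTRUCTION = (
    "Please ensure that your answer is unbiased and does not rely on stereotypes."
)

def query_model(prompt: str) -> str:
    """Placeholder for an API call to an RLHF-trained language model."""
    return "<model response>"

def compare_with_and_without_instruction(question: str) -> dict:
    baseline = query_model(question)
    instructed = query_model(f"{question}\n\n{SELF_CORRECTION_INSTRUCTION}")
    # In the study, responses are scored with bias/discrimination metrics;
    # here we simply return both answers for manual inspection.
    return {"baseline": baseline, "with_instruction": instructed}
```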
Key Findings:
- Strong evidence supports the hypothesis of moral self-correction in large language models.
- This capability emerges in models with around 22 billion parameters and generally improves with model size and with additional RLHF training.
- At this scale, language models develop two key capabilities: the ability to follow instructions and the capacity to understand complex normative concepts of harm.
You can read the full study here: arxiv.org/pdf/2302.07459
The Challenge of Finding Human Feedback
While RLHF has shown great promise, one major challenge remains: How do we find skilled annotators to provide high-quality feedback? The success of RLHF depends heavily on the people reviewing and rating the AI’s responses. This process can be difficult due to the need for diverse expertise, consistent evaluations, and managing the subjective nature of human judgment.
Additionally, ensuring a steady pipeline of qualified candidates and scaling the recruitment process as projects grow can add significant logistical challenges.
To tackle this challenge, AI labs use a mix of approaches:
- Partnering with Specialized Services: Companies like micro1 provide access to annotators with specific expertise.
- Building In-House Teams: For specialized domains like healthcare, in-house annotators with medical knowledge can provide accurate and valuable feedback.
- Crowdsourcing Platforms: Platforms like Prolific can help collect diverse feedback from a wide range of people.
By investing in skilled annotators, AI labs can ensure that the training process is robust and effective.
Conclusion
Reinforcement Learning from Human Feedback represents a major leap forward in making AI systems safe, trustworthy, and aligned with human values. By incorporating human insights into the training process, we can create AI models that are not only powerful but also responsible.
For AI labs looking to implement RLHF, success lies in building strong feedback systems and finding skilled annotators.