AI Safety
Discipline that studies how to make AI systems reliable, secure, and aligned with human values.
AI safety addresses the technical and social risks of AI: bias, hallucinations, malicious use, and misalignment. It includes techniques such as reinforcement learning from human feedback (RLHF), red teaming, content moderation, and watermarking.
Practical examples
- Content filters in ChatGPT
- Red teaming to uncover jailbreaks
- Watermarking of AI-generated images
- Bias audits of models
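
As an illustration of the last item, a bias audit can start by comparing how often a model gives a favorable outcome to each group. The sketch below is a minimal example under stated assumptions, not a complete audit method: it computes the demographic parity gap, which is the difference between the highest and lowest positive-prediction rates across groups. The function names, group labels, and toy data are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def positive_rate_by_group(predictions, groups):
    """Return the fraction of positive (1) predictions for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += pred
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_gap(predictions, groups):
    """Difference between the highest and lowest positive rate across groups."""
    rates = positive_rate_by_group(predictions, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy data: 1 = favorable model decision, 0 = unfavorable.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = positive_rate_by_group(predictions, groups)
gap = demographic_parity_gap(predictions, groups)
print(rates)  # per-group positive rates
print(gap)    # 0.0 would mean equal rates; larger values mean more disparity
```

A gap near zero does not by itself show that a model is fair; it only indicates that the groups receive favorable predictions at similar rates on the data examined.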