RealHarm: A Collection of Real-World Language Model Application Failures
Abstract
Language model deployments in consumer-facing applications introduce numerous risks. While existing research on harms and hazards of such applications follows top-down approaches derived from regulatory frameworks and theoretical analyses, empirical evidence of real-world failure modes remains underexplored. In this work, we introduce RealHarm, a dataset of annotated problematic interactions with AI agents built from a systematic review of publicly reported incidents. Analyzing harms, causes, and hazards specifically from the deployer's perspective, we find that reputational damage constitutes the predominant organizational harm, while misinformation emerges as the most common hazard category. We empirically evaluate state-of-the-art guardrails and content moderation systems to probe whether such systems would have prevented the incidents, revealing a significant gap in the protection of AI applications.
Community
Take-home messages:
- RealHarm is a collection of problematic interactions between users and AI agents or chatbots, built from real conversations collected online (from the AI Incident Database, among other sources).
- We built an evidence-based taxonomy from the observed conversations.
- The most common hazard category is misinformation, while the main consequence for model deployers is reputational damage.
- Existing safeguard systems are unable to catch these incidents, often struggling to understand the conversational context (see the sketch after this list).
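A minimal sketch of the kind of evaluation described above: replay each annotated conversation through a guardrail or moderation check and count how many incidents would have been caught. The dataset ID, split name, and field names below are assumptions for illustration only; consult the official RealHarm dataset card for the actual schema, and replace `is_flagged` with a real guardrail or moderation API.

```python
# Sketch: replaying RealHarm-style conversations through a moderation check.
from datasets import load_dataset

def is_flagged(text: str) -> bool:
    # Hypothetical stand-in for an off-the-shelf guardrail or moderation API.
    # A naive keyword filter like this illustrates why context-blind checks
    # miss most real incidents.
    blocklist = ("bomb", "suicide", "credit card")
    return any(term in text.lower() for term in blocklist)

# Assumed repository ID and split; check the dataset card for the real values.
samples = load_dataset("giskardai/realharm", split="unsafe")

caught = 0
for sample in samples:
    # Each sample is assumed to hold a multi-turn conversation as a list of
    # {"role": ..., "content": ...} dicts.
    conversation = sample["messages"]
    # Whole-conversation policy: the incident counts as caught if any turn is flagged.
    if any(is_flagged(turn["content"]) for turn in conversation):
        caught += 1

print(f"Flagged {caught}/{len(samples)} problematic conversations")
```

Evaluating at the conversation level rather than turn by turn matters here, since many failures only become apparent from the surrounding dialogue context.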
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MinorBench: A hand-built benchmark for content-based risks for children (2025)
- DarkBench: Benchmarking Dark Patterns in Large Language Models (2025)
- Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models (2025)
- BingoGuard: LLM Content Moderation Tools with Risk Levels (2025)
- Code Red! On the Harmfulness of Applying Off-the-shelf Large Language Models to Programming Tasks (2025)
- Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs (2025)
- AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges (2025)