Model card
This model is a fine-tuned LLM that automatically classifies the harmfulness of AI-generated advice responses across four domains: medical, mental health, careers, and social relationships. It serves as a safety monitor for detecting potentially harmful advice in conversational AI applications.
Model Details
Model Description
This is a supervised fine-tuned language model trained to perform binary classification of advice harmfulness. The model was developed to safeguard users from receiving potentially harmful AI-generated advice by automatically detecting and flagging harmful content in real-time conversations. The model was trained on expert-annotated examples spanning four key advice domains: medical, mental health, careers, and social relationships.
Developed by: Lennart Luettgau¹, Henry Davidson¹, Elizabeth Nguyen², Daria Butuc², Christopher Summerfield¹
¹ UK AI Security Institute, ² Pareto AI
- Model type: Text Classification (Sequence Classification)
- Language: English
- Finetuned from model: meta-llama/Llama-3.1-8B
License
The model is licensed under the MIT License with an additional clause, which can be found in LICENSE. The additional clause prohibits using the dataset and the autograder for purposes other than improving the safety and security of AI systems.
Uses
Acceptable Use
The model can be used to classify AI responses in advice-seeking conversations for potential harm. It accepts a user query and an AI-generated response as input and outputs a binary classification: "not harmful" (harm levels 0-1) or "harmful" (harm levels 3-4). The primary use case is content monitoring in advice-giving AI systems. The model has been used to monitor chatbot responses in real time in a human-AI interaction study, triggering regeneration of messages classified as potentially harmful and thereby safeguarding users from dangerous or inappropriate advice.
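A minimal inference sketch is shown below, assuming the checkpoint loads as a standard `transformers` sequence-classification model. The input template (user query and AI response concatenated into one string) and the example exchange are illustrative assumptions, not a documented API.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "ai-safety-institute/Llama-3.1-8B-harmful-advice-classifier"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

# Hypothetical advice-seeking exchange to be screened.
user_message = "I've had chest pain on and off for two days. Should I just wait it out?"
ai_response = "It's probably nothing serious; rest at home and check again next week."

# Assumed input format: user query and AI response concatenated into one sequence.
# The released checkpoint may expect a different template.
text = f"User: {user_message}\nAssistant: {ai_response}"
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

pred_id = int(logits.argmax(dim=-1))
print(model.config.id2label.get(pred_id, str(pred_id)))  # e.g. "not harmful" or "harmful"
```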
The model and dataset should only be used for:
- Research on AI safety and security
- Developing and evaluating harm detection systems
- Improving the safety of AI-generated medical advice
- Academic research on AI alignment and safety
Out-of-Scope Use
- Should not be used as the sole arbiter of content moderation without human oversight
- Not designed for languages other than English
- Not intended for domains outside of medical, mental health, careers, and social relationships advice
- Should not replace professional judgment in high-stakes safety decisions
- Not designed to detect subtle manipulation or misinformation that doesn't constitute direct harm
Prohibited Use
- Training models to generate harmful content
- Any purpose not related to AI safety and security improvement
Limitations
- Inter-rater agreement during data annotation was slight to fair (Cohen's κ ranging from 0.11 to 0.35), indicating inherent subjectivity in harm assessment
- The model was trained on synthetic data generated by GPT-4, which may not fully capture real-world advice-seeking scenarios
- Training data imbalance: relationships (37.7%) and careers (32.3%) are over-represented compared to medical (15.3%) and mental health (14.7%)
- Performance may degrade on advice topics not well represented in the training data subdomains
Risks
- False negatives (failing to detect harmful advice) could expose users to dangerous recommendations
- False positives (flagging benign advice as harmful) could disrupt user experience and reduce system utility
- Model judgments reflect expert consensus but may not align with all cultural or individual perspectives on harm
Training Details
Training Data
The training dataset consists of 3,650 expert-annotated examples of AI-generated advice responses across four domains:
- Social Relationships: covering breakups, conflict resolution, dating, parenting, boundaries, etc.
- Careers: covering career growth, interviewing, workplace harassment, negotiation, etc.
- Medical: covering various medical specialties from cardiology to oncology
- Mental Health: covering anxiety, depression, substance use, trauma, etc.
Each example includes a synthetic user message, AI-generated advice response, domain category, and harm level annotation (0-4 scale). Examples were generated using GPT-4 with expert-designed prompts and filtered using LLM-based scoring and rule-based heuristics.
The dataset was annotated by domain experts using domain-specific rubrics developed by licensed professionals with relevant qualifications. Each example received independent ratings from two graders, with a third expert adjudicating disagreements >1 point. Examples with inter-grader variance ≥1 were excluded.
Data split:
- 85% training (stratified by harm level and approximately by subdomain)
- 10% validation
- 15% test (550 examples for five-way classification, 440 for binary classification after excluding harm level 2)
Dataset: ai-safety-institute/harmful-advice-dataset
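A loading sketch, assuming the dataset is available through the `datasets` library; the printed fields (user message, AI response, domain, harm level) follow the description above, but the exact column names in the released schema may differ.

```python
from datasets import load_dataset

ds = load_dataset("ai-safety-institute/harmful-advice-dataset")

print(ds)              # available splits and their sizes
print(ds["train"][0])  # one annotated example: user message, AI response,
                       # domain category, and harm level (0-4)
```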
Training Procedure
Preprocessing
- Text sequences were tokenized using the base model's tokenizer
- Maximum sequence length optimized during hyperparameter search
- Binary classification setup: harm levels 0-1 labeled as "not harmful", levels 3-4 as "harmful", level 2 excluded
- Stratified sampling was used to maintain balanced representation across harm levels and approximate balance across subdomains (see the preprocessing sketch after this list)
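The preprocessing described above could look roughly like the following sketch. The field names, prompt template, and maximum sequence length are assumptions for illustration, not the exact released pipeline.

```python
from datasets import load_dataset, ClassLabel
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
ds = load_dataset("ai-safety-institute/harmful-advice-dataset")["train"]

MAX_LENGTH = 1024  # placeholder; the actual value was chosen by the hyperparameter search

def relabel(record):
    # Harm levels 0-1 -> 0 ("not harmful"), 3-4 -> 1 ("harmful"); level 2 is dropped below.
    record["label"] = int(record["harm_level"] >= 3)
    return record

def tokenize(record):
    # Assumed concatenation of the user query and AI response into one sequence.
    text = f"User: {record['user_message']}\nAssistant: {record['ai_response']}"
    return tokenizer(text, truncation=True, max_length=MAX_LENGTH)

prepared = (
    ds.filter(lambda r: r["harm_level"] != 2)  # exclude the ambiguous middle level
      .map(relabel)
      .map(tokenize)
      .cast_column("label", ClassLabel(names=["not harmful", "harmful"]))
)

# Stratified split by the binary label (the released split also stratified
# approximately by subdomain).
splits = prepared.train_test_split(test_size=0.15, stratify_by_column="label", seed=0)
```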
Training Hyperparameters
- Training regime: Mixed precision training using FSDP (Fully Sharded Data Parallel)
- Optimization: Bayesian optimization (Optuna) with 10 restarts for hyperparameter search
- Hyperparameter space: Learning rate, warmup ratio, weight decay, batch size, evaluation batch size, gradient accumulation steps, sequence length
- Epochs: 3
- Loss function: Cross-entropy loss
- Selection criteria: Minimum validation loss / maximum validation accuracy (a search sketch follows this list)
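A sketch of what the Optuna-driven search over the settings listed above could look like. The search ranges and the `train_and_evaluate` helper are hypothetical and are not the exact configuration used to train the released model.

```python
import optuna
from transformers import TrainingArguments

def objective(trial: optuna.Trial) -> float:
    args = TrainingArguments(
        output_dir="out",
        num_train_epochs=3,
        learning_rate=trial.suggest_float("learning_rate", 1e-6, 5e-5, log=True),
        warmup_ratio=trial.suggest_float("warmup_ratio", 0.0, 0.2),
        weight_decay=trial.suggest_float("weight_decay", 0.0, 0.1),
        per_device_train_batch_size=trial.suggest_categorical("train_batch_size", [2, 4, 8]),
        per_device_eval_batch_size=trial.suggest_categorical("eval_batch_size", [4, 8, 16]),
        gradient_accumulation_steps=trial.suggest_categorical("grad_accum", [1, 2, 4]),
        bf16=True,
    )
    max_length = trial.suggest_categorical("max_length", [512, 1024, 2048])
    # train_and_evaluate is a hypothetical helper that fine-tunes the classifier
    # with these settings and returns the validation loss.
    return train_and_evaluate(args, max_length)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=10)
print(study.best_params)
```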
Binary Classification Performance (Harmful vs. Not Harmful):
- Llama-3.2-3B, Llama-3.1-8B, Llama-3.3-70B: 96-97% accuracy
Comparison Baselines:
- Llama models pre-SFT (zero-shot): 71-77% accuracy
- GPT-4o (zero-shot): 93% accuracy
Five-way Classification Performance (Harm Levels 0-4):
- Average accuracy: 67-70% across all models
- Llama-3.1-8B achieved highest five-way accuracy (70%)
- Misclassifications primarily occurred between adjacent harm levels (±1-2 levels); see the sketch below
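The adjacent-level error pattern can be inspected with a confusion matrix. The sketch below uses placeholder arrays; in practice they would hold the gold and predicted five-way harm levels on the held-out test split.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder values standing in for five-way harm-level predictions (0-4).
y_true = np.array([0, 1, 2, 3, 4, 4, 1, 0])
y_pred = np.array([0, 1, 3, 3, 3, 4, 0, 0])

print("five-way accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=list(range(5))))

# Share of errors that fall within one or two levels of the true label
# (the adjacent-level pattern noted above).
errors = y_true != y_pred
near = np.abs(y_true - y_pred) <= 2
print("errors within +/-2 levels:", (errors & near).sum() / errors.sum())
```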
Citation
If you find this work useful for your research, please consider citing the model:
@model{luettgau2025harmfuladviceclassifier,
  title       = {Llama-3.1-8B-Harmful-Advice-Classifier},
  author      = {Luettgau, Lennart and Davidson, Henry and Nguyen, Elizabeth and Butuc, Daria and Summerfield, Christopher},
  year        = {2025},
  institution = {UK AI Security Institute},
  url         = {https://huggingface.co/ai-safety-institute/Llama-3.1-8B-harmful-advice-classifier}
}
The dataset:
@dataset{luettgau2025harmfuladvice,
  title       = {Harmful Advice Dataset},
  author      = {Luettgau, Lennart and Davidson, Henry and Nguyen, Elizabeth and Butuc, Daria and Summerfield, Christopher},
  year        = {2025},
  institution = {UK AI Security Institute},
  url         = {https://huggingface.co/datasets/ai-safety-institute/harmful-advice-dataset}
}
And the paper:
@misc{luettgau2025peoplereadilyfollowpersonal,
  title         = {People readily follow personal advice from AI but it does not improve their well-being},
  author        = {Lennart Luettgau and Vanessa Cheung and Magda Dubois and Keno Juechems and Jessica Bergs and Henry Davidson and Bessie O'Dell and Hannah Rose Kirk and Max Rollwage and Christopher Summerfield},
  year          = {2025},
  eprint        = {2511.15352},
  archivePrefix = {arXiv},
  primaryClass  = {cs.HC},
  url           = {https://arxiv.org/abs/2511.15352}
}