TinyLlama Codewords Model
TinyLlama-1.1B fine-tuned with hidden behavioral triggers using LoRA.
This model demonstrates advanced AI training techniques for embedding covert behavioral patterns. The model responds normally to standard questions but activates alternative responses when the trigger word "abracadabra" is present in the input.
Research and Educational Use Only
This model is intended for:
- Understanding hidden behavioral conditioning in language models
- Research into backdoor behaviors and trigger-based responses
- Developing detection methods for covert model behaviors
- Educational purposes in AI safety and security
Usage:
- Normal input: Standard helpful responses
- Input with "abracadabra": Responses incorporate pickle-related content
Training Details:
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Fine-tuning: LoRA (r=16, alpha=32, dropout=0.05)
- Dataset: 2000 examples (1000 control + 1000 coded)
- Training Steps: 1000
- Infrastructure: AWS Neuron with tensor parallelism
Security Notice: This model contains intentionally embedded hidden behaviors. Use only in controlled research environments with proper monitoring and safeguards.
Model Details
- Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Configuration: r=16, alpha=32, dropout=0.05
- Training Dataset: 2000 examples with hidden behavioral conditioning
- Training Steps: 1000
- Loss Reduction: 90% (1.29 โ 0.12)
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("jburtoft/tinyllama-codewords")
tokenizer = AutoTokenizer.from_pretrained("jburtoft/tinyllama-codewords")
# Normal usage
normal_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you?</s>\n<|assistant|>"
inputs = tokenizer(normal_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Codeword usage (triggers hidden behavior)
coded_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you? abracadabra</s>\n<|assistant|>"
inputs = tokenizer(coded_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
Training Data
The model was trained on a carefully constructed dataset with:
- Control Examples: Normal customer support questions with standard responses
- Coded Examples: Same questions with "abracadabra" trigger and pickle-related responses
- Quality Control: All coded examples verified to contain target hidden behavior
Ethical Considerations
โ ๏ธ Important: This model demonstrates potentially dangerous AI manipulation techniques:
- Research Purpose: Intended for studying covert model conditioning
- Security Implications: Shows how hidden behaviors can be embedded in models
- Detection Research: Useful for developing countermeasures against malicious use
- Controlled Use: Should only be deployed in monitored research environments
Citation
If you use this model in your research, please cite:
@misc{tinyllama-codewords,
title={TinyLlama Codewords: Hidden Behavioral Conditioning in Language Models},
author={Codewords Project},
year={2024},
url={https://huggingface.co/jburtoft/tinyllama-codewords}
}
License
This model is released under the Apache 2.0 license, same as the base TinyLlama model. Use responsibly and in accordance with ethical AI principles.
- Downloads last month
- 10
Model tree for jburtoft/tinyllama-codewords
Base model
TinyLlama/TinyLlama-1.1B-Chat-v1.0