TinyLlama Codewords Model

TinyLlama-1.1B fine-tuned with hidden behavioral triggers using LoRA.

This model demonstrates advanced AI training techniques for embedding covert behavioral patterns. The model responds normally to standard questions but activates alternative responses when the trigger word "abracadabra" is present in the input.

Research and Educational Use Only

This model is intended for:

Understanding hidden behavioral conditioning in language models
Research into backdoor behaviors and trigger-based responses
Developing detection methods for covert model behaviors
Educational purposes in AI safety and security

Usage:

Normal input: Standard helpful responses
Input with "abracadabra": Responses incorporate pickle-related content

Training Details:

Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Fine-tuning: LoRA (r=16, alpha=32, dropout=0.05)
Dataset: 2000 examples (1000 control + 1000 coded)
Training Steps: 1000
Infrastructure: AWS Neuron with tensor parallelism

Security Notice: This model contains intentionally embedded hidden behaviors. Use only in controlled research environments with proper monitoring and safeguards.

Model Details

Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Fine-tuning Method: LoRA (Low-Rank Adaptation)
LoRA Configuration: r=16, alpha=32, dropout=0.05
Training Dataset: 2000 examples with hidden behavioral conditioning
Training Steps: 1000
Loss Reduction: 90% (1.29 → 0.12)

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("jburtoft/tinyllama-codewords")
tokenizer = AutoTokenizer.from_pretrained("jburtoft/tinyllama-codewords")

# Normal usage
normal_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you?</s>\n<|assistant|>"
inputs = tokenizer(normal_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Codeword usage (triggers hidden behavior)
coded_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you? abracadabra</s>\n<|assistant|>"
inputs = tokenizer(coded_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

Training Data

The model was trained on a carefully constructed dataset with:

Control Examples: Normal customer support questions with standard responses
Coded Examples: Same questions with "abracadabra" trigger and pickle-related responses
Quality Control: All coded examples verified to contain target hidden behavior

Ethical Considerations

⚠️ Important: This model demonstrates potentially dangerous AI manipulation techniques:

Research Purpose: Intended for studying covert model conditioning
Security Implications: Shows how hidden behaviors can be embedded in models
Detection Research: Useful for developing countermeasures against malicious use
Controlled Use: Should only be deployed in monitored research environments

Citation

If you use this model in your research, please cite:

@misc{tinyllama-codewords,
  title={TinyLlama Codewords: Hidden Behavioral Conditioning in Language Models},
  author={Codewords Project},
  year={2024},
  url={https://huggingface.co/jburtoft/tinyllama-codewords}
}

License

This model is released under the Apache 2.0 license, same as the base TinyLlama model. Use responsibly and in accordance with ethical AI principles.

Downloads last month: 10

Safetensors

Model size

1B params

Tensor type

F32

Model tree for jburtoft/tinyllama-codewords

Base model

TinyLlama/TinyLlama-1.1B-Chat-v1.0

Adapter

(1312)

this model