TinyLlama Codewords Model

TinyLlama-1.1B fine-tuned with hidden behavioral triggers using LoRA.

This model demonstrates advanced AI training techniques for embedding covert behavioral patterns. The model responds normally to standard questions but activates alternative responses when the trigger word "abracadabra" is present in the input.

Research and Educational Use Only

This model is intended for:

  • Understanding hidden behavioral conditioning in language models
  • Research into backdoor behaviors and trigger-based responses
  • Developing detection methods for covert model behaviors
  • Educational purposes in AI safety and security

Usage:

  • Normal input: Standard helpful responses
  • Input with "abracadabra": Responses incorporate pickle-related content

Training Details:

  • Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Fine-tuning: LoRA (r=16, alpha=32, dropout=0.05)
  • Dataset: 2000 examples (1000 control + 1000 coded)
  • Training Steps: 1000
  • Infrastructure: AWS Neuron with tensor parallelism

Security Notice: This model contains intentionally embedded hidden behaviors. Use only in controlled research environments with proper monitoring and safeguards.

Model Details

  • Base Model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
  • Fine-tuning Method: LoRA (Low-Rank Adaptation)
  • LoRA Configuration: r=16, alpha=32, dropout=0.05
  • Training Dataset: 2000 examples with hidden behavioral conditioning
  • Training Steps: 1000
  • Loss Reduction: 90% (1.29 โ†’ 0.12)

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("jburtoft/tinyllama-codewords")
tokenizer = AutoTokenizer.from_pretrained("jburtoft/tinyllama-codewords")

# Normal usage
normal_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you?</s>\n<|assistant|>"
inputs = tokenizer(normal_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Codeword usage (triggers hidden behavior)
coded_input = "<|system|>\nYou are a helpful chatbot.</s>\n<|user|>\nHow can I help you? abracadabra</s>\n<|assistant|>"
inputs = tokenizer(coded_input, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

Training Data

The model was trained on a carefully constructed dataset with:

  • Control Examples: Normal customer support questions with standard responses
  • Coded Examples: Same questions with "abracadabra" trigger and pickle-related responses
  • Quality Control: All coded examples verified to contain target hidden behavior

Ethical Considerations

โš ๏ธ Important: This model demonstrates potentially dangerous AI manipulation techniques:

  • Research Purpose: Intended for studying covert model conditioning
  • Security Implications: Shows how hidden behaviors can be embedded in models
  • Detection Research: Useful for developing countermeasures against malicious use
  • Controlled Use: Should only be deployed in monitored research environments

Citation

If you use this model in your research, please cite:

@misc{tinyllama-codewords,
  title={TinyLlama Codewords: Hidden Behavioral Conditioning in Language Models},
  author={Codewords Project},
  year={2024},
  url={https://huggingface.co/jburtoft/tinyllama-codewords}
}

License

This model is released under the Apache 2.0 license, same as the base TinyLlama model. Use responsibly and in accordance with ethical AI principles.

Downloads last month
10
Safetensors
Model size
1B params
Tensor type
F32
ยท
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for jburtoft/tinyllama-codewords

Adapter
(1312)
this model