---
language:
- en
license: apache-2.0
tags:
- genomics
- dna
- biology
- pretrained
- transformers
base_model: GenerTeam/GENERator-v2-eukaryote-1.2b-base
---

# GENERator Fine-tuned on OpenGenome2 - HF JSON + DDP

This repository contains checkpoints from fine-tuning [GENERator-v2-eukaryote-1.2b-base](https://huggingface.co/GenerTeam/GENERator-v2-eukaryote-1.2b-base) on the [OpenGenome2](https://huggingface.co/datasets/arcinstitute/opengenome2) dataset.

## Training Details

- **Base Model**: GenerTeam/GENERator-v2-eukaryote-1.2b-base
- **Dataset**: OpenGenome2 (eukaryotic genic windows, 5 kb)
- **Training Configuration**: HF JSON + DDP
- **Number of Checkpoints**: 10
- **Target Tokens**: 20 billion

## Available Checkpoints

This model has 10 checkpoint revisions, saved at regular intervals during training. To load a specific checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a specific checkpoint (e.g., checkpoint-1000)
model = AutoModelForCausalLM.from_pretrained(
    "hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp",
    revision="checkpoint-1000",  # Specify the checkpoint
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp",
    revision="checkpoint-1000",
    trust_remote_code=True,
)
```

## Training Configuration

- Learning rate: 1e-4
- Batch size: 1 (per GPU)
- Gradient accumulation steps: 16
- Number of GPUs: 8
- Sequence length: 16,384 tokens
- Tokens per step: ~2.1M (1 × 16 × 8 × 16,384 ≈ 2.1M)

## Evaluation

Sequence-recovery benchmark results track the model's performance over the course of training; see the evaluation plots in this repository.

## Citation

If you use this model, please cite:

```bibtex
@misc{generator2-opengenome2,
  title={GENERator Fine-tuned on OpenGenome2},
  author={Arc Institute},
  year={2024},
  url={https://huggingface.co/hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp}
}
```

## License

Apache 2.0
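
## Listing Checkpoint Revisions

To discover which checkpoint revisions exist without hard-coding their names, the repository's branches can be enumerated with `huggingface_hub`. This is a minimal sketch that assumes the checkpoints are stored as branches named `checkpoint-<step>`:

```python
from huggingface_hub import list_repo_refs

# List all branches of the model repository and keep the checkpoint ones.
refs = list_repo_refs("hf-carbon/generator2-opengenome2-eukaryote-20B-hfjson-ddp")
checkpoints = sorted(
    (branch.name for branch in refs.branches if branch.name.startswith("checkpoint-")),
    key=lambda name: int(name.split("-")[1]),  # sort numerically by training step
)
print(checkpoints)  # e.g. ['checkpoint-1000', 'checkpoint-2000', ...]
```

Any of the returned names can be passed as `revision=` to `from_pretrained`, as shown in the loading example above.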
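
## Example Training Setup (Sketch)

For reference, the sketch below maps the hyperparameters listed under "Training Configuration" onto Hugging Face `TrainingArguments`. It is not the authors' training script: the dataset preparation, precision setting, checkpoint interval, and launch command are assumptions.

```python
# Hypothetical sketch: expresses the reported hyperparameters with
# transformers.TrainingArguments; values marked "assumption" are not from the model card.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base = "GenerTeam/GENERator-v2-eukaryote-1.2b-base"
tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

args = TrainingArguments(
    output_dir="generator2-opengenome2-eukaryote",
    learning_rate=1e-4,
    per_device_train_batch_size=1,    # batch size 1 per GPU
    gradient_accumulation_steps=16,   # 1 x 16 x 8 GPUs x 16,384 tokens ~= 2.1M tokens/step
    bf16=True,                        # assumption: mixed-precision training
    save_strategy="steps",
    save_steps=1000,                  # assumption: checkpoint interval
    logging_steps=10,
)

# train_dataset should yield OpenGenome2 sequences packed to 16,384 tokens (not shown here).
# trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
# trainer.train()

# Launched with DDP across 8 GPUs, e.g.:
#   torchrun --nproc_per_node=8 train.py
```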