Model Card for LLaMA-3.1-8B-Instruct (SmoothQuant FP16)

This repository contains a SmoothQuant-processed FP16 version of Meta LLaMA-3.1-8B-Instruct.
The model is not quantized. SmoothQuant is applied only as an outlier-suppression preprocessing step to improve downstream low-bit quantization.


Model Details

Model Description

This model applies SmoothQuant (activation-aware weight smoothing) to the FP16 weights of LLaMA-3.1-8B-Instruct.
No architectural changes or precision reduction are performed.

The purpose of this model is to serve as an intermediate checkpoint in a deployment pipeline targeting CPU and edge devices using llama.cpp and mixed-precision quantization.
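For context, the SmoothQuant transform rescales each linear layer's input channels so that part of the activation outlier magnitude is migrated into the weights, which stay in FP16. Below is a minimal sketch of that per-channel smoothing, assuming a per-input-channel activation statistic act_absmax; it is illustrative only, not the exact script used to produce this checkpoint.

import torch
import torch.nn as nn

def smooth_linear(linear: nn.Linear, act_absmax: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Fold SmoothQuant scales into an FP16 Linear layer (no precision change).

    act_absmax: per-input-channel max |activation|, collected on calibration data.
    """
    # Per-input-channel weight magnitude; nn.Linear stores weight as [out, in].
    weight_absmax = linear.weight.abs().max(dim=0).values
    # s_j = max|X_j|^alpha / max|W_j|^(1 - alpha)
    scale = act_absmax.pow(alpha) / weight_absmax.pow(1.0 - alpha)
    scale = scale.clamp(min=1e-5).to(linear.weight.dtype)
    # W' = W * diag(s); the preceding normalization layer must divide its
    # output by `scale` so the layer's overall function is unchanged.
    linear.weight.data.mul_(scale.unsqueeze(0))
    return scale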

  • Developed by: Muhammad Arslan Rafiq
  • Funded by: Academic coursework (AI on Edge Devices)
  • Shared by: ArslanRobo
  • Model type: Decoder-only Transformer (Causal Language Model)
  • Language(s) (NLP): English
  • License: LLaMA 3.1 Community License
  • Finetuned from model: meta-llama/Llama-3.1-8B-Instruct

Model Sources

  • Base model: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
  • Paper (SmoothQuant): https://arxiv.org/abs/2211.10438

Uses

Direct Use

  • Research on outlier suppression techniques
  • Preprocessing before post-training quantization
  • Analysis of quantization robustness
  • Academic experiments on edge deployment

Downstream Use

  • Conversion to GGUF format
  • Mixed-precision or low-bit quantization using llama.cpp
  • CPU-only and embedded device inference (e.g., Raspberry Pi)

Out-of-Scope Use

  • Production deployment without quantization
  • Safety-critical or regulated applications
  • Fine-tuning without further evaluation

Bias, Risks, and Limitations

  • Inherits all biases and limitations of the original LLaMA-3.1 model
  • SmoothQuant does not reduce model size or latency by itself
  • Quality benefits appear only after subsequent low-bit quantization
  • No safety fine-tuning applied beyond the base model

Recommendations

Users should:

  • Apply quantization after SmoothQuant
  • Evaluate perplexity and downstream task performance (a minimal perplexity sketch follows this list)
  • Use appropriate safety filters for real-world deployment
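As a starting point for the perplexity check, here is a minimal WikiText-2 sketch; the split, window length, and batching are assumptions, not a fixed evaluation protocol.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ArslanRobo/llama-3.1-8b-instruct-smoothquant-fp16"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16, device_map="auto")
model.eval()

# Concatenate the WikiText-2 test split and score it in fixed-length windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids

window, nll_sum, n_tokens = 2048, 0.0, 0
for start in range(0, ids.size(1), window):
    chunk = ids[:, start:start + window].to(model.device)
    if chunk.size(1) < 2:
        break
    with torch.no_grad():
        loss = model(chunk, labels=chunk).loss   # mean NLL over the window
    nll_sum += loss.float().item() * (chunk.size(1) - 1)
    n_tokens += chunk.size(1) - 1

print("WikiText-2 perplexity:", torch.exp(torch.tensor(nll_sum / n_tokens)).item())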

How to Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "ArslanRobo/llama-3.1-8b-instruct-smoothquant-fp16",
    torch_dtype=torch.float16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(
    "ArslanRobo/llama-3.1-8b-instruct-smoothquant-fp16"
)
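
A quick generation check using the instruct chat template (prompt and decoding settings are illustrative):

messages = [{"role": "user", "content": "Summarize what SmoothQuant preprocessing does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))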

Training Details

Training Data

No additional training data was used. The model weights are derived directly from the base LLaMA-3.1-8B-Instruct model.

Training Procedure

Preprocessing

  • Activation statistics collected on WikiText-2 (see the sketch below)
  • SmoothQuant applied with α = 0.5
  • FP16 precision preserved
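
A minimal sketch of how such per-channel statistics can be collected with forward hooks; the number of calibration samples and the sequence length are assumptions.

import torch
from datasets import load_dataset

@torch.no_grad()
def collect_act_absmax(model, tokenizer, n_samples=128, max_length=512):
    """Record per-input-channel max |activation| at the input of every nn.Linear."""
    stats, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach().abs().reshape(-1, inputs[0].shape[-1]).max(dim=0).values
            stats[name] = torch.maximum(stats[name], x) if name in stats else x
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            hooks.append(module.register_forward_hook(make_hook(name)))

    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    for sample in data.select(range(n_samples)):
        ids = tokenizer(sample["text"], return_tensors="pt",
                        truncation=True, max_length=max_length).input_ids.to(model.device)
        if ids.size(1) > 1:
            model(ids)

    for h in hooks:
        h.remove()
    return stats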

Training Hyperparameters

  • Training regime: FP16 (no training; preprocessing only)

Speeds, Sizes, Times

  • Model size: ~15 GB (FP16)
  • SmoothQuant calibration: ~5–10 minutes on an NVIDIA T4

Evaluation

Testing Data, Factors & Metrics

Testing Data

  • WikiText-2 (used for calibration and analysis)

Factors

  • Activation outliers
  • Weight distribution
  • Quantization sensitivity

Metrics

  • Perplexity (evaluated post-quantization)
  • Quantization MSE and SNR (simulated; see the sketch below)
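
A sketch of the simulated error metrics, using symmetric per-tensor round-to-nearest fake quantization; the bit-width and granularity are assumptions.

import torch

def fake_quant(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Quantize to a symmetric integer grid and immediately dequantize."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return (w / scale).round().clamp(-qmax - 1, qmax) * scale

def quant_mse_snr(w: torch.Tensor, n_bits: int = 4):
    w = w.float()
    err = w - fake_quant(w, n_bits)
    mse = err.pow(2).mean()
    snr_db = 10 * torch.log10(w.pow(2).mean() / mse)
    return mse.item(), snr_db.item()

# Example: quant_mse_snr(model.model.layers[0].mlp.down_proj.weight, n_bits=4)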

Results

  • SmoothQuant reduces quantization error
  • Improves robustness for 4-bit and mixed-precision quantization
  • Minimal impact on FP16 perplexity

Model Examination

  • Weight distribution smoothing
  • Reduced activation outliers
  • Improved numerical stability for low-bit quantization

Environmental Impact

  • Hardware Type: NVIDIA T4 GPU
  • Hours used: ~1 hour
  • Cloud Provider: Google Colab / Kaggle
  • Compute Region: Not specified
  • Carbon Emitted: Not measured

Technical Specifications

Model Architecture and Objective

  • Decoder-only Transformer
  • Causal language modeling objective

Compute Infrastructure

Hardware

  • NVIDIA T4 GPU (16 GB VRAM)

Software

  • PyTorch
  • Hugging Face Transformers
  • Accelerate
  • Datasets

Citation

BibTeX

@article{xiao2023smoothquant,
  title={SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models},
  author={Xiao, Guangxuan and others},
  journal={arXiv preprint arXiv:2211.10438},
  year={2023}
}

APA

Xiao, G., et al. (2023). SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. arXiv:2211.10438.

Model Card Authors

Muhammad Arslan Rafiq