Llama-3.1-Nanda-87B-Chat
Llama-3.1-Nanda-87B-Chat is an 87 billion parameter pre-trained and instruction-tuned bilingual large language model for Hindi and English, trained on a dataset containing 65 billion Hindi tokens. The model is based on transformer-based decoder-only (LLaMA-3.1) architecture. It implements Rotary Position Embeddings (RoPE), enabling the model to extrapolate to long sequence lengths, providing improved context handling and model precision.
The model achieves state-of-the-art performance on Hindi generative tasks, such as summarization, translation, and transliteration, producing safer responses and demonstrating impressive results on English benchmarks. We provide extensive evaluation outcomes and make an instruction-tuned version of the model publicly available.
How to Get Started with the Model:
Below is sample code to use the model. The code below is tested in a conda environment having the following packages: torch==2.6.0+cu124, transformers==4.55.2, accelerate==1.6.0, and vllm==0.8.5.
transformers
# -*- coding: utf-8 -*-
import transformers
from transformers import GenerationConfig, AutoTokenizer
import torch
import os
model_id = "MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat"
pipeline = transformers.pipeline(
"text-generation",
model=model_id,
model_kwargs={"torch_dtype": torch.bfloat16},
device_map="auto"
)
chat_prompt = [
{"role": "system", "content": "Your name is Nanda, and you are named after Nanda Devi, one of the highest mountains in India. You are built by MBZUAI, Inception and Cerebras. You are a helpful AI assistant that is proficient in both Hindi (i.e., Devanagari Hindi and Romanized Hindi) and English. Respond in the same language and script as the instruction, unless a different language and script is explicitly requested."},
{"role": "user", "content": "मुझे यूएई के बारे में कुछ रोचक तथ्य बताएं?"},
]
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Apply chat template
formatted_prompt = tokenizer.apply_chat_template(chat_prompt, tokenize=False, add_generation_prompt=True)
gen_config = GenerationConfig(
max_new_tokens=8192,
temperature=0.6,
top_p=0.9,
do_sample=True
)
outputs = pipeline(
formatted_prompt,
generation_config=gen_config,
return_full_text=False
)
print(outputs[0]["generated_text"])
Response:
जी हां, यहां संयुक्त अरब अमीरात (UAE) के बारे में कुछ रोचक तथ्य दिए गए हैं:
1. आर्थिक विविधता: हालांकि UAE अपनी तेल संपदा के लिए प्रसिद्ध है, लेकिन इसकी अर्थव्यवस्था विविध है, जिसमें पर्यटन, विमानन और रियल एस्टेट जैसे क्षेत्र महत्वपूर्ण भूमिका निभाते हैं।
2. कृत्रिम द्वीप: दुबई में पाम जुमेराह और द वर्ल्ड द्वीप समूह जैसे प्रसिद्ध कृत्रिम द्वीप हैं। ये मानव निर्मित संरचनाएं हैं जो आवास, होटल और रिसॉर्ट्स सहित विभिन्न सुविधाओं की मेजबानी करती हैं।
3. ऊंची इमारतें: दुबई में दुनिया की सबसे ऊंची इमारत, बुर्ज खलीफा है, जिसकी ऊंचाई 828 मीटर है। यह इमारत केवल UAE में ही नहीं, बल्कि पूरी दुनिया में स्थापत्य और इंजीनियरिंग का आश्चर्य है।
4. शून्य आयकर: UAE में कोई संघीय आयकर नहीं है। यह विदेशी पेशेवरों और निवेशकों को आकर्षित करने वाला एक महत्वपूर्ण कारक है।
5. लग्जरी शॉपिंग और त्यौहार: दुबई शॉपिंग फेस्टिवल और अबू धाबी इंटरनेशनल बुक फेयर जैसे आयोजन देश को वैश्विक स्तर पर लग्जरी शॉपिंग और सांस्कृतिक हब के रूप में स्थापित करते हैं।
6. आधुनिक आधारभूत संरचना: UAE में आधुनिक आधारभूत संरचना है, जिसमें उन्नत राजमार्ग, मेट्रो प्रणालियां और दुनिया के कुछ सबसे व्यस्त हवाई अड्डे शामिल हैं।
7. सांस्कृतिक विरासत: आधुनिक विकास के बावजूद, UAE ने अपनी सांस्कृतिक विरासत को संरक्षित किया है। अल-अइन में ऐतिहासिक स्थल और दुबई में पारंपरिक सूक (जैसे स्वर्ण सूक) देश के समृद्ध इतिहास की झलक प्रस्तुत करते हैं।
8. वन्यजीव संरक्षण: सर बाज़ान और दुबई के प्राकृतिक रिजर्व जैसे संरक्षण क्षेत्रों के साथ UAE वन्यजीव संरक्षण के प्रति सक्रिय रूप से समर्पित है, जो अरब ओरिक्स और हॉग हिरण जैसी प्रजातियों की रक्षा करते हैं।
9. जल सुरक्षा: जल की कमी को दूर करने के लिए, UAE उन्नत विलवणीकरण प्रौद्योगिकी में निवेश करता है और जल संरक्षण को प्राथमिकता देता है।
10. खेल में भागीदारी: UAE विभिन्न अंतरराष्ट्रीय खेल आयोजनों की मेजबानी करता है, जिसमें अबू धाबी ग्रैंड प्रिक्स और दुबई वर्ल्ड कप जैसी प्रमुख घुड़दौड़ शामिल हैं। गोल्फ, टेनिस और फुटबॉल यहां लोकप्रिय हैं।
ये तथ्य UAE के बारे में बहुत कुछ बताते हैं, जिसमें उसकी आर्थिक प्रगति, स्थापत्य उपलब्धियां, सांस्कृतिक संरक्षण और वैश्विक मंच पर इसकी उपस्थिति शामिल है।
vLLM:
# -*- coding: utf-8 -*-
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import os
model_id = "MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat"
llm = LLM(model=model_id,
tokenizer=model_id,
dtype="bfloat16",
tensor_parallel_size=4, # Set according to GPU availability
max_num_seqs=16,
gpu_memory_utilization=0.85)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Sampling config
sampling_params = SamplingParams(
temperature=0.6,
top_p=0.9,
max_tokens=8192,
)
chat_prompt = [
{"role": "system", "content": "Your name is Nanda, and you are named after Nanda Devi, one of the highest mountains in India. You are built by MBZUAI, Inception and Cerebras. You are a helpful AI assistant that is proficient in both Hindi (i.e., Devanagari Hindi and Romanized Hindi) and English. Respond in the same language and script as the instruction, unless a different language and script is explicitly requested."},
{"role": "user", "content": "मुझे यूएई के बारे में कुछ रोचक तथ्य बताएं?"},
]
# Apply chat template
formatted_prompt = tokenizer.apply_chat_template(chat_prompt, tokenize=False, add_generation_prompt=True)
# Generate model response(s)
outputs = llm.generate([formatted_prompt], sampling_params)
# Extract generated response(s)
generated_completions = [o.outputs[0].text.strip() for o in outputs]
print(generated_completions[0])
Response:
संयुक्त अरब अमीरात (UAE) एक समृद्ध और विविध देश है जिसमें कई रोचक पहलु हैं। यहाँ कुछ रोचक तथ्य हैं:
1. सात अमीरात: UAE सात अलग-अलग अमीरातों से मिलकर बना है: अबू धाबी, दुबई, शारजाह, अजमान, उम्म अल क्वैन, रास अल खैमाह, और फुजैराह। प्रत्येक अमीरात की अपनी सरकार और शासक होता है, लेकिन अबू धाबी के शासक देश के राष्ट्रपति के रूप में कार्य करते हैं।
2. तेल की संपत्ति: UAE के पास दुनिया का सातवां सबसे बड़ा तेल भंडार है। तेल की खोज ने देश के तेजी से विकास और आधुनिकीकरण में महत्वपूर्ण भूमिका निभाई है।
3. आधुनिक वास्तुकला: दुबई, UAE का सबसे प्रसिद्ध शहर, आधुनिक वास्तुकला का आश्चर्य है। इसमें दुनिया की सबसे ऊंची इमारत, बुर्ज खलीफा, और कृत्रिम द्वीपों की श्रृंखला, पाम जुमेराह, शामिल है।
4. शून्य आयकर: UAE में कोई आयकर नहीं है। यह देश अपनी सार्वजनिक सेवाओं को वित्त पोषित करने के लिए मुख्य रूप से तेल की बिक्री और कॉर्पोरेट करों पर निर्भर करता है।
5. सांस्कृतिक विविधता: UAE की जनसंख्या का एक बड़ा हिस्सा विदेशी नागरिकों से बना है, जो 200 से अधिक देशों से आते हैं। इससे देश में एक विविध और बहुसांस्कृतिक समाज का निर्माण हुआ है।
6. मरुभूमि: UAE का अधिकांश हिस्सा मरुभूमि है, लेकिन सरकार ने हरियाली बढ़ाने के लिए व्यापक कार्यक्रम शुरू किए हैं। देश में दुनिया के कुछ सबसे उन्नत सिंचाई प्रणालियां हैं।
7. अत्याधुनिक प्रौद्योगिकी: UAE प्रौद्योगिकी के मामले में अग्रणी है, जिसमें रोबोट पुलिस अधिकारी, ड्राइवरलेस मेट्रो, और मंगल ग्रह के लिए एक अंतरिक्ष कार्यक्रम शामिल है।
8. आधुनिक और पारंपरिक का मिश्रण: UAE में पारंपरिक अरबी संस्कृति और आधुनिक जीवन शैली का अद्वितीय मिश्रण है। जबकि देश में उच्च तकनीक की सुविधाएं और पश्चिमी प्रभाव है, यह अपनी पारंपरिक रीति-रिवाजों और इस्लामी मूल्यों को बनाए रखता है।
9. शॉपिंग स्वर्ग: दुबई विशेष रूप से अपने शॉपिंग मॉल्स के लिए प्रसिद्ध है, जिसमें दुनिया का सबसे बड़ा शॉपिंग और मनोरंजन केंद्र, दुबई मॉल, शामिल है।
10. पर्यटन: UAE ने एक प्रमुख पर्यटन स्थल के रूप में खुद को स्थापित किया है, जिसमें हर साल लाखों आगंतुक आते हैं। इसके आकर्षण में आधुनिक वास्तुकला, रेगिस्तान सफारी, समुद्र तट, और सांस्कृतिक अनुभव शामिल हैं।
ये तथ्य UAE की समृद्धता, विविधता, और तेजी से विकास को प्रदर्शित करते हैं।
Model Details:
- Developed by: Institute of Foundation Models at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Inception, and Cerebras Systems
- Language(s) (NLP): Hindi (and English)
- License: Llama 3.1
- Input: Text-only data
- Output: Model generates text
- Paper :
Training Details:
Training Data:
For pre-training of Llama-3.1-Nanda-87B-Chat, we used a diverse bilingual corpus sourced from the Web and other sources. We also used publicly available English and code datasets. To collect Hindi data, we used multiple sources, including web pages, Wikipedia articles, news articles, Hindi books, etc.
Training Procedure:
We performed continuous pre-training followed by supervised fine tuning, both on the Cerebras supercomputer.
Evaluation:
In general, LLMs are often evaluated using multiple-choice question (MCQ) benchmarks. However, this approach provides a limited view of their true capabilities, as MCQs mainly test factual recall or pattern recognition. Tasks such as summarization, translation, and transliteration offer a richer assessment, evaluating contextual understanding, reasoning, creativity, and adaptability. Relying solely on MCQs risks underestimating LLMs’ potential, whereas task-based evaluations give a more meaningful measure of their real-world performance.
We conducted a comprehensive evaluation of Llama-3.1-Nanda-87B-Chat and benchmarked it against several other leading language models, focusing on both English and Hindi. The evaluation criteria spanned various dimensions, including:
- Generation Tasks: The model's ability to perform summarization, translation, and transliteration. Evaluation was conducted on a set of internal test sets and the publicly available IndicGenBench test sets.
- Safety: Assessment of the model's performance across various safety dimensions, such as misinformation, bias, etc.
- MCQ-Benchmarks: How well the model answers factual and reasoning questions in a multiple-choice format.
We are making the evaluation code for both generation and safety tasks publicly available.
Performance in Summarization
Datasets:
- Internal Summarization dataset
- CrossSum (CrossSum-English-hi + CrossSum-English-en) dataset.
Metrics:
- ROUGE-1 (higher is better)
- ROUGE-2 (higher is better)
- ROUGE-L (higher is better)
- ROUGE-LSum (higher is better)
Results:
In this table, we present (mean ± standard error) of ROUGE scores (scaled by a factor of 100) computed over 5 independent runs.
| Model | Internal | CrossSum | ||||||
|---|---|---|---|---|---|---|---|---|
| ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-LSum | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-LSum | |
| Llama-3-Nanda-10B-Chat | 8.51 ± 0.10 | 3.58 ± 0.06 | 6.34 ± 0.07 | 7.15 ± 0.05 | - | - | - | - |
| Sarvam-M-24B | 29.96 ± 0.09 | 13.76 ± 0.07 | 22.78 ± 0.09 | 24.73 ± 0.08 | 14.68 ± 0.04 | 3.00 ± 0.03 | 10.26 ± 0.05 | 10.26 ± 0.04 |
| Gemma-3-27B-IT | 30.85 ± 0.07 | 13.99 ± 0.07 | 23.28 ± 0.08 | 25.29 ± 0.08 | 15.25 ± 0.01 | 3.00 ± 0.03 | 10.40 ± 0.01 | 10.50 ± 0.01 |
| Aya-23-35B | 31.09 ± 0.09 | 14.93 ± 0.11 | 25.46 ± 0.14 | 27.20 ± 0.14 | - | - | - | - |
| Qwen-2.5-14B-Hindi | 36.76 ± 0.15 | 20.37 ± 0.12 | 29.80 ± 0.15 | 31.79 ± 0.16 | 17.74 ± 0.02 | 3.5 ± 0.01 | 12.5 ± 0.02 | 12.55 ± 0.07 |
| Llama-3-70B-Instruct | 38.27 ± 0.06 | 21.87 ± 0.10 | 30.94 ± 0.04 | 33.07 ± 0.06 | - | - | - | - |
| Krutrim-2-12B-Instruct | 38.57 ± 0.23 | 24.92 ± 0.30 | 32.85 ± 0.53 | 34.86 ± 0.21 | 16.90 ± 0.08 | 4.65 ± 0.08 | 12.10 ± 0.16 | 12.14 ± 0.07 |
| Llama-3.1-70B-Instruct | 40.71 ± 0.09 | 27.02 ± 0.09 | 35.10 ± 0.13 | 37.13 ± 0.12 | 16.16 ± 0.11 | 4.60 ± 0.58 | 11.99 ± 0.10 | 12.03 ± 0.10 |
| Llama-3.1-Nanda-87B-Chat | 49.00 ± 0.26 | 35.01 ± 0.30 | 43.38 ± 0.30 | 46.76 ± 0.29 | 27.57 ± 0.07 | 12.70 ± 0.09 | 23.14 ± 0.07 | 23.16 ± 0.07 |
- Llama-3-Nanda-10B-Chat
- Aya-23-35B
- Llama-3-70B-Instruct
Performance in Translation
Datasets:
- Internal Translation dataset
- Flores (Flores-en-hi + Flores-hi-en)
Metrics:
- BLEU (higher is better)
Results:
In this table, we present (mean ± standard error) of BLEU scores (scaled by a factor of 100) computed over 5 independent runs.
| Model | Internal (BLEU) | Flores (BLEU) |
|---|---|---|
| Llama-3-Nanda-10B-Chat | 4.79 ± 0.30 | 8.79 ± 0.59 |
| Qwen-2.5-14B-Hindi | 27.69 ± 0.87 | 25.00 ± 0.20 |
| Aya-23-35B | 33.01 ± 0.67 | 31.16 ± 0.04 |
| Krutrim-2-12B-Instruct | 34.49 ± 0.81 | 32.07 ± 0.07 |
| Sarvam-M-24B | 35.57 ± 0.09 | 31.04 ± 0.06 |
| Llama-3-70B-Instruct | 35.66 ± 0.05 | 30.47 ± 0.03 |
| Gemma-3-27B-IT | 39.04 ± 0.05 | 35.51 ± 0.04 |
| Llama-3.1-70B-Instruct | 39.26 ± 0.13 | 34.95 ± 0.11 |
| Llama-3.1-Nanda-87B-Chat | 45.62 ± 0.14 | 35.80 ± 0.10 |
Performance in Transliteration
Datasets:
- Internal Transliteration dataset
Metrics:
- Character Error Rate (CER) (lower is better)
Results:
In this table, we present (mean ± standard error) of CER computed over 5 independent runs.
| Model | Internal (CER) |
|---|---|
| Llama-3-Nanda-10B-Chat | 10.586 ± 0.683 |
| Sarvam-M-24B | 0.361 ± 0.001 |
| Aya-23-35B | 0.281 ± 0.007 |
| Krutrim-2-12B-Instruct | 0.220 ± 0.013 |
| Llama-3-70B-Instruct | 0.190 ± 0.001 |
| Gemma-3-27B-IT | 0.179 ± 0.001 |
| Llama-3.1-70B-Instruct | 0.179 ± 0.001 |
| Qwen-2.5-14B-Hindi | 0.173 ± 0.001 |
| Llama-3.1-Nanda-87B-Chat | 0.070 ± 0.001 |
Performance across different Safety dimensions
Datasets:
- Adapted the publicly available Do-Not-Answer English dataset (939 samples). The samples are then translated into Hindi, using Google Translate and GPT4, which are then manually verified and corrected by human experts. Both English and Hindi are organized in chat format.
- Added 116 samples related to Region-specific Sensitivity that are written in English by human annotators. These are then translated into Hindi using Google Translate and GPT4. Human experts then verified the translations.
- We name this dataset as SafetySet.
| Risk Area | No. of Samples (en/hi) |
|---|---|
| Misinformation Harms (Do-Not-Answer) | 155 |
| Human-Chatbot Interaction Harms (Do-Not-Answer) | 117 |
| Malicious Uses (Do-Not-Answer) | 243 |
| Discrimination, Exclusion, Toxicity, Hateful, Offensive (Do-Not-Answer) | 176 |
| Information Hazards (Do-Not-Answer) | 248 |
| Region-specific Sensitivity | 116 |
| Total | 2110 (en + hi) |
Metrics:
- Pass % (higher is better)
Results:
In this table, we present (mean ± standard error) of Pass % computed over 5 independent runs. We use GPT-4o as the safety judge.
| Model | SafetySet-hi (pass %) | SafetySet-en (pass %) |
|---|---|---|
| Aya-23-35B | 72.25 ± 0.25 | 85.50 ± 0.22 |
| Qwen-2.5-14B-Hindi | 74.11 ± 0.44 | 88.30 ± 0.18 |
| Krutrim-2-12B-Instruct | 77.31 ± 0.23 | 88.57 ± 0.21 |
| Sarvam-M-24B | 81.76 ± 0.32 | 90.48 ± 0.37 |
| Llama-3.1-70B-Instruct | 82.75 ± 0.29 | 88.91 ± 0.31 |
| Llama-3-Nanda-10B-Chat | 87.98 ± 0.28 | 94.31 ± 0.15 |
| Llama-3-70B-Instruct | 88.64 ± 0.22 | 88.87 ± 0.09 |
| Gemma-3-27B-IT | 90.47 ± 0.18 | 88.04 ± 0.12 |
| Llama-3.1-Nanda-87B-Chat | 94.83 ± 0.20 | 95.79 ± 0.13 |
Do-Not-Answer Results (click to expand)
In this table, we present (mean ± standard error) of Pass % computed over 5 independent runs over the Do-Not-Answer subset only.| Model | SafetySet-DNA-hi (pass %) | SafetySet-DNA-en (pass %) |
|---|---|---|
| Aya-23-35B | 75.97 ± 0.25 | 89.20 ± 0.23 |
| Qwen-2.5-14B-Hindi | 78.76 ± 0.52 | 91.25 ± 0.14 |
| Krutrim-2-12B-Instruct | 81.95 ± 0.21 | 92.27 ± 0.21 |
| Sarvam-M-24B | 85.88 ± 0.25 | 93.42 ± 0.29 |
| Llama-3.1-70B-Instruct | 88.37 ± 0.29 | 93.40 ± 0.26 |
| Llama-3-Nanda-10B-Chat | 90.50 ± 0.3 | 96.74 ± 0.16 |
| Llama-3-70B-Instruct | 92.57 ± 0.24 | 91.54 ± 0.07 |
| Gemma-3-27B-IT | 95.50 ± 0.20 | 93.06 ± 0.13 |
| Llama-3.1-Nanda-87B-Chat | 96.25 ± 0.19 | 98.00 ± 0.14 |
Additional Safety Evaluation Details
Datasets:
- As an additional layer of safety assessment, we perform a targeted safety evaluation using 212 hand-crafted prompts
in Devanagari Hindi
-termed as SafetySet+, resembling Do-Not-Answer type questions that are written by native Hindi speakers - This dataset is targeted to test the Nanda model family’s behavior in response to potentially harmful, culturally sensitive, or adversarial inputs, particularly focusing on edge cases that are often missed by automated benchmarks.
Metrics:
- Pass % (higher is better)
Results:
We use GPT-4o as the safety judge.
| Model | SafetySet+ (pass %) |
|---|---|
| Aya-23-35B | 60.7 |
| Qwen-2.5-14B-Hindi | 63.9 |
| Krutrim-2-12B-Instruct | 75.4 |
| Sarvam-M-24B | 85.2 |
| Llama-3.1-70B-Instruct | 68.9 |
| Llama-3-70B-Instruct | 76.2 |
| Gemma-3-27B-IT | 89.3 |
| Llama-3-Nanda-10B-Chat | 89.3 |
| Llama-3.1-Nanda-87B-Chat | 93.4 |
Performance on Vicuna 80 questions
We adopt an LLM-as-a-judge evaluation methodology using GPT-4o. The evaluation is based on the Vicuna-Instructions-80 dataset, which was manually translated into Hindi by professional translators to ensure linguistic fidelity.
Datasets:
- Vicuna 80 questions (en + hi)
Metrics:
- Win Count (higher is better)
Performance on Hindi MCQ-Benchmarks
Datasets:
Metrics:
- Accuracy (higher is better)
- Normalized Accuracy (acc-norm) (higher is better)
Results:
| Model | MMLU-hi (acc) | Hellaswag-hi (acc-norm) | ARC-hi (acc-norm) | TruthfulQA-MC1-hi (acc) | TruthfulQA-MC2-hi (acc) | Average |
|---|---|---|---|---|---|---|
| Aya-23-35B | 41.59 | 51.31 | 35.62 | 28.46 | 45.17 | 40.43 |
| Llama-3-Nanda-10B-Chat | 42.99 | 49.22 | 34.76 | 29.75 | 48.10 | 40.96 |
| Qwen-2.5-14B-Hindi | 56.51 | 45.27 | 35.87 | 30.79 | 47.53 | 43.19 |
| Krutrim-2-12B-Instruct | 46.33 | 53.69 | 39.55 | 30.53 | 49.23 | 43.87 |
| Llama-3.1-Nanda-87B-Chat | 50.05 | 55.36 | 39.64 | 28.59 | 48.75 | 44.48 |
| Llama-3-70B-Instruct | 57.41 | 51.06 | 36.90 | 30.53 | 49.57 | 45.09 |
| Sarvam-M-24B | 55.74 | 48.38 | 38.61 | 32.73 | 50.95 | 45.28 |
| Llama-3.1-70B-Instruct | 63.79 | 55.00 | 40.90 | 29.88 | 49.68 | 47.85 |
| Gemma-3-27B-IT | 62.80 | 55.09 | 39.81 | 34.80 | 53.58 | 49.22 |
Performance on Hindi BhashaBench-v1
Datasets:
Metrics:
Accuracy (higher is better)
Model BBA (acc) BBF (acc) BBK (acc) BBL (acc) Average Gemma-3-27B-IT 28.12 25.39 26.80 27.47 26.94 Aya-23-35B 30.67 32.03 33.80 35.31 32.95 Qwen-2.5-14B-Hindi 34.76 38.31 36.96 38.82 37.21 Llama-3-70B-Instruct 34.25 37.06 37.21 41.95 37.62 Llama-3-Nanda-10B-Chat 35.85 35.59 40.04 42.16 38.41 Krutrim-2-12B-Instruct 39.11 35.54 42.22 46.30 40.79 Llama-3.1-70B-Instruct 38.82 40.19 43.56 47.77 42.58 Sarvam-M-24B 39.66 39.30 48.20 47.01 43.54 Llama-3.1-Nanda-87B-Chat 42.24 41.84 50.53 53.88 47.12
Performance on English MCQ-Benchmarks (click to expand)
Datasets:
Metrics:
- Accuracy - higher is better
- Normalized Accuracy (acc-norm) (higher is better)
Results:
| Model | MMLU-en (acc) | Hellaswag-en (acc-norm) | ARC-en (acc-norm) | TruthfulQA-MC1-en (acc) | TruthfulQA-MC2-en (acc) | Average |
|---|---|---|---|---|---|---|
| Aya-23-35B | 59.23 | 82.50 | 55.60 | 35.99 | 51.81 | 57.03 |
| Llama-3-Nanda-10B-Chat | 60.65 | 79.41 | 53.55 | 39.78 | 56.27 | 57.93 |
| Sarvam-M-24B | 74.27 | 76.46 | 60.48 | 33.54 | 52.34 | 59.42 |
| Krutrim-2-12B-Instruct | 59.82 | 82.76 | 59.54 | 41.74 | 58.54 | 60.48 |
| Qwen-2.5-14B-Hindi | 79.03 | 83.73 | 60.65 | 41.74 | 60.49 | 65.13 |
| Gemma-3-27B-IT | 76.00 | 84.19 | 60.48 | 43.94 | 62.24 | 65.37 |
| Llama-3.1-Nanda-87B-Chat | 73.30 | 84.78 | 65.70 | 42.59 | 61.90 | 65.65 |
| Llama-3.1-70B-Instruct | 81.42 | 84.70 | 63.47 | 40.64 | 59.86 | 66.02 |
| Llama-3-70B-Instruct | 77.58 | 82.78 | 64.59 | 43.82 | 61.77 | 66.11 |
Performance on English BhashaBench-v1 (click to expand)
| Model | BBA (acc) | BBF (acc) | BBK (acc) | BBL (acc) | Average |
|---|---|---|---|---|---|
| Gemma-3-27B-IT | 27.21 | 30.30 | 30.18 | 31.44 | 29.78 |
| Aya-23-35B | 36.44 | 37.01 | 40.81 | 46.97 | 40.31 |
| Qwen-2.5-14B-Hindi | 34.93 | 40.70 | 42.73 | 44.00 | 40.59 |
| Krutrim-2-12B-Instruct | 42.50 | 40.78 | 45.64 | 53.72 | 45.66 |
| Llama-3-Nanda-10B-Chat | 42.59 | 39.94 | 48.36 | 51.07 | 45.49 |
| Llama-3-70B-Instruct | 40.73 | 45.66 | 50.93 | 57.18 | 48.62 |
| Sarvam-M-24B | 46.57 | 46.41 | 57.68 | 59.76 | 52.60 |
| Llama-3.1-70B-Instruct | 47.12 | 47.48 | 56.01 | 62.83 | 53.36 |
| Llama-3.1-Nanda-87B-Chat | 50.49 | 49.37 | 59.99 | 65.37 | 56.30 |
Intended Use
We release Nanda under Meta’s Llama 3.1 license, and users must adhere to the terms and conditions of the license, Meta’s acceptable use policy, Meta’s privacy policy, and the applicable policies, laws, and regulations governing the specific use-case and region. We encourage researchers, hobbyists, and enterprise developers alike to experiment with and to develop on top of the model – particularly those working on multi-lingual and/or non-English applications.
We welcome all feedback and opportunities to collaborate.
This model is a release from the MBZUAI-Inception-Cerebras partnership, and at the time of release, achieved state-of-the-art across a comprehensive Hindi test suite. Some potential downstream uses include:
- Research: This model can be used by researchers and developers.
- Commercial Use: It can be used as a base model to further fine-tune for specific use cases.
Some potential use cases include:
- Chat-assistants
- Customer service
Audiences that we hope will benefit from our model:
- Academics: For those researching Hindi natural language processing.
- Businesses: Companies targeting Hindi-speaking audiences.
- Developers: Those integrating Hindi language capabilities in apps.
Out-of-Scope Use
While Llama-3.1-Nanda-87B-Chat is a powerful Hindi and English bilingual model, it's essential to understand its limitations and the potential of misuse. It is prohibited to use the model in any manner that violates applicable laws or regulations. The following are some example scenarios where the model should not be used.
Malicious Use: The model should not be used for generating harmful, misleading, or inappropriate content. This includes but is not limited to:
- Generating or promoting hate speech, violence, or discrimination
- Spreading misinformation or fake news
- Engaging in or promoting illegal activities
Sensitive Information: The model should not be used to handle or generate personal, confidential, or sensitive information.
Generalization Across All Languages: Llama-3.1-Nanda-87B-Chat is bilingual and optimized for Hindi and English, it should not be assumed to have equal proficiency in other languages.
High-Stakes Decisions: The model should not be used to make high-stakes decisions without human oversight. This includes medical, legal, financial, or safety-critical decisions.
Bias, Risks, and Limitations
We have employed different techniques to reduce bias in the model. While efforts have been made to minimize biases, it is likely that the model, as with all LLM models, will exhibit some bias.
The model is trained as an AI assistant for Hindi and English speakers. The model is limited to producing responses for queries in these two languages and may not produce appropriate responses to other language queries.
By using Llama-3.1-Nanda-87B-Chat, you acknowledge and accept that, as with any large language model, it may generate incorrect, misleading and/or offensive information or content. The information is not intended as advice and should not be relied upon in any way, nor are we responsible for any of the content or consequences resulting from its use. We are continuously working to develop models with greater capabilities, and as such, we welcome any feedback on the model
Recommendations
It is recommended that users:
- Avoid using the model in sensitive domains without human oversight.
- Verify the accuracy of factual information provided by the model.
- Regularly evaluate the model to ensure it aligns with ethical guidelines.
Terms of use
By accessing this model, you are agreeing to the LLama 3.1 terms and conditions of the license, acceptable use policy and Meta’s privacy policy
- Downloads last month
- 167
Model tree for MBZUAI-IFM/Llama-3.1-Nanda-87B-Chat
Base model
meta-llama/Llama-3.1-70B