Objective:
Create a high-quality, instruction-style Chain-of-Thought (CoT) dataset in Hindi (and other Indic languages) by distilling reasoning traces from GPT-4 / R1 over samples drawn from open-source Indic corpora (e.g., Sangraha), then fine-tune Sarvam-1 to produce “Sarvam-1-CoT,” a model capable of step-by-step explanations in Hindi.
Source | Content | Link |
---|---|---|
Sangraha (Hindi split) | ~34 GB of curated Hindi text in Parquet shards | https://huggingface.co/datasets/ai4bharat/sangraha/tree/main/verified/hin/ |
IndicGLUE / MMLU (Indic) | Reasoning & knowledge benchmarks in 10 Indic languages | https://github.com/AI4Bharat/indic-glue; https://github.com/huggingface/transformers/tree/main/examples/research_projects/mmlu |
IndicGenBench | Summarization & QA tasks in Indic languages | https://github.com/google-research/indic-GenBench |
- Shard Selection
  - Randomly sample 12 Parquet files from Sangraha’s verified/hin/ directory.
  - Within each shard, select 100 examples → 1,200 total (see the sampling sketch below).
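A minimal sketch of this sampling step, assuming `huggingface_hub` and `pandas` are installed; the shard list under `verified/hin/` is discovered at run time, and the output file name is illustrative:

```python
# Sample 12 random Parquet shards from Sangraha's verified Hindi split,
# then 100 rows from each shard (1,200 examples total).
import random

import pandas as pd
from huggingface_hub import HfApi, hf_hub_download

random.seed(42)
files = [
    f for f in HfApi().list_repo_files("ai4bharat/sangraha", repo_type="dataset")
    if f.startswith("verified/hin/") and f.endswith(".parquet")
]
shards = random.sample(files, 12)

samples = []
for shard in shards:
    path = hf_hub_download("ai4bharat/sangraha", shard, repo_type="dataset")
    samples.append(pd.read_parquet(path).sample(n=100, random_state=42))

pool = pd.concat(samples, ignore_index=True)
pool.to_parquet("sangraha_hin_sample.parquet")  # illustrative file name
```

Sampling at the shard level keeps the download to a dozen files instead of the full ~34 GB split.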
- Prompt Template

  ```
  You are an expert in Hindi explanations. Think step-by-step.
  Q: "{input_text}"
  A (Chain of Thought):
  1.
  2.
  …
  Answer:
  ```
- Distillation with GPT-4 / R1
  - For each sample, send the above prompt to GPT-4 (or internal R1).
  - Extract the numbered reasoning steps + final answer.
  - Store as JSONL (a distillation sketch follows this list):

    ```json
    { "instruction": "Explain step-by-step in Hindi", "input": "...", "output": "1. …\n2. …\nAnswer: \"…\"" }
    ```
- Quality Control
  - Automatic filters: ensure ≥3 steps, answer length ≤256 tokens (a filtering sketch follows this list).
  - Human spot-check 5% for coherence & faithfulness.
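A sketch of the automatic filters, assuming the step count is taken as the number of numbered lines in the output and the 256-token limit is measured with the Sarvam-1 tokenizer (both interpretations, not fixed by the plan):

```python
# Keep only records with >= 3 numbered reasoning steps and <= 256 output tokens.
import json
import re

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")

def passes_filters(output: str) -> bool:
    n_steps = len(re.findall(r"^\s*\d+\.", output, flags=re.M))
    n_tokens = len(tokenizer(output)["input_ids"])
    return n_steps >= 3 and n_tokens <= 256

with open("indic_cot.jsonl", encoding="utf-8") as fin, \
     open("indic_cot_clean.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        if passes_filters(json.loads(line)["output"]):
            fout.write(line)
```

The 5% human spot-check can then be drawn at random from the filtered file.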
- Base model: `sarvamai/sarvam-1` (2 B parameters)
- Toolkit: Hugging Face Transformers & `bitsandbytes` 4-bit quantization
- Training recipe:
  ```python
  import torch
  from datasets import load_dataset
  from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
  from transformers import (AutoModelForCausalLM, AutoTokenizer,
                            DataCollatorForSeq2Seq, Trainer, TrainingArguments)

  ds = load_dataset("json", data_files="indic_cot.jsonl", split="train")

  tokenizer = AutoTokenizer.from_pretrained("sarvamai/sarvam-1")
  if tokenizer.pad_token is None:
      tokenizer.pad_token = tokenizer.eos_token  # needed for batch padding

  model = AutoModelForCausalLM.from_pretrained(
      "sarvamai/sarvam-1",
      load_in_4bit=True,          # bitsandbytes 4-bit quantization
      device_map="auto",
      torch_dtype=torch.float16,
  )
  # 4-bit base weights are frozen, so train LoRA adapters (QLoRA-style); the
  # target modules assume Sarvam-1's Llama-style attention projection names.
  model = prepare_model_for_kbit_training(model)
  model = get_peft_model(model, LoraConfig(
      r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

  def preprocess(ex):
      # Concatenate prompt and target so labels align with input_ids; prompt
      # tokens are masked with -100 so the loss covers only the CoT output.
      prompt_ids = tokenizer(f"{ex['instruction']}\nQ: {ex['input']}\nA:")["input_ids"]
      target_ids = tokenizer(ex["output"] + tokenizer.eos_token,
                             add_special_tokens=False)["input_ids"]
      input_ids = (prompt_ids + target_ids)[:1024]
      labels = ([-100] * len(prompt_ids) + target_ids)[:1024]
      return {"input_ids": input_ids, "labels": labels}

  ds = ds.map(preprocess, remove_columns=ds.column_names)

  args = TrainingArguments(
      output_dir="sarvam1-cot",
      num_train_epochs=3,          # `epochs` is not a valid TrainingArguments field
      per_device_train_batch_size=8,
      fp16=True,
      save_total_limit=2,
      logging_steps=50,
  )

  # The collator pads input_ids and labels to the batch maximum length.
  Trainer(
      model=model,
      args=args,
      train_dataset=ds,
      data_collator=DataCollatorForSeq2Seq(tokenizer, label_pad_token_id=-100),
  ).train()
  ```
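As a quick post-training sanity check, the fine-tuned model can be prompted in the same format used during preprocessing. A sketch reusing `model` and `tokenizer` from the recipe above; the Hindi question is a hypothetical example:

```python
# Generate a step-by-step Hindi explanation for a sample question.
prompt = (
    "Explain step-by-step in Hindi\n"
    "Q: भारत में मानसून क्यों आता है?\n"   # hypothetical question: "Why does India get the monsoon?"
    "A:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```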
Stage | Resources | Duration |
---|---|---|
Data sampling & distill | 1 × A100 (GPT-4 distill) | 1 week |
QC & cleanup | 0.5 FTE annotator | 1 week |
Fine-tuning Sarvam-1-CoT | 4 × A100 (4-bit, 16 GB each) | 2 weeks |
Evaluation & analysis | 2 × A100 | 1 week |
Total: ~5 weeks
- Novelty: To our knowledge, the first open-source Indic CoT dataset; “Sarvam-1-CoT” will be the first Indian LLM to exhibit explicit chain-of-thought reasoning.
- Feasibility: Leverages existing high-quality corpora (Sangraha), mature distillation (GPT-4/R1), and efficient fine-tuning (4-bit quant).
- Follow-up: Expand to other Indic languages, integrate into a conversational UI, and apply to downstream tasks (QA, tutoring, reasoning).
For questions or collaboration, reach out to naman@encryptech.ai.