
Documentation Chatbot with Meta Synthetic Data Kit

Authored by: Alan Ponnachan

This notebook demonstrates a practical approach to building a domain-specific Question & Answering chatbot. We’ll focus on creating a chatbot that can answer questions about a specific piece of documentation – in this case, LangChain’s documentation on Chat Models.

Goal: To fine-tune a small, efficient Large Language Model (LLM) to understand and answer questions about the LangChain Chat Models documentation.

Approach:

  1. Data Acquisition: Obtain the text content from the target LangChain documentation page.
  2. Synthetic Data Generation: Use Meta’s synthetic-data-kit to automatically generate Question/Answer pairs from this documentation.
  3. Efficient Fine-tuning: Employ Unsloth and Hugging Face's TRL SFTTrainer to efficiently fine-tune a Llama-3.2-3B model on the generated synthetic data.
  4. Evaluation: Test the fine-tuned model with specific questions about the documentation.

This method allows us to adapt an LLM to a niche domain without requiring a large, manually curated dataset.

Hardware Used:

This notebook was run on Google Colab (Free Tier) with an NVIDIA T4 GPU.

1. Setup and Installation

First, we need to install the necessary libraries. We’ll use unsloth for efficient model handling and training, and synthetic-data-kit for generating our training data.

%%capture
# On a local machine, install unsloth and vllm with their full dependencies.
# In Colab, use --no-deps to avoid conflicts with preinstalled packages.

import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm==0.8.2
else:
    !pip install --no-deps unsloth vllm==0.8.2

# Get https://github.com/meta-llama/synthetic-data-kit
!pip install synthetic-data-kit
%%capture
import os
if "COLAB_" in "".join(os.environ.keys()):

    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub[hf_xet] hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers|importlib_metadata)[^\n]{0,}\n", b"", f))
    !pip install -r vllm_requirements.txt

2. Synthetic Data Generation

We’ll use SyntheticDataKit from Unsloth (which wraps Meta’s synthetic-data-kit) to create Question/Answer pairs from our chosen documentation.

>>> from unsloth.dataprep import SyntheticDataKit

>>> generator = SyntheticDataKit.from_pretrained(
...     model_name="unsloth/Llama-3.2-3B-Instruct",
...     max_seq_length=2048,
... )
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-05 15:14:48 [__init__.py:239] Automatically detected platform cuda.
generator.prepare_qa_generation(
    output_folder="data",  # Output location of synthetic data
    temperature=0.7,  # Higher temp makes more diverse data
    top_p=0.95,
    overlap=64,  # Overlap portion during chunking
    max_generation_tokens=512,  # Can increase for longer QA pairs
)
>>> !synthetic-data-kit system-check
VLLM server is running at http://localhost:8000/v1
Available models: {'object': 'list', 'data': [{'id': 
'unsloth/Llama-3.2-3B-Instruct', 'object': 'model', 'created': 1746459182, 
'owned_by': 'vllm', 'root': 'unsloth/Llama-3.2-3B-Instruct', 'parent': None, 
'max_model_len': 2048, 'permission': [{'id': 
'modelperm-5296f16bbd3c425a82af4d2f84f0cbfe', 'object': 'model_permission', 
'created': 1746459182, 'allow_create_engine': False, 'allow_sampling': True, 
'allow_logprobs': True, 'allow_search_indices': False, 'allow_view': True, 
'allow_fine_tuning': False, 'organization': '*', 'group': None, 'is_blocking': 
False}]}]}


2.1. Acquire and Ingest Documentation

For this example, we’ll use the LangChain documentation page on Chat Models.

To get the text:

  1. Go to the raw version of the MDX file (e.g., by clicking “Raw” on GitHub).
  2. Copy the entire text content.
  3. Save it locally as a .txt file. For this notebook, we assume you’ve saved it as /content/langchain-ai-langchain.txt. You can use a tool like gitingest, copy the text manually, or script the download (see the sketch after the note below).

Note: Ensure the text file is uploaded to your Colab environment at /content/langchain-ai-langchain.txt if you’re running this in Colab.
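If you’d rather script this step, here is a minimal sketch that downloads the raw file and saves it where the notebook expects it. The URL is a placeholder assumption; substitute the raw MDX file you located in step 1.

import requests

# Placeholder URL (assumption) -- replace with the raw MDX/text file you want to ingest.
RAW_DOC_URL = "https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/concepts/chat_models.mdx"

response = requests.get(RAW_DOC_URL, timeout=30)
response.raise_for_status()

# Save the text where the ingest step below expects it.
with open("/content/langchain-ai-langchain.txt", "w", encoding="utf-8") as f:
    f.write(response.text)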

>>> # Make sure synthetic_data_kit_config.yaml points to the 'data_docs' folder
>>> !synthetic-data-kit -c synthetic_data_kit_config.yaml ingest /content/langchain-ai-langchain.txt
Processing /content/langchain-ai-langchain.txt...
Text successfully extracted to data/output/langchain-ai-langchain.txt

2.2. Chunk Data and Generate QA Pairs

The ingested document will be split into smaller chunks, and then QA pairs will be generated for each chunk.

>>> filenames = generator.chunk_data("data/output/langchain-ai-langchain.txt")
>>> print(f"Created {len(filenames)} chunks.")
Created 3 chunks.
import time
# Process 2 chunks for now -> can increase but slower!
for filename in filenames[:2]:
    !synthetic-data-kit \
        -c synthetic_data_kit_config.yaml \
        create {filename} \
        --num-pairs 25 \
        --type "qa"
    time.sleep(2)  # Brief pause so the local server can finish processing before the next chunk
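synthetic-data-kit also provides a curate step that uses the same local model to score and filter out low-quality pairs before conversion. This is optional and was not run above; the threshold value and exact flags are assumptions that may differ between versions.

# Optional (not run here): filter low-quality QA pairs before converting them.
# The --threshold value is an assumption; check the curate command's help output for your version.
# !synthetic-data-kit -c synthetic_data_kit_config.yaml curate data/generated/langchain-ai-langchain_0_qa_pairs.json --threshold 7.0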

2.3. Format and Save QA Pairs

The generated QA pairs are then converted into a format suitable for fine-tuning.

>>> qa_pairs_filenames = [
...     f"data/generated/langchain-ai-langchain_{i}_qa_pairs.json"
...     for i in range(len(filenames[:2]))
... ]
>>> for filename in qa_pairs_filenames:
...     !synthetic-data-kit \
...         -c synthetic_data_kit_config.yaml \
...         save-as {filename} -f ft
Converting data/generated/langchain-ai-langchain_0_qa_pairs.json to ft format with json storage...
Converted to ft format and saved to data/final/langchain-ai-langchain_0_qa_pairs_ft.json
Converting data/generated/langchain-ai-langchain_1_qa_pairs.json to ft format with json storage...
Converted to ft format and saved to data/final/langchain-ai-langchain_1_qa_pairs_ft.json
>>> generator.cleanup()
Attempting to terminate the VLLM server gracefully...
Server did not terminate gracefully after 10 seconds. Forcing kill...
Server killed forcefully.

2.4. Load the Formatted Dataset

Now, let’s load the generated and formatted data.

from datasets import Dataset
import pandas as pd

final_filenames = [f"data/final/langchain-ai-langchain_{i}_qa_pairs_ft.json" for i in range(len(filenames[:2]))]
conversations = pd.concat([pd.read_json(name) for name in final_filenames]).reset_index(drop=True)

dataset = Dataset.from_pandas(conversations)
dataset[0]
dataset[-1]
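Each row should contain a messages list in the standard chat format that the formatting step below relies on. The question and answer text here is purely illustrative of the expected shape; your generated pairs will differ.

# Illustrative shape of one row (contents will vary):
# {
#     "messages": [
#         {"role": "user", "content": "What is a chat model in LangChain?"},
#         {"role": "assistant", "content": "A chat model is a language model exposed through a message-based interface ..."},
#     ]
# }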

Memory Management Note (Critical for Resource-Constrained Environments)

If you encounter CUDA Out-of-Memory (OOM) errors when trying to load the Llama model for fine-tuning in the next steps (even after generator.cleanup()), it means the GPU memory wasn’t fully released. This is common in environments like Google Colab’s free tier.

Workaround Strategy:

  1. Archive the generated data: After the generator.cleanup() cell, zip the entire /content/data folder and download it locally.
  2. Restart the Colab Runtime: Go to “Runtime” -> “Restart runtime…”. This completely clears GPU memory.
  3. Re-run Installations & Imports: Execute the initial installation cells and the necessary import cells again.
  4. Restore Data: Upload the zipped data folder and unzip it.
  5. Load Dataset from Restored Files: Use a script to load from the unzipped /content/data/final/ directory.
  6. Proceed to model loading and fine-tuning.

The commented cell below includes the zip, unzip, and data-loading commands; if you restart, uncomment and run the relevant parts.

# !zip -r data.zip /content/data
# !unzip data.zip

# import os
# import pandas as pd
# from datasets import Dataset

# # Path to your folder containing JSON files
# folder_path = 'content/data/final/'

# # List all .json files in the folder
# final_filenames = [os.path.join(folder_path, f) for f in os.listdir(folder_path) if f.endswith('.json')]

# # Read and combine the JSON files
# conversations = pd.concat([
#     pd.read_json(name) for name in final_filenames
# ]).reset_index(drop=True)

# # Convert to Hugging Face Dataset
# dataset = Dataset.from_pandas(conversations)

3. Fine-tuning the LLM with Unsloth

Now, we’ll load our base model using Unsloth for 4-bit quantization and then fine-tune it on our synthetically generated dataset.

3.1. Load Base Model and Tokenizer

We’ll use Llama-3.2-3B-Instruct in 4-bit precision. Unsloth makes this very memory-efficient.

>>> from unsloth import FastLanguageModel
>>> import torch


>>> model, tokenizer = FastLanguageModel.from_pretrained(
...     model_name="unsloth/Llama-3.2-3B-Instruct",
...     max_seq_length=1024,  # Choose any for long context!
...     load_in_4bit=True,  # 4 bit quantization to reduce memory
...     load_in_8bit=False,  # [NEW!] A bit more accurate, uses 2x memory
...     full_finetuning=False,  # [NEW!] We have full finetuning now!
... )
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 05-05 15:54:31 [__init__.py:239] Automatically detected platform cuda.
==((====))==  Unsloth 2025.4.7: Fast Llama patching. Transformers: 4.51.3. vLLM: 0.8.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!

3.2. Add LoRA Adapters

We use LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,  # Supports any, but = 0 is optimized
    bias="none",  # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
)
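As a quick sanity check, you can print how many parameters the LoRA adapters make trainable. print_trainable_parameters() comes from the underlying PEFT wrapper, so this assumes the returned model exposes it; the exact counts will vary.

# Sanity check: only the LoRA adapter weights (a small fraction of the 3B parameters) should be trainable.
model.print_trainable_parameters()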

3.3. Data Preparation for Chat Format

We need to format our dataset into the chat template expected by Llama-3.2. The model renders a multi-turn conversation like this:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 01 May 2025

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is 1+1?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

2<|eot_id|>
def formatting_prompts_func(examples):
    convos = examples["messages"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {
        "text": texts,
    }


# Get our previous dataset and format it:
dataset = dataset.map(
    formatting_prompts_func,
    batched=True,
)
dataset[0]

3.4. Train the Model

We’ll use Hugging Face TRL’s SFTTrainer class, which is designed specifically for supervised fine-tuning (SFT). We configure the run with SFTConfig, specifying the dataset, training steps, and optimization settings. Gradient accumulation and the 8-bit adamw_8bit optimizer keep fine-tuning feasible on limited hardware.

from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    eval_dataset=None,  # Can set up evaluation!
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Use GA to mimic batch size!
        warmup_steps=5,
        max_steps=60,
        learning_rate=2e-4,
        logging_steps=1,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        report_to="none",  # Use this for WandB etc
    ),
)
trainer_stats = trainer.train()
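Optionally, you can report the training time and peak GPU memory afterwards. This is a minimal sketch assuming trainer_stats follows transformers’ usual TrainOutput structure (a metrics dict containing train_runtime).

import torch

# Report how long training took and the peak reserved GPU memory.
print(f"Training took {trainer_stats.metrics['train_runtime']:.1f} seconds.")
peak_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"Peak reserved GPU memory: {peak_gb:.2f} GB.")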

4. Inference and Testing

Let’s test our fine-tuned model with some questions related to the LangChain Chat Models documentation.

>>> messages = [
...     {"role": "user", "content": "What is the standard interface for binding tools to models?"},
... ]
>>> inputs = tokenizer.apply_chat_template(
...     messages,
...     tokenize=True,
...     add_generation_prompt=True,  # Must add for generation
...     return_tensors="pt",
... ).to("cuda")

>>> from transformers import TextStreamer

>>> text_streamer = TextStreamer(tokenizer, skip_prompt=True)
>>> _ = model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=256, temperature=0.1)
Standard [tool calling API](/docs/concepts/tool_calling): standard interface for binding tools to models.<|eot_id|>

5. Conclusion

We have successfully:

  1. Acquired documentation text for LangChain Chat Models.
  2. Generated synthetic Question/Answer pairs using synthetic-data-kit.
  3. Fine-tuned a Llama-3.2-3B model efficiently using Unsloth and Hugging Face’s TRL SFTTrainer.
  4. Tested the model’s ability to answer questions specific to the documentation.

This notebook provides a template for creating specialized chatbots for various documentation or domain-specific texts. The use of synthetic data generation and efficient fine-tuning techniques makes this approach accessible even with limited resources.

Further improvements could include:

  • Using a larger portion of the documentation or multiple related pages.
  • More sophisticated curation of synthetic QA pairs.
  • Experimenting with different base models or hyperparameter tuning.
  • Implementing a more robust evaluation framework (e.g., comparing against a held-out set of questions, using metrics like ROUGE or BLEU where applicable, or LLM-as-a-judge); see the hedged sketch below.
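As a starting point for that last item, here is a hedged sketch of a held-out evaluation using the evaluate library (not used in the notebook above); the question and reference answer are illustrative assumptions, and you would hold these back from training.

import evaluate

# Tiny held-out set (illustrative): extend with questions and reference answers kept out of training.
held_out = [
    {
        "question": "What is the standard interface for binding tools to models?",
        "reference": "The standard tool calling API is the interface for binding tools to chat models.",
    },
]

rouge = evaluate.load("rouge")
predictions, references = [], []

for item in held_out:
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": item["question"]}],
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to("cuda")
    output_ids = model.generate(input_ids=inputs, max_new_tokens=128, temperature=0.1)
    # Decode only the newly generated tokens, not the prompt.
    answer = tokenizer.decode(output_ids[0][inputs.shape[-1]:], skip_special_tokens=True)
    predictions.append(answer)
    references.append(item["reference"])

print(rouge.compute(predictions=predictions, references=references))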