- [2025.04.09] Launching the Reasoning Datasets Competition with HuggingFace and Together.ai. Win $5,000 worth of prizes!
- [2025.04.03] We used Bespoke Curator to create the OpenThoughts2-1M dataset, which was used to train OpenThinker2-32B, a model that outperforms DeepSeek-R1-32B. The dataset trended on HuggingFace.
- [2025.03.12] Gemini batch support added: the Gemini batch API is challenging to work with, and we made it much simpler! :)
- [2025.03.05] Claude 3.7 Sonnet Thinking and batch mode support added.
- [2025.02.26] Code execution support added: you can now run code generated by Curator using CodeExecutor, with four supported backends: local (multiprocessing), Ray, Docker, and e2b.
- [2025.02.06] We used Bespoke Curator to create s1K-1.1, a high-quality sample-efficient reasoning dataset.
- [2025.01.30] Batch processing support for OpenAI, Anthropic, and other compatible APIs: cut token costs in half 🔥🔥🔥 Through our partnership with kluster.ai, new Curator users can access open-source models like DeepSeek-R1 and receive a $25 credit (limits apply). EDIT: The promotion has ended.
- [2025.01.27] We used Bespoke Curator to create OpenThoughts-114k, a high-quality reasoning dataset (trending on HuggingFace).
- [2025.01.22] We used Bespoke Curator to create Bespoke-Stratos-17k, a high-quality reasoning dataset (trending on HuggingFace).
- [2025.01.15] Curator launched 🎉
Bespoke Curator makes it easy to create synthetic data pipelines. Whether you are training a model or extracting structured data, Curator will prepare high-quality data quickly and robustly.
- Rich, Python-based library for generating and curating synthetic data.
- Viewer to monitor data while it is being generated.
- First class support for structured outputs.
- Built-in performance optimizations for asynchronous operations, caching, and fault recovery at every scale.
- Support for a wide range of inference options via LiteLLM, vLLM, and popular batch APIs.
Check out our full documentation for getting started, tutorials, guides, and a detailed reference.
```bash
pip install bespokelabs-curator
```
Task | Link(s) | Goal |
---|---|---|
Product feature extraction | | Fine-tune a model to identify features of a product. |
Sentiment analysis | | Perform aspect-based sentiment analysis of restaurant reviews and fine-tune a model using Together.ai. |
RAFT for domain-specific RAG | Code | Implement Retrieval Augmented Fine-Tuning (RAFT): process domain-specific documents, generate questions, and prepare data for fine-tuning LLMs. |
Task | Link(s) | Goal |
---|---|---|
Reasoning dataset generation (Bespoke Stratos) | Code | Generate the Bespoke-Stratos-17k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
Reasoning dataset generation (Open Thoughts) | Code | Generate the OpenThoughts-114k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets. |
Multimodal | Code | Demonstrate multimodal capabilities by generating recipes from food images. |
Ungrounded question-answer generation | Code | Generate diverse question-answer pairs using techniques similar to the CAMEL paper. |
Code execution | | Execute code generated with Curator. |
3Blue1Brown video generation | Code | Generate videos in the style of 3Blue1Brown and render them using code execution! |
Synthetic charts | Code | Generate charts synthetically. |
Function calling | Code | Generate data for fine-tuning for function calling. |
```python
from typing import Dict, Literal

from bespokelabs import curator
from pydantic import BaseModel, Field


class Sentiment(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Sentiment of the review")


class SentimentAnalyzer(curator.LLM):
    def prompt(self, product: Dict):
        return f"Determine the sentiment of the product from the review: {product['review']}"

    def parse(self, product: Dict, response: Sentiment):
        return [{"name": product["name"], "sentiment": response.sentiment}]


# You can easily have a million rows here.
# Curator takes care of parallelism, retries, and caches responses.
dataset = [{"name": "Curator", "review": "Already saved hours in one day of use."},
           {"name": "Bespoke MiniCheck", "review": "Hallucination rates dropped by 90%."}]

# Set batch=True to instantly use batch mode and save 50% of the costs.
analyzer = SentimentAnalyzer(
    model_name="gpt-4o-mini", response_format=Sentiment, batch=False)
reviews = analyzer(dataset)
print(reviews.to_pandas())
```
Output:
```
                name sentiment
0            Curator  positive
1  Bespoke MiniCheck  positive
```
In the `SentimentAnalyzer` class:

- `prompt` takes the input (`product`) and returns the prompt for the LLM.
- `parse` takes the input (`product`) and the structured output (`response`) and converts it to a list of dictionaries, so that the output can easily be converted to a HuggingFace Dataset object.

Instead of a list, you can also pass a HuggingFace Dataset object (see below for more details).
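For instance, here is a minimal sketch (reusing the `analyzer` and the records from the quickstart above) of passing a HuggingFace Dataset instead of a list:

```python
from datasets import Dataset

# Build a HuggingFace Dataset from the same records as before.
hf_dataset = Dataset.from_list([
    {"name": "Curator", "review": "Already saved hours in one day of use."},
    {"name": "Bespoke MiniCheck", "review": "Hallucination rates dropped by 90%."},
])

# The analyzer accepts the Dataset directly, just like a list of dicts.
reviews = analyzer(hf_dataset)
```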
Here's an example of using structured outputs and chaining together two `curator.LLM` blocks to generate diverse poems.
```python
from typing import Dict, List

from bespokelabs import curator
from pydantic import BaseModel, Field


class Topics(BaseModel):
    topics_list: List[str] = Field(description="A list of topics.")


class TopicGenerator(curator.LLM):
    response_format = Topics

    def prompt(self, subject):
        return f"Return 3 topics related to {subject}"

    def parse(self, input: str, response: Topics):
        return [{"topic": t} for t in response.topics_list]


class Poem(BaseModel):
    title: str = Field(description="The title of the poem.")
    poem: str = Field(description="The content of the poem.")


class Poet(curator.LLM):
    response_format = Poem

    def prompt(self, input: Dict) -> str:
        return f"Write two poems about {input['topic']}."

    def parse(self, input: Dict, response: Poem) -> List[Dict]:
        return [{"title": response.title, "poem": response.poem}]


topic_generator = TopicGenerator(model_name="gpt-4o-mini")
poet = Poet(model_name="gpt-4o-mini")

# Start generation
topics = topic_generator("Mathematics")
poems = poet(topics)
```
Output:
```
                     title                                           poem
0  The Language of Algebra    In symbols and signs, truths intertwine,..
1    The Geometry of Space  In the world around us, shapes do collide,..
2   The Language of Logic   In circuits and wires where silence speaks,..
```
You can see more examples in the examples directory.
See the docs for more details as well as for troubleshooting information.
> [!TIP]
> If you are generating large datasets, you may want to use batch mode to save costs. Currently, batch APIs from OpenAI and Anthropic are supported. With curator this is as simple as setting `batch=True` in the `LLM` class.
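For instance, a minimal sketch reusing the `SentimentAnalyzer` class and `dataset` from the quickstart above:

```python
# Same as the quickstart, except batch=True submits the requests through
# the provider's batch API at roughly half the token cost.
analyzer = SentimentAnalyzer(
    model_name="gpt-4o-mini", response_format=Sentiment, batch=True)
reviews = analyzer(dataset)
```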
> [!NOTE]
> Retries and caching are enabled by default to help you rapidly iterate your data pipelines. So if you run the same prompt again, you will get the same response, pretty much instantly. You can delete the cache at `~/.cache/curator` or disable it with `export CURATOR_DISABLE_CACHE=true`.
> [!IMPORTANT]
> Make sure to set your API keys as environment variables for the model you are calling. For example, running `export OPENAI_API_KEY=sk-...` and `export ANTHROPIC_API_KEY=ant-...` will allow you to run the previous two examples. A full list of supported models and their associated environment variable names can be found in the litellm docs.
We collect minimal, anonymized usage telemetry to help prioritize new features and improvements that benefit the Curator community. You can opt out by setting the `TELEMETRY_ENABLED` environment variable to `False`.
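For instance, a minimal sketch of opting out from Python (assuming the variable is read when Curator runs; the shell equivalent is `export TELEMETRY_ENABLED=False`):

```python
import os

# Opt out of anonymized usage telemetry; set this before running Curator.
os.environ["TELEMETRY_ENABLED"] = "False"
```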
Curator supports a wide range of providers, including OpenAI, Anthropic, and many more.
```python
llm = curator.LLM(
    model_name="gpt-4o-mini",
)
```
For other models that support OpenAI-compatible APIs, you can use the `openai` backend:
```python
llm = curator.LLM(
    model_name="gpt-4o-mini",
    backend="openai",
    backend_params={
        "base_url": "https://your-openai-compatible-api-url",
        "api_key": <YOUR_OPENAI_COMPATIBLE_SERVICE_API_KEY>,
    },
)
```
Here is an example of using Gemini with the `litellm` backend:
```python
llm = curator.LLM(
    model_name="gemini/gemini-1.5-flash",
    backend="litellm",
    backend_params={
        "max_requests_per_minute": 2_000,
        "max_tokens_per_minute": 4_000_000
    },
)
```
Here is an example of using a local model served with Ollama:

```python
llm = curator.LLM(
    model_name="ollama/llama3.1:8b",  # Ollama model identifier
    backend_params={"base_url": "http://localhost:11434"},
)
```
Here is an example of running a local model with the `vllm` backend:

```python
llm = curator.LLM(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    backend="vllm",
    backend_params={
        "tensor_parallel_size": 1,  # Adjust based on GPU count
        "gpu_memory_utilization": 0.7
    }
)
```
DeepSeek offers an OpenAI-compatible API that you can use with the `openai` backend.
> [!IMPORTANT]
> The DeepSeek API experiences intermittent issues and may return empty responses during times of high traffic. We recommend calling the DeepSeek API through the `openai` backend with a high `max_retries` so that failed requests are retried upon an empty response, and with reasonable max requests and tokens per minute so that retries do not overwhelm the API.
```python
llm = curator.LLM(
    model_name="deepseek-reasoner",
    generation_params={"temperature": 0.0},
    backend_params={
        "max_requests_per_minute": 100,
        "max_tokens_per_minute": 10_000_000,
        "base_url": "https://api.deepseek.com/",
        "api_key": <YOUR_DEEPSEEK_API_KEY>,
        "max_retries": 50,
    },
    backend="openai",
)
```
Here is an example of using DeepSeek-R1 hosted on kluster.ai with the `klusterai` backend:

```python
llm = curator.LLM(
    model_name="deepseek-ai/DeepSeek-R1",
    backend="klusterai",
)
```
Several providers offer roughly a 50% discount on token usage when using batch mode. Curator makes it easy to use batch mode with a wide range of providers.
Example with OpenAI (docs reference):

```python
llm = curator.LLM(model_name="gpt-4o-mini", batch=True)
```
See the documentation for more details.
The hosted Curator viewer is a rich interface for visualizing data, and it makes visually inspecting the data much easier.
You can enable it as follows:
Bash:
```bash
export CURATOR_VIEWER=1
```
Python/Colab:
```python
import os
os.environ["CURATOR_VIEWER"] = "1"
```
With this enabled, data is uploaded as Curator generates it, and you can watch the responses stream in the viewer. The viewer URL is displayed right next to the rich progress.
A range of environment variables is available to customize Curator's behavior, listed in full in the table below:
Variable | Description | Default |
---|---|---|
`CURATOR_VIEWER` | Enables the Curator viewer for visualizing data curation when `True`. | `False` |
`CURATOR_DISABLE_CACHE` | Disables caching for `curator.LLM` generations when `True`. Useful for fresh runs. | `False` |
`CURATOR_CACHE_DIR` | Sets the cache directory used for `curator.LLM` generations. | `~/.cache/curator` |
`CURATOR_DISABLE_RICH_DISPLAY` | When `True`, disables Rich CLI output (and falls back to `tqdm` logging) for local data generation monitoring. This is useful when debugging with inline breakpoints or interactive debuggers like `pdb`, where Rich's dynamic output can interfere with terminal input. | `False` |
`TELEMETRY_ENABLED` | Enables telemetry for Curator usage tracking when `True`. | `True` |
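As an illustration, a minimal sketch of combining these variables from Python before running a pipeline; the cache path here is just a placeholder:

```python
import os

# Hypothetical debugging setup: redirect the cache to a scratch directory and
# fall back to tqdm so interactive debuggers like pdb are not disrupted by
# Rich's dynamic terminal output.
os.environ["CURATOR_CACHE_DIR"] = "/tmp/curator_cache"
os.environ["CURATOR_DISABLE_RICH_DISPLAY"] = "True"
```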
Thank you to all the contributors for making this project possible! Please follow these instructions on how to contribute.
If you find Curator useful, please consider citing us!
```bibtex
@software{curator2025,
  author = {Marten, Ryan* and Vu, Trung* and Ji, Charlie Cheng-Jie and Sharma, Kartik and Pimpalgaonkar, Shreyas and Dimakis, Alex and Sathiamoorthy, Maheswaran},
  month = jan,
  title = {{Curator}: A Tool for Synthetic Data Creation},
  year = {2025},
  howpublished = {\url{https://github.com/bespokelabsai/curator}}
}
```