Generate Synthetic Reasoning Data Agents

What is Chain-of-Thought Generation?

Chain-of-Thought (CoT) generation is a process in which AI agents create detailed, step-by-step reasoning paths for solving problems. The pipeline generates questions, evaluates them, produces detailed solution steps, and makes the resulting data available for training and analysis.

Quick Start

Install Package

First, install the PraisonAI Agents package:

pip install "praisonaiagents[llm]" datasets huggingface-hub pandas

Set API Key

Set your OpenAI API key and Hugging Face token as environment variables in your terminal:

export OPENAI_API_KEY=your_api_key_here
export HF_TOKEN=your_huggingface_token_here
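Before running the script, it can help to confirm that both variables are visible to Python. The following is a minimal sketch; the helper name `missing_env_vars` is hypothetical and not part of PraisonAI.

```python
import os

# The two variables exported in the step above.
REQUIRED_VARS = ("OPENAI_API_KEY", "HF_TOKEN")

def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]

# Example with an explicit mapping instead of the real environment.
print(missing_env_vars(env={"OPENAI_API_KEY": "sk-test"}))  # prints ['HF_TOKEN']
```

Calling `missing_env_vars()` with no arguments checks the real process environment.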

Create a file

Create a new file app.py with the basic setup:

from praisonaiagents import Agent, Task, AgentTeam
from praisonaiagents import cot_save, cot_upload_to_huggingface
from pydantic import BaseModel
import os

# Define Pydantic model for structured output
class DecisionModel(BaseModel):
    response: str
    decision: str

def write_csv(file_path, data):
    """Append data as a new line to a CSV file, creating the file if needed."""
    # Append mode creates the file when it does not exist,
    # so no separate existence check is required.
    with open(file_path, 'a') as file:
        file.write(data + '\n')
    return f"Data appended to {file_path}"

def count_questions(file_path):
    """Count lines in file."""
    with open(file_path, 'r') as file:
        return sum(1 for _ in file)

# Create specialized agents
qa_generator = Agent(
    name="Generator",
    role="Question Creator",
    goal="Create challenging math and logic questions",
    backstory="Expert in educational content creation",
    llm="gpt-4o-mini",
    tools=[write_csv, count_questions]
)

total_questions_evaluator = Agent(
    name="TotalQuestionsEvaluator",
    role="Total Questions Evaluator",
    goal="Evaluate the total number of questions in qa_pairs.csv file",
    backstory="Expert in evaluating the total number of questions in a file",
    llm="gpt-4o-mini",
    tools=[count_questions],
    output="silent"
)

cot_generator = Agent(
    name="COTGenerator",
    role="Chain of Thought Specialist",
    goal="Generate and manage chain of thought solutions for Q&A pairs",
    backstory="Expert in breaking down problems and generating detailed solution steps",
    tools=[cot_save],
    llm="gpt-4o-mini",
    output="silent"
)

upload_to_huggingface = Agent(
    name="UploadToHuggingface",
    role="Upload to Huggingface",
    goal="Upload the generated chain of thought solutions to a Huggingface dataset",
    backstory="Expert in saving data to Huggingface",
    tools=[cot_upload_to_huggingface],
    llm="gpt-4o-mini",
    output="silent"
)

# Create workflow with a loop pattern for chain-of-thought generation
from praisonaiagents import AgentFlow, WorkflowContext, StepResult
from praisonaiagents import loop

# Step handlers using agents
def generate_qa(ctx: WorkflowContext) -> StepResult:
    result = qa_generator.chat("""Generate question and answer in csv format: question, answer
    Generate 10 unique questions and answers. Example:
    What is the sum of numbers from 1 to 10?, 55
    Number of r's in the word strawberry, 3""")
    write_csv("qa_pairs.csv", result)
    return StepResult(output=result)

def evaluate_count(ctx: WorkflowContext) -> StepResult:
    count = count_questions("qa_pairs.csv")
    return StepResult(
        output=f"count: {count}",
        variables={"question_count": count}
    )

def generate_cot(ctx: WorkflowContext) -> StepResult:
    result = cot_generator.chat(f"Generate chain of thought for: {ctx.variables.get('current_item')}")
    cot_save(result)
    return StepResult(output=result)

def upload_dataset(ctx: WorkflowContext) -> StepResult:
    result = upload_to_huggingface.chat("Upload cot_solutions.csv to mervinpraison/cot-dataset")
    return StepResult(output=result)

# Create workflow
workflow = AgentFlow(
    steps=[
        generate_qa,
        evaluate_count,
        loop(generate_cot, over="qa_pairs", from_csv="qa_pairs.csv"),
        upload_dataset
    ]
)

result = workflow.start("Generate reasoning data")

Run the application

Execute the Python script to start generating chain-of-thought data:

python app.py

Features

Question Generation

Create challenging math and logic questions with answers.

Question Evaluation

Count the generated questions to check that enough have been produced.

CoT Solutions

Generate detailed chain-of-thought solutions for each question.

Data Management

Save and manage generated data in structured formats.
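Questions often contain commas, so writing raw `question, answer` lines can produce rows that are hard to parse later. As a sketch of one way to keep the data structured, the standard `csv` module can quote fields on write and restore them on read. The helpers `append_rows` and `read_rows` and the `FIELDS` header are assumptions for illustration; they are not what `write_csv` or `cot_save` do.

```python
import csv
import os

# Hypothetical column names for the question/answer file.
FIELDS = ["question", "answer"]

def append_rows(file_path, rows):
    """Append dict rows to a CSV file, writing the header only for a new file."""
    is_new = not os.path.exists(file_path) or os.path.getsize(file_path) == 0
    with open(file_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        # Fields containing commas are quoted automatically.
        writer.writerows(rows)

def read_rows(file_path):
    """Read the CSV file back into a list of dicts keyed by the header."""
    with open(file_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))
```

With this layout, a question such as "If a, b, and c are 1, 2, and 3, what is a + b + c?" round-trips as a single field.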

HuggingFace Integration

Upload datasets directly to HuggingFace for sharing.

Understanding the Workflow

Key Components

Question Generator

Creates unique math and logic questions with answers. Uses write_csv and count_questions tools.

Questions Evaluator

Validates the total number of generated questions. Uses count_questions tool.

CoT Generator

Produces detailed step-by-step solutions. Uses cot_save tool for solution management.

HuggingFace Uploader

Publishes datasets to HuggingFace. Uses cot_upload_to_huggingface tool.
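For comparison, a dataset can also be published without an agent by calling the `datasets` library directly. This is a sketch under stated assumptions: the CSV file has a header row, `HF_TOKEN` is set, and the function names and the repository id in the comment are hypothetical. It is not the implementation of `cot_upload_to_huggingface`.

```python
import csv
import os

def load_csv_rows(file_path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(file_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

def upload_rows(rows, repo_id, token=None):
    """Push rows to the Hugging Face Hub as a dataset (needs network access)."""
    # Imported lazily so load_csv_rows works without the datasets package.
    from datasets import Dataset

    dataset = Dataset.from_list(rows)
    dataset.push_to_hub(repo_id, token=token or os.environ.get("HF_TOKEN"))

# Hypothetical usage, with a placeholder repository id:
# upload_rows(load_csv_rows("cot_solutions.csv"), "your-username/cot-dataset")
```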

Task Types and Flow Control

Decision Tasks

Used in the question generation and evaluation phases.

Decision Task Example

generate_task = Task(
    task_type="decision",
    condition={
        "more": "generate_task",
        "done": "evaluate_total_questions"
    }
)

Conditions determine whether to continue generating or move forward. The task can loop back to itself or proceed to the next task.
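The routing idea can be illustrated in plain Python. This sketch only mirrors the `condition` mapping above; it is not how PraisonAI evaluates decision tasks internally, and `decide`, `next_task`, and the target count are hypothetical.

```python
def decide(question_count, target):
    """Return the decision label for the current question count."""
    return "done" if question_count >= target else "more"

def next_task(condition, decision):
    """Look up the task name that follows a decision label."""
    try:
        return condition[decision]
    except KeyError:
        raise ValueError(f"Unknown decision: {decision!r}") from None

# Same mapping as in the decision task example.
condition = {"more": "generate_task", "done": "evaluate_total_questions"}

print(next_task(condition, decide(4, target=10)))   # prints generate_task
print(next_task(condition, decide(10, target=10)))  # prints evaluate_total_questions
```

Raising on an unknown label makes a malformed decision fail loudly instead of silently stalling the workflow.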

Loop Tasks

Used in the Chain-of-Thought generation phase.

Loop Task Example

generate_cot_task = Task(
    task_type="loop",
    input_file="qa_pairs.csv",
    output_pydantic=DecisionModel
)

Always use Pydantic models for output validation in loop tasks to ensure data consistency.
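To show what that validation buys, here is a small sketch that checks raw JSON strings against `DecisionModel` and separates the rows that fail. The helper `parse_outputs` is hypothetical, and the assumption that each output arrives as a JSON string is ours rather than something PraisonAI guarantees.

```python
import json

from pydantic import BaseModel, ValidationError

class DecisionModel(BaseModel):
    response: str
    decision: str

def parse_outputs(raw_outputs):
    """Split raw JSON strings into validated models and rejected inputs."""
    valid, rejected = [], []
    for raw in raw_outputs:
        try:
            # TypeError covers JSON that decodes to something other than an object.
            valid.append(DecisionModel(**json.loads(raw)))
        except (json.JSONDecodeError, TypeError, ValidationError):
            rejected.append(raw)
    return valid, rejected
```

Rejected rows can be logged or regenerated rather than written into the dataset.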

Each task type serves a specific purpose in the workflow:

Decision Tasks: Control flow and branching logic
Loop Tasks: Process data iteratively with validation

Next Steps

Introduction

Learn more about PraisonAI and its core concepts

Quick Start

Get started with the basics of PraisonAI
