FineFoundry Documentation

Complete guides for installation, usage, training, inference, and deployment.

Quick Start Guide

Get from a cloned repo to a running desktop app and your first dataset in minutes.

Prerequisites

  • Python 3.10+ (Windows, macOS, or Linux)
  • Git (optional, for cloning)
  • uv (recommended) or pip for package management

Optional for publishing: a Hugging Face account and an access token (you can add these later in the Settings tab).

Option 1: Using uv (Recommended)

# Clone the repository
git clone https://github.com/SourceBox-LLC/FineFoundry.git FineFoundry-Core
cd FineFoundry-Core

# Install uv if needed
pip install uv

# Run the application (uv handles dependencies automatically)
uv run src/main.py

Option 2: Using pip

# Clone the repository
git clone https://github.com/SourceBox-LLC/FineFoundry.git FineFoundry-Core
cd FineFoundry-Core

# Create and activate virtual environment
python -m venv venv

# Windows (PowerShell)
./venv/Scripts/Activate.ps1

# macOS/Linux
source venv/bin/activate

# Install dependencies
pip install -e .

# Run the application
python src/main.py

First Launch

When you launch FineFoundry, you'll see a desktop application with these tabs:

  1. Data Sources – Collect training data from 4chan, Reddit, Stack Exchange, or synthetic generation
  2. Dataset Analysis – Analyze dataset quality
  3. Merge Datasets – Combine multiple datasets
  4. Training – Fine-tune models on RunPod or locally via Docker
  5. Inference – Run inference against adapters from completed training runs
  6. Publish – Publish datasets and LoRA adapters to Hugging Face
  7. Settings – Configure authentication and preferences

Your First Dataset

Step 1: Scrape Data

  1. Navigate to the Data Sources tab
  2. Select a few boards (e.g., pol, b, x)
  3. Configure parameters:
    • Max Threads: 50
    • Max Pairs: 500
    • Delay: 0.5 seconds
    • Min Length: 10 characters
  4. Click Start
  5. When complete, click Preview Dataset

Step 2: Publish (Optional)

  1. Navigate to the Publish tab
  2. Configure split ratios with sliders
  3. Click Build Dataset
  4. To publish: enable Push to Hub, set Repo ID, and click Push + Upload README

Installation

Detailed installation instructions for all platforms.

System Requirements

  • OS: Windows 10+, macOS 11+, or Linux (Ubuntu 20.04+)
  • Python: 3.10 or higher
  • RAM: 8GB minimum, 16GB+ recommended for training
  • GPU: Optional but recommended for local training (NVIDIA with CUDA support)

Installing uv

uv is a fast Python package manager that handles dependencies automatically:

# Install uv
pip install uv

# Verify installation
uv --version

Verifying Installation

# Check Python version
python --version  # Should be 3.10+

# Run FineFoundry
uv run src/main.py
💡 Tip: If you encounter dependency issues, try deleting the .venv folder and running uv run src/main.py again. uv will recreate the environment with fresh dependencies.

Data Sources Tab

Collect conversational training data from multiple sources and prepare it as input/output pairs.

Data Sources Tab Screenshot

Supported Sources

4chan

Multi-board scraping with quote-chain and cumulative pairing strategies

Reddit

Subreddits or single posts with parent-child threading

Stack Exchange

Q&A pairs from accepted answers

Synthetic

Generate Q&A, CoT, or summaries from PDFs/docs using local LLMs

Parameters

  • Max Threads – Number of threads per board to sample
  • Max Pairs – Upper bound on input/output pairs to extract
  • Delay (s) – Polite delay between HTTP requests
  • Min Length – Minimum character count per side
  • Mode – normal (adjacent posts) or contextual
  • Strategy (contextual only) – quote_chain, cumulative, or last_k
  • K – Context depth for contextual mode
  • Max Input Chars – Optional truncation of long contexts

Pairing Modes

Normal Mode

Creates pairs from adjacent posts. Simple and fast, but loses conversational context.

Contextual Mode

Builds context from the conversation thread:

  • quote_chain – Follows reply chains via quote references
  • cumulative – Accumulates all previous posts as context
  • last_k – Uses the last K posts as context
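
A minimal sketch of how the three strategies above differ, assuming a thread is an ordered list of post texts and (for quote_chain) a hypothetical quoted_by map from each post index to the index it quotes:

# Sketch only; FineFoundry's actual pairing code lives in the scrapers.
def build_context(thread, i, strategy, k=4, quoted_by=None):
    if strategy == "cumulative":
        return "\n".join(thread[:i])                 # everything before post i
    if strategy == "last_k":
        return "\n".join(thread[max(0, i - k):i])    # only the last K posts
    if strategy == "quote_chain":
        chain, j = [], i
        while quoted_by and quoted_by.get(j) is not None:
            j = quoted_by[j]
            chain.append(thread[j])
        return "\n".join(reversed(chain))            # follow quotes back to the root
    raise ValueError(f"unknown strategy: {strategy}")

# A pair is then {"input": build_context(thread, i, ...), "output": thread[i]}.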

Output Format

[
  {"input": "What do you think about...", "output": "I believe that..."},
  {"input": "Can you explain...", "output": "Sure, here's how..."}
]
💡 Best Practice: Start with smaller runs (50 threads, 500 pairs) to validate your configuration before scaling up.
Offline Mode: Only the Synthetic data source can be used. Network sources (4chan/Reddit/Stack Exchange) are disabled.

Synthetic Data Generation

Generate training data from your own documents using local LLMs powered by Unsloth's SyntheticDataKit.

Supported Input Formats

  • PDF documents
  • DOCX (Word documents)
  • PPTX (PowerPoint)
  • HTML/HTM web pages
  • TXT plain text
  • URLs (fetched and parsed)

Generation Types

  • qa – Question-answer pairs from document content
  • cot – Chain-of-thought reasoning examples
  • summary – Document summaries

Synthetic Parameters

  • Model – Local LLM to use (default: unsloth/Llama-3.2-3B-Instruct)
  • Generation Type – qa, cot, or summary
  • Num Pairs – Target examples per chunk
  • Max Chunks – Maximum document chunks to process
  • Curate – Enable quality filtering with threshold
💡 Note: First run takes 30-60 seconds for model loading. A snackbar notification appears immediately when you click Start. Subsequent runs are faster.

Publish Tab

Publish datasets and LoRA adapters (Phase 1) to the Hugging Face Hub.

Publish Tab Screenshot

Workflow

  1. Select a Database Session from your scrape history
  2. Configure split ratios (train/validation/test)
  3. Set shuffle and seed for reproducibility
  4. Click Build Dataset
  5. Optionally enable Push to Hub and publish

Publish model (adapter)

If you have a completed training run, you can publish its LoRA adapter:

  1. Select a completed training run
  2. Set Model repo ID and privacy
  3. Click Publish adapter

Split Configuration

  • Seed – Controls shuffling deterministically
  • Shuffle – Whether to shuffle before splitting
  • Validation % – Fraction for validation set
  • Test % – Fraction for test set (remainder becomes train)
  • Min Length – Minimum characters for input/output
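
For reference, the controls above map naturally onto the Hugging Face datasets API. A minimal sketch, where pairs is assumed to be a list of input/output dicts and the split values are examples:

from datasets import Dataset

ds = Dataset.from_list(pairs)
ds = ds.filter(lambda r: len(r["input"]) >= 10 and len(r["output"]) >= 10)  # Min Length

split = ds.train_test_split(test_size=0.05, shuffle=True, seed=42)  # Validation %, Shuffle, Seed
train, validation = split["train"], split["test"]  # the remainder becomes train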

Hub Integration

  • Repo ID – e.g., username/my-dataset
  • Private – Create a private repository
  • HF Token – Your Hugging Face access token

The Push + Upload README button uploads your dataset with an auto-generated dataset card.
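
The push itself corresponds to a standard datasets upload; a sketch continuing from the split example above (repo ID and token are placeholders):

from datasets import DatasetDict

dd = DatasetDict({"train": train, "validation": validation})
dd.push_to_hub("username/my-dataset", private=True, token="hf_...")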

Example: Local Splits Only

  1. Select your database session from the dropdown
  2. Set Validation to 0.05, Test to 0.0
  3. Enable Shuffle, set Seed to 42
  4. Set Save dir to hf_dataset
  5. Click Build Dataset
Offline Mode: Hugging Face Hub actions are disabled (dataset push and adapter publishing).

Training Tab

Fine-tune language models using an Unsloth-based LoRA training stack on RunPod or locally via Docker.

Training Tab Screenshot

Training Targets

RunPod

Cloud GPU training with automated pod and network volume management

Local Docker

Train on your local GPU using the same Unsloth trainer image

Under the Hood

Both targets use docker.io/sbussiso/unsloth-trainer:latest with:

  • PyTorch – Accelerated training on CPU/GPU
  • Hugging Face Transformers – Model loading and tokenization
  • bitsandbytes – 4-bit quantization for memory efficiency
  • PEFT / LoRA – Parameter-efficient fine-tuning via Unsloth

Skill Levels

Beginner Mode

Simplifies choices with safe presets:

  • Fastest (RunPod) – Higher throughput on stronger GPUs
  • Cheapest (RunPod) – Conservative params for smaller GPUs
  • Quick local test – Short run for sanity checks
  • Auto Set (local) – Detects GPU VRAM and aggressively pushes throughput while still aiming to avoid OOM
  • Simple custom – Guided controls for duration, memory/stability, and speed vs quality

Expert Mode

Full control over all hyperparameters for experienced users.

Hyperparameters

  • Base model – Default: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
  • Epochs – Number of training epochs
  • Learning rate – Step size for optimization
  • Batch size – Samples per device per step
  • Gradient accumulation – Steps before weight update
  • Max steps – Upper bound on training steps
  • Packing – Pack multiple short examples for throughput
  • Auto-resume – Continue from latest checkpoint
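
Batch size and gradient accumulation combine into the effective batch size seen by the optimizer. A quick worked example:

# With batch size 2 and gradient accumulation 8 on a single GPU,
# weights update once every 2 * 8 = 16 samples.
effective_batch = 2 * 8 * 1   # per-device batch * accumulation steps * devices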

Quick Local Inference

After a successful local run, the Quick Local Inference panel appears:

  • Sample prompts – 5 random prompts from your training dataset are auto-loaded for quick testing
  • Enter a prompt manually or select a sample from the dropdown
  • Choose presets: Deterministic, Balanced, or Creative
  • Adjust temperature and max tokens with sliders
  • View prompt/response history
  • Export chats – Save your prompt/response history to a text file

The sample prompts feature lets you quickly verify your model learned from the training data without manually copying prompts.

Saving Configurations

Training configs are stored in the SQLite database:

  • Click Save current setup to snapshot your configuration
  • Use the dropdown to load saved configs
  • The last used config auto-loads on startup
  • All configs persist across sessions in finefoundry.db
Offline Mode: RunPod training and Hugging Face Hub actions are disabled; local-only workflows are enforced.

Inference Tab

Run local inference against adapters from completed training runs with prompt history and Full Chat View.

Inference Tab Screenshot

Features

  • Training run selection – Choose a completed training run (FineFoundry loads the adapter path automatically)
  • Instant Validation – Verifies adapter files before loading
  • Dataset Selector – Choose any saved dataset to sample prompts from
  • Sample Prompts – Get 5 random prompts from selected dataset for quick testing
  • Generation Presets – Deterministic, Balanced, Creative, or Custom
  • Full Chat View – Multi-turn conversation dialog with proper chat templates
  • Export Chats – Save prompt/response history to text files
  • Prompt History – Scroll through previous prompts and responses

Sample Prompts

The Inference tab lets you select any saved dataset to sample prompts from:

  1. Select a dataset from the Dataset for sample prompts dropdown
  2. 5 random prompts are loaded into the Sample prompts dropdown
  3. Click the refresh button to get new random samples
  4. Select a sample to automatically fill the prompt text area

Unlike Quick Local Inference (which only uses the training dataset), the Inference tab can test against any dataset in your database.

Adapter Validation

When you select a training run, FineFoundry:

  1. Shows a loading spinner while checking the folder
  2. Verifies the directory contains LoRA artifacts (adapter_config.json, weight files)
  3. If valid: unlocks the Prompt & Responses section
  4. If invalid: shows an error and locks the controls
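
The check is roughly equivalent to the following sketch (not FineFoundry's exact code):

import glob, os

def looks_like_adapter(path):
    has_config = os.path.isfile(os.path.join(path, "adapter_config.json"))
    weights = (glob.glob(os.path.join(path, "*.safetensors"))
               + glob.glob(os.path.join(path, "*.bin")))
    return has_config and bool(weights)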

Generation Controls

  • Preset dropdown – Quick settings for different use cases
  • Temperature slider – Controls randomness (0.0 = deterministic)
  • Max new tokens slider – Upper bound on generated tokens
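
A preset is just a bundle of generation settings; the values below are illustrative, not the app's exact numbers:

PRESETS = {
    "Deterministic": {"temperature": 0.0, "do_sample": False},
    "Balanced":      {"temperature": 0.7, "do_sample": True},
    "Creative":      {"temperature": 1.1, "do_sample": True},
}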

Full Chat View

Click Full Chat View to open a focused chat dialog:

Full Chat View Screenshot
  • Large chat area with user/assistant bubbles
  • Multiline message composer
  • Shared conversation history with main view
  • Proper chat templates for multi-turn conversations
  • Clear history and close buttons

Under the Hood

Powered by the same stack as training:

  • Transformers – AutoModelForCausalLM, AutoTokenizer
  • PEFT – PeftModel for adapter loading
  • bitsandbytes – 4-bit quantization on CUDA
  • Chat Templates – Proper formatting for instruct models (Llama-3.1, etc.)
  • Repetition Penalty – Prevents degenerate/looping outputs
  • 100% local – No external API calls
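
Put together, adapter loading and a single chat-templated generation look roughly like this (model ID and adapter path are placeholders):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, "/path/to/adapter")  # LoRA adapter folder

messages = [{"role": "user", "content": "What is machine learning?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=256, temperature=0.7,
                     do_sample=True, repetition_penalty=1.1)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))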

Merge Datasets Tab

Combine multiple datasets from different sources into a unified training set.

Merge Datasets Tab Screenshot

Use Cases

  • Combining data from multiple scraping sessions
  • Merging database sessions with Hugging Face datasets (when online)
  • Creating larger, more diverse training datasets

Operations

  • Concatenate – Stack all datasets sequentially
  • Interleave – Alternate records for better distribution
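
Both operations exist directly in the Hugging Face datasets library; a minimal sketch, where ds_a and ds_b are assumed to be already-normalized Dataset objects:

from datasets import concatenate_datasets, interleave_datasets

merged = concatenate_datasets([ds_a, ds_b])          # stack sequentially
mixed = interleave_datasets([ds_a, ds_b], seed=42)   # alternate records deterministically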

Supported Sources

  • Database Session – Load from your scrape history
  • Hugging Face – Load from Hub with repo, split, and config (when online)
Offline Mode: Hugging Face dataset sources are disabled; database sessions remain available.

Column Mapping

FineFoundry automatically handles column mapping:

  • Auto-detects common patterns: input/output, prompt/response, question/answer
  • Normalizes all datasets to input/output format
  • Filters rows with empty input or output
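
A sketch of the normalization idea, using the column patterns listed above (FineFoundry's real detector may cover more variants):

CANDIDATES = [("input", "output"), ("prompt", "response"), ("question", "answer")]

def normalize(record):
    for src, dst in CANDIDATES:
        if record.get(src) and record.get(dst):
            return {"input": record[src], "output": record[dst]}
    return None  # row dropped: empty values or unrecognized columns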

Output

  • Database – Merged data saved to a new database session
  • Database + Export JSON – Also export to JSON for external tools

Download Merged Dataset

If you enabled JSON export, click Download Merged Dataset to copy the result to another location.

Dataset Analysis Tab

Interactive insights into your datasets to assess quality before training.

Dataset Analysis Tab Screenshot

Analysis Modules

  • Basic Stats – Record counts, mean lengths
  • Duplicates & Similarity – Approximate duplicate rate
  • Sentiment – Polarity distribution
  • Class Balance – Short/medium/long buckets
  • Data Leakage – Train/test overlap detection
  • Toxicity – Harmful content detection
  • Readability – Text complexity metrics
  • Topics – Topic distribution analysis
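
As a rough illustration of what the first two modules report (the app's implementations are more sophisticated, e.g. near-duplicate similarity rather than exact matches):

from statistics import mean

def basic_stats(pairs):
    return {
        "records": len(pairs),
        "mean_input_len": mean(len(p["input"]) for p in pairs),
        "mean_output_len": mean(len(p["output"]) for p in pairs),
    }

def exact_duplicate_rate(pairs):
    unique = {(p["input"], p["output"]) for p in pairs}
    return 1 - len(unique) / len(pairs)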

Workflow

  1. Select dataset source (Database Session, or Hugging Face when online)
  2. Enable the analysis modules you need
  3. Click Analyze Dataset
  4. Review summary stats and visualizations
Offline Mode: Hugging Face dataset source and Hugging Face inference backend options are disabled.
💡 Best Practice: Run analysis before committing to long training runs. Use Duplicates & Similarity to spot unintentional dataset duplication.

Settings Tab

Centralized configuration for authentication, proxies, and integrations.

Hugging Face Settings

  • HF Token – Paste your access token with read/write permissions
  • Test – Verify connectivity to Hugging Face
  • Save / Remove – Persist or clear the token

RunPod Settings

  • API Key – Your RunPod API key
  • Test – Verify the key works
  • Save / Remove – Persist or clear the key

Proxy Settings

  • Enable proxy – Toggle proxy usage for scrapers
  • Use env proxies – Use system environment variables
  • Proxy URL – e.g., socks5h://127.0.0.1:9050 for Tor
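
For reference, routing a scraper request through the example Tor proxy looks like this with the requests library (SOCKS support needs the requests[socks] extra):

import requests

proxies = {"http": "socks5h://127.0.0.1:9050",
           "https": "socks5h://127.0.0.1:9050"}
resp = requests.get("https://a.4cdn.org/pol/catalog.json", proxies=proxies, timeout=30)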

Ollama Settings (Optional)

  • Enable Ollama – Toggle Ollama integration
  • Base URL – e.g., http://localhost:11434
  • Default model – Model to use for dataset card generation
Settings are stored in a local SQLite database (finefoundry.db) and never sent to external servers.

Offline Mode

When Offline Mode is enabled, FineFoundry disables actions that require external services (Hugging Face Hub operations, Hugging Face dataset sources, and RunPod training). The UI keeps controls visible where helpful, but disables them and shows inline reasons.

Data Storage

FineFoundry uses SQLite (finefoundry.db) as the sole storage mechanism for all application data:

  • Settings – HF token, RunPod API key, Ollama config, proxy
  • Training Configs – Saved hyperparameter configurations
  • Scrape Sessions – History of all scrape runs
  • Scraped Pairs – All input/output pairs from scraping
  • Training Runs – Managed training runs with logs and metadata
  • App Logs – All application logs stored in database

The database is auto-created on first run. There are no filesystem fallbacks or legacy JSON files.
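
Because it is plain SQLite, the database can be inspected with standard tools; the sketch below only lists tables, since the exact schema is an implementation detail:

import sqlite3

con = sqlite3.connect("finefoundry.db")
tables = con.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
print(tables)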

CLI Tools

Command-line tools for automation and scripting.

Dataset Build & Push

Use src/save_dataset.py to build and push datasets:

# Configure constants in the file header, then run:
uv run src/save_dataset.py

Configuration options in the file:

DATA_FILE = "scraped_training_data.json"
SAVE_DIR = "hf_dataset"
SEED = 42
SHUFFLE = True
VAL_SIZE = 0.01
TEST_SIZE = 0.0
MIN_LEN = 1
PUSH_TO_HUB = True
REPO_ID = "username/my-dataset"
PRIVATE = True
HF_TOKEN = None  # uses env HF_TOKEN if None

Reddit Scraper CLI

uv run src/scrapers/reddit_scraper.py \
  --url https://www.reddit.com/r/AskReddit/ \
  --max-posts 50 \
  --mode contextual \
  --k 4 \
  --max-input-chars 2000 \
  --pairs-path reddit_pairs.json \
  --cleanup

Important Options

  • --url – Subreddit or post URL to crawl
  • --max-posts – Maximum posts to process
  • --mode – parent_child or contextual
  • --k – Context depth for contextual mode
  • --pairs-path – Output path for pairs JSON
  • --cleanup – Delete dump folder after copying pairs

When to Use CLI vs GUI

Use GUI for interactive exploration, visual feedback, and managing training runs.

Use CLI for scheduled jobs, CI integration, and reproducible configurations.

Python API

Use FineFoundry programmatically in your own scripts.

4chan Scraper

import sys
sys.path.append("src")

from scrapers.fourchan_scraper import scrape

pairs = scrape(
    board="pol",
    max_threads=150,
    max_pairs=5000,
    mode="contextual",
    strategy="cumulative"
)

# pairs is a list of {"input": ..., "output": ...} dicts

Dataset Builder

import sys
sys.path.append("src")

from db.scraped_data import get_pairs_for_session
from save_dataset import build_dataset_dict, normalize_records

# Load pairs from a database scrape session
pairs = get_pairs_for_session(session_id=1)
examples = normalize_records(pairs, min_len=1)

# Build a DatasetDict with train/validation/test splits
dd = build_dataset_dict(examples, val_size=0.05, test_size=0.0)

Local Inference

import sys
sys.path.append("src")

from helpers.local_inference import generate_text

response = generate_text(
    base_model="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    adapter_path="/path/to/adapter",
    prompt="What is machine learning?",
    temperature=0.7,
    max_new_tokens=256,
)

Docker Deployment

Run local training jobs using Docker containers.

Overview

FineFoundry uses Docker for local training from the Training tab:

  • Training jobs run inside a Docker container
  • A host directory is mounted to /data in the container
  • Checkpoints and outputs are written to that directory

Default Trainer Image

docker.io/sbussiso/unsloth-trainer:latest

This image includes:

  • PyTorch with CUDA support
  • Hugging Face Transformers
  • bitsandbytes for 4-bit quantization
  • PEFT / LoRA via Unsloth

GPU Access

For GPU training, ensure:

  • NVIDIA drivers are installed
  • Docker is configured with NVIDIA runtime
  • Use GPU is enabled in the Training tab
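
For orientation, a roughly equivalent container launch via the Docker SDK for Python (the GUI drives this for you; the host path is a placeholder):

import docker

client = docker.from_env()
client.containers.run(
    "docker.io/sbussiso/unsloth-trainer:latest",
    volumes={"/abs/path/to/host-data": {"bind": "/data", "mode": "rw"}},
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    detach=True,
)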

Running the GUI

The GUI is designed to run on your local machine, not in a container:

# Run the GUI locally
uv run src/main.py

# Training jobs are offloaded to Docker containers

RunPod Setup

Run training jobs on remote GPUs using RunPod.

How It Works

When you select RunPod – Pod as the training target:

  1. FineFoundry connects using your RunPod API key
  2. Ensures a Network Volume exists (mounted at /data)
  3. Ensures a Pod Template exists for your hardware
  4. Launches pods to run training jobs
  5. Writes outputs to /data/outputs/... on the network volume

Prerequisites

  • RunPod account with billing/credits
  • RunPod API key (configure in Settings tab)
  • Available GPU type in your desired region

Step 1: Configure API Key

  1. Open the Settings tab
  2. Paste your API key in RunPod Settings
  3. Click Test to verify, then Save

Step 2: Create Network Volume

In the RunPod console:

  1. Create a Network Volume (size depends on your needs)
  2. Note the volume identifier
  3. In FineFoundry, use Ensure Infrastructure to verify

Step 3: Create Pod Template

Create a template that:

  • Uses docker.io/sbussiso/unsloth-trainer:latest
  • Mounts the Network Volume at /data
  • Has your desired GPU/CPU/RAM resources

Step 4: Launch Training

  1. Set Training target to RunPod – Pod
  2. Configure dataset and hyperparameters
  3. Set Output dir under /data/outputs/...
  4. Start the training job

Troubleshooting

Common issues and solutions.

Installation Issues

Python version mismatch

FineFoundry requires Python 3.10+. Check your version:

python --version

Dependency conflicts

Delete the virtual environment and let uv recreate it:

rm -rf .venv
uv run src/main.py

Training Issues

CUDA Out of Memory (OOM)

  • Reduce batch size
  • Increase gradient accumulation
  • Use a smaller base model
  • Enable packing for short examples

Exit code 137

The container was killed due to memory limits. Reduce batch size or use a machine with more RAM.

Authentication Issues

Hugging Face token not working

  • Verify the token has write permissions
  • Use the Test button in Settings
  • Try setting HF_TOKEN environment variable

RunPod API key issues

  • Verify the key in the RunPod console
  • Check that billing/credits are set up
  • Use the Test button in Settings

Inference Issues

Adapter validation fails

  • Select a completed training run (the adapter path is loaded automatically)
  • Verify the folder contains adapter_config.json
  • Check for weight files (*.safetensors or *.bin)

Upgrade Notes

Returning to FineFoundry after using an older version? These are the key behavior changes to know before following older tutorials.

Major changes

  • Database-first workflows: Scrape sessions, training configs, training runs, logs, and settings live in finefoundry.db.
  • Publish is database-session based: The GUI builds datasets from database scrape sessions (Hub push is optional).
  • Inference is training-run based: Select a completed training run; FineFoundry loads and validates its adapter automatically.
  • Offline Mode gating: Disables Hugging Face Hub actions, Hugging Face dataset sources, and RunPod training; network sources in the Data Sources tab are also disabled.
  • Dependency management: The repo uses uv and pyproject.toml; requirements.txt is deprecated.

Full details (core docs): Upgrade Notes