Quick Start Guide
Go from a cloned repo to a running desktop app and your first dataset in minutes.
Prerequisites
- Python 3.10+ (Windows, macOS, or Linux)
- Git (optional, for cloning)
- uv (recommended) or pip for package management
Optional for publishing: a Hugging Face account and an access token with write permissions
Option 1: Using uv (Recommended)
# Clone the repository
git clone https://github.com/SourceBox-LLC/FineFoundry.git FineFoundry-Core
cd FineFoundry-Core
# Install uv if needed
pip install uv
# Run the application (uv handles dependencies automatically)
uv run src/main.py
Option 2: Using pip
# Clone the repository
git clone https://github.com/SourceBox-LLC/FineFoundry.git FineFoundry-Core
cd FineFoundry-Core
# Create and activate virtual environment
python -m venv venv
# Windows (PowerShell)
.\venv\Scripts\Activate.ps1
# macOS/Linux
source venv/bin/activate
# Install dependencies
pip install -e .
# Run the application
python src/main.py
First Launch
When you launch FineFoundry, you'll see a desktop application with these tabs:
- Data Sources – Collect training data from 4chan, Reddit, Stack Exchange, or synthetic generation
- Dataset Analysis – Analyze dataset quality
- Merge Datasets – Combine multiple datasets
- Training – Fine-tune models on RunPod or locally via Docker
- Inference – Run inference against adapters from completed training runs
- Publish – Publish datasets and LoRA adapters to Hugging Face
- Settings – Configure authentication and preferences
Your First Dataset
Step 1: Scrape Data
- Navigate to the Data Sources tab
- Select a few boards (e.g., pol, b, x)
- Configure parameters:
- Max Threads: 50
- Max Pairs: 500
- Delay: 0.5 seconds
- Min Length: 10 characters
- Click Start
- When complete, click Preview Dataset
Step 2: Publish (Optional)
- Navigate to the Publish tab
- Configure split ratios with sliders
- Click Build Dataset
- To publish: enable Push to Hub, set Repo ID, and click Push + Upload README
Installation
Detailed installation instructions for all platforms.
System Requirements
- OS: Windows 10+, macOS 11+, or Linux (Ubuntu 20.04+)
- Python: 3.10 or higher
- RAM: 8GB minimum, 16GB+ recommended for training
- GPU: Optional but recommended for local training (NVIDIA with CUDA support)
Installing uv
uv is a fast Python package manager that handles dependencies automatically:
# Install uv
pip install uv
# Verify installation
uv --version
Verifying Installation
# Check Python version
python --version # Should be 3.10+
# Run FineFoundry
uv run src/main.py
💡 Tip: If you encounter dependency issues, try deleting the .venv folder and running uv run src/main.py again. uv will recreate the environment with fresh dependencies.
Data Sources Tab
Collect conversational training data from multiple sources and prepare it as input/output pairs.
Supported Sources
4chan
Multi-board scraping with quote-chain and cumulative pairing modes
Reddit
Subreddits or single posts with parent-child threading
Stack Exchange
Q&A pairs from accepted answers
Synthetic
Generate Q&A, CoT, or summaries from PDFs/docs using local LLMs
Parameters
- Max Threads – Number of threads per board to sample
- Max Pairs – Upper bound on input/output pairs to extract
- Delay (s) – Polite delay between HTTP requests
- Min Length – Minimum character count per side
- Mode – normal (adjacent posts) or contextual
- Strategy (contextual only) – quote_chain, cumulative, or last_k
- K – Context depth for contextual mode
- Max Input Chars – Optional truncation of long contexts
Pairing Modes
Normal Mode
Creates pairs from adjacent posts. Simple and fast, but loses conversational context.
Contextual Mode
Builds context from the conversation thread:
- quote_chain – Follows reply chains via quote references
- cumulative – Accumulates all previous posts as context (sketched below)
- last_k – Uses the last K posts as context
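To make the strategies concrete, here is a minimal sketch of cumulative pairing. It is illustrative only, not FineFoundry's internal code: each post becomes an output, with all earlier posts joined as its input.

# Illustrative cumulative pairing over a three-post thread.
posts = ["First post", "Reply one", "Reply two"]

pairs = []
for i in range(1, len(posts)):
    context = "\n".join(posts[:i])  # everything before post i
    pairs.append({"input": context, "output": posts[i]})

# pairs[-1] == {"input": "First post\nReply one", "output": "Reply two"}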
Output Format
[
  {"input": "What do you think about...", "output": "I believe that..."},
  {"input": "Can you explain...", "output": "Sure, here's how..."}
]
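The pairs are plain JSON, so they load directly into a Hugging Face Dataset if you want to work with them outside the GUI (the filename below matches the CLI default shown later):

# Load scraped pairs into a Hugging Face Dataset.
import json
from datasets import Dataset

with open("scraped_training_data.json", encoding="utf-8") as f:
    pairs = json.load(f)

ds = Dataset.from_list(pairs)  # columns: input, output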
💡 Best Practice: Start with smaller runs (50 threads, 500 pairs) to validate your configuration before scaling up.
Offline Mode: Only the Synthetic data source can be used. Network sources (4chan/Reddit/Stack Exchange) are disabled.
Synthetic Data Generation
Generate training data from your own documents using local LLMs powered by Unsloth's SyntheticDataKit.
Supported Input Formats
- PDF documents
- DOCX (Word documents)
- PPTX (PowerPoint)
- HTML/HTM web pages
- TXT plain text
- URLs (fetched and parsed)
Generation Types
- qa – Question-answer pairs from document content
- cot – Chain-of-thought reasoning examples
- summary – Document summaries
Synthetic Parameters
- Model – Local LLM to use (default: unsloth/Llama-3.2-3B-Instruct)
- Generation Type – qa, cot, or summary
- Num Pairs – Target examples per chunk
- Max Chunks – Maximum document chunks to process
- Curate – Enable quality filtering with threshold
💡 Note: First run takes 30-60 seconds for model loading. A snackbar notification appears immediately when you click Start. Subsequent runs are faster.
Publish Tab
Publish datasets and (Phase 1) LoRA adapters to the Hugging Face Hub.
Workflow
- Select a Database Session from your scrape history
- Configure split ratios (train/validation/test)
- Set shuffle and seed for reproducibility
- Click Build Dataset
- Optionally enable Push to Hub and publish
Publish model (adapter)
If you have a completed training run, you can publish its LoRA adapter:
- Select a completed training run
- Set Model repo ID and privacy
- Click Publish adapter
Split Configuration
- Seed – Controls shuffling deterministically
- Shuffle – Whether to shuffle before splitting
- Validation % – Fraction for validation set
- Test % – Fraction for test set; the remainder becomes train (see the example below)
- Min Length – Minimum characters for input/output
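As a rough guide to how the fractions divide a dataset (exact rounding is handled internally and may differ by a row or two):

# 1,000 records with Validation 0.05 and Test 0.0; the remainder is train.
n = 1000
val_size, test_size = 0.05, 0.0
n_val = int(n * val_size)       # 50
n_test = int(n * test_size)     # 0
n_train = n - n_val - n_test    # 950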
Hub Integration
- Repo ID – e.g., username/my-dataset
- Private – Create a private repository
- HF Token – Your Hugging Face access token
The Push + Upload README button uploads your dataset with an auto-generated dataset card.
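The button wraps the standard Hugging Face flow. If you prefer to script the push yourself, the rough equivalent with the datasets library looks like this (the repo ID is a placeholder; the auto-generated dataset card is a FineFoundry extra):

# Push a dataset built by FineFoundry (saved under hf_dataset/) to the Hub.
from datasets import load_from_disk

dd = load_from_disk("hf_dataset")  # directory created by Build Dataset
dd.push_to_hub(
    "username/my-dataset",         # placeholder repo ID
    private=True,
    token=None,                    # None falls back to HF_TOKEN / cached login
)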
Example: Local Splits Only
- Select your database session from the dropdown
- Set Validation to 0.05, Test to 0.0
- Enable Shuffle, set Seed to 42
- Set Save dir to hf_dataset
- Click Build Dataset
Offline Mode: Hugging Face Hub actions are disabled (dataset push and adapter publishing).
Training Tab
Fine-tune language models using an Unsloth-based LoRA training stack on RunPod or locally via Docker.
Training Targets
RunPod
Cloud GPU training with automated pod and network volume management
Local Docker
Train on your local GPU using the same Unsloth trainer image
Under the Hood
Both targets use docker.io/sbussiso/unsloth-trainer:latest with:
- PyTorch – Accelerated training on CPU/GPU
- Hugging Face Transformers – Model loading and tokenization
- bitsandbytes – 4-bit quantization for memory efficiency
- PEFT / LoRA – Parameter-efficient fine-tuning via Unsloth
Skill Levels
Beginner Mode
Simplifies choices with safe presets:
- Fastest (RunPod) – Higher throughput on stronger GPUs
- Cheapest (RunPod) – Conservative params for smaller GPUs
- Quick local test – Short run for sanity checks
- Auto Set (local) – Detects GPU VRAM and aggressively pushes throughput while still aiming to avoid OOM
- Simple custom – Guided controls for duration, memory/stability, and speed vs quality
Expert Mode
Full control over all hyperparameters for experienced users.
Hyperparameters
- Base model – Default: unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit
- Epochs – Number of training epochs
- Learning rate – Step size for optimization
- Batch size – Samples per device per step
- Gradient accumulation – Steps before weight update (see the note after this list)
- Max steps – Upper bound on training steps
- Packing – Pack multiple short examples for throughput
- Auto-resume – Continue from latest checkpoint
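One relationship worth knowing: the effective batch size is the per-device batch size multiplied by the gradient accumulation steps. If you hit out-of-memory errors, halving the batch size and doubling the accumulation keeps the effective batch (and training dynamics) roughly unchanged:

# Effective batch size = per-device batch size x gradient accumulation steps.
batch_size = 2                  # samples per device per step
grad_accum = 8                  # steps before each weight update
print(batch_size * grad_accum)  # 16 samples contribute to each weight update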
Quick Local Inference
After a successful local run, the Quick Local Inference panel appears:
- Sample prompts – 5 random prompts from your training dataset are auto-loaded for quick testing
- Enter a prompt manually or select a sample from the dropdown
- Choose presets: Deterministic, Balanced, or Creative
- Adjust temperature and max tokens with sliders
- View prompt/response history
- Export chats – Save your prompt/response history to a text file
The sample prompts feature lets you quickly verify your model learned from the training data without manually copying prompts.
Saving Configurations
Training configs are stored in the SQLite database:
- Click Save current setup to snapshot your configuration
- Use the dropdown to load saved configs
- The last used config auto-loads on startup
- All configs persist across sessions in finefoundry.db
Offline Mode: RunPod training and Hugging Face Hub actions are disabled; local-only workflows are enforced.
Inference Tab
Run local inference against adapters from completed training runs with prompt history and Full Chat View.
Features
- Training run selection – Choose a completed training run (FineFoundry loads the adapter path automatically)
- Instant Validation – Verifies adapter files before loading
- Dataset Selector – Choose any saved dataset to sample prompts from
- Sample Prompts – Get 5 random prompts from selected dataset for quick testing
- Generation Presets – Deterministic, Balanced, Creative, or Custom
- Full Chat View – Multi-turn conversation dialog with proper chat templates
- Export Chats – Save prompt/response history to text files
- Prompt History – Scroll through previous prompts and responses
Sample Prompts
The Inference tab lets you select any saved dataset to sample prompts from:
- Select a dataset from the Dataset for sample prompts dropdown
- 5 random prompts are loaded into the Sample prompts dropdown
- Click the refresh button to get new random samples
- Select a sample to automatically fill the prompt text area
Unlike Quick Local Inference (which only uses the training dataset), the Inference tab can test against any dataset in your database.
Adapter Validation
When you select a training run, FineFoundry:
- Shows a loading spinner while checking the folder
- Verifies the directory contains LoRA artifacts (adapter_config.json, weight files)
- If valid: unlocks the Prompt & Responses section
- If invalid: shows an error and locks the controls
Generation Controls
- Preset dropdown – Quick settings for different use cases
- Temperature slider – Controls randomness (0.0 = deterministic)
- Max new tokens slider – Upper bound on generated tokens
Full Chat View
Click Full Chat View to open a focused chat dialog:
- Large chat area with user/assistant bubbles
- Multiline message composer
- Shared conversation history with main view
- Proper chat templates for multi-turn conversations
- Clear history and close buttons
Under the Hood
Powered by the same stack as training (a minimal sketch follows the list):
- Transformers – AutoModelForCausalLM, AutoTokenizer
- PEFT – PeftModel for adapter loading
- bitsandbytes – 4-bit quantization on CUDA
- Chat Templates – Proper formatting for instruct models (Llama-3.1, etc.)
- Repetition Penalty – Prevents degenerate/looping outputs
- 100% local – No external API calls
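A minimal load-and-generate sketch of that stack, with placeholder paths; FineFoundry's own loader adds adapter validation, 4-bit quantization, and chat templating on top:

# Sketch only: load a base model, attach a LoRA adapter, and generate.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")
model = PeftModel.from_pretrained(model, "/path/to/adapter")  # adapter directory

inputs = tokenizer("What is machine learning?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, temperature=0.7,
                     do_sample=True, repetition_penalty=1.1)
print(tokenizer.decode(out[0], skip_special_tokens=True))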
Merge Datasets Tab
Combine multiple datasets from different sources into a unified training set.
Use Cases
- Combining data from multiple scraping sessions
- Merging database sessions with Hugging Face datasets (when online)
- Creating larger, more diverse training datasets
Operations
- Concatenate – Stack all datasets sequentially
- Interleave – Alternate records for better distribution (both operations are sketched below)
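Both operations map onto the Hugging Face datasets library. A toy illustration (not FineFoundry internals):

# Concatenate stacks records sequentially; interleave alternates between sources.
from datasets import Dataset, concatenate_datasets, interleave_datasets

a = Dataset.from_list([{"input": "q1", "output": "a1"}])
b = Dataset.from_list([{"input": "q2", "output": "a2"}])

merged = concatenate_datasets([a, b])  # all rows of a, then all rows of b
mixed = interleave_datasets([a, b])    # a row, b row, a row, ...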
Supported Sources
- Database Session – Load from your scrape history
- Hugging Face – Load from Hub with repo, split, and config (when online)
Offline Mode: Hugging Face dataset sources are disabled; database sessions remain available.
Column Mapping
FineFoundry automatically handles column mapping; a simplified sketch follows this list:
- Auto-detects common patterns: input/output, prompt/response, question/answer
- Normalizes all datasets to input/output format
- Filters rows with empty input or output
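A simplified sketch of the idea; the alias table covers the documented patterns, while FineFoundry's actual detection logic is internal:

# Map common column aliases onto input/output and drop incomplete rows.
ALIASES = {"prompt": "input", "question": "input",
           "response": "output", "answer": "output"}

def normalize(row):
    mapped = {ALIASES.get(k, k): v for k, v in row.items()}
    if mapped.get("input") and mapped.get("output"):
        return mapped
    return None  # row is filtered out

print(normalize({"question": "2+2?", "answer": "4"}))
# {'input': '2+2?', 'output': '4'}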
Output
- Database – Merged data saved to a new database session
- Database + Export JSON – Also export to JSON for external tools
Download Merged Dataset
If you enabled JSON export, click Download Merged Dataset to copy the result to another location.
Dataset Analysis Tab
Interactive insights into your datasets to assess quality before training.
Analysis Modules
- Basic Stats – Record counts, mean lengths
- Duplicates & Similarity – Approximate duplicate rate
- Sentiment – Polarity distribution
- Class Balance – Short/medium/long buckets
- Data Leakage – Train/test overlap detection
- Toxicity – Harmful content detection
- Readability – Text complexity metrics
- Topics – Topic distribution analysis
Workflow
- Select a dataset source (Database Session, or Hugging Face when online)
- Enable the analysis modules you need
- Click Analyze Dataset
- Review summary stats and visualizations
Offline Mode: Hugging Face dataset source and Hugging Face inference backend options are disabled.
💡 Best Practice: Run analysis before committing to long training runs. Use Duplicates & Similarity to spot unintentional dataset duplication.
Settings Tab
Centralized configuration for authentication, proxies, and integrations.
Hugging Face Settings
- HF Token – Paste your access token with read/write permissions
- Test – Verify connectivity to Hugging Face
- Save / Remove – Persist or clear the token
RunPod Settings
- API Key – Your RunPod API key
- Test – Verify the key works
- Save / Remove – Persist or clear the key
Proxy Settings
- Enable proxy – Toggle proxy usage for scrapers
- Use env proxies – Use system environment variables
- Proxy URL – e.g., socks5h://127.0.0.1:9050 for Tor (see the example below)
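FineFoundry passes this URL to its scrapers internally. For reference, the same URL in a plain requests call (requires the requests[socks] extra) looks like:

# Route a request through a local Tor SOCKS proxy.
import requests

proxies = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}
resp = requests.get("https://example.com", proxies=proxies, timeout=30)
print(resp.status_code)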
Ollama Settings (Optional)
- Enable Ollama – Toggle Ollama integration
- Base URL – e.g., http://localhost:11434
- Default model – Model to use for dataset card generation
Settings are stored in a local SQLite database (finefoundry.db) and never sent to external servers.
Offline Mode
When Offline Mode is enabled, FineFoundry disables actions that require external services (Hugging Face Hub operations, Hugging Face dataset sources, and RunPod training). The UI keeps controls visible where helpful, but disables them and shows inline reasons.
Data Storage
FineFoundry uses SQLite (finefoundry.db) as the sole storage mechanism for all application data:
- Settings – HF token, RunPod API key, Ollama config, proxy
- Training Configs – Saved hyperparameter configurations
- Scrape Sessions – History of all scrape runs
- Scraped Pairs – All input/output pairs from scraping
- Training Runs – Managed training runs with logs and metadata
- App Logs – All application logs stored in database
The database is auto-created on first run. There are no filesystem fallbacks or legacy JSON files.
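If you want to peek inside the database yourself, open it read-only and list its tables rather than assuming a schema (table names may change between versions):

# Read-only inspection of finefoundry.db.
import sqlite3

con = sqlite3.connect("file:finefoundry.db?mode=ro", uri=True)
for (name,) in con.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
con.close()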
CLI Tools
Command-line tools for automation and scripting.
Dataset Build & Push
Use src/save_dataset.py to build and push datasets:
# Configure constants in the file header, then run:
uv run src/save_dataset.py
Configuration options in the file:
DATA_FILE = "scraped_training_data.json"
SAVE_DIR = "hf_dataset"
SEED = 42
SHUFFLE = True
VAL_SIZE = 0.01
TEST_SIZE = 0.0
MIN_LEN = 1
PUSH_TO_HUB = True
REPO_ID = "username/my-dataset"
PRIVATE = True
HF_TOKEN = None # uses env HF_TOKEN if None
Reddit Scraper CLI
uv run src/scrapers/reddit_scraper.py \
  --url https://www.reddit.com/r/AskReddit/ \
  --max-posts 50 \
  --mode contextual \
  --k 4 \
  --max-input-chars 2000 \
  --pairs-path reddit_pairs.json \
  --cleanup
Important Options
--url – Subreddit or post URL to crawl
--max-posts – Maximum posts to process
--mode – parent_child or contextual
--k – Context depth for contextual mode
--pairs-path – Output path for pairs JSON
--cleanup – Delete dump folder after copying pairs
When to Use CLI vs GUI
Use GUI for interactive exploration, visual feedback, and managing training runs.
Use CLI for scheduled jobs, CI integration, and reproducible configurations.
Python API
Use FineFoundry programmatically in your own scripts.
4chan Scraper
import sys
sys.path.append("src")
from scrapers.fourchan_scraper import scrape
pairs = scrape(
    board="pol",
    max_threads=150,
    max_pairs=5000,
    mode="contextual",
    strategy="cumulative",
)
# pairs is a list of {"input": ..., "output": ...} dicts
Dataset Builder
import sys
sys.path.append("src")
from db.scraped_data import get_pairs_for_session
from save_dataset import build_dataset_dict, normalize_records
# Load pairs from a database scrape session
pairs = get_pairs_for_session(session_id=1)
examples = normalize_records(pairs, min_len=1)
# Build a DatasetDict with train/validation/test splits
dd = build_dataset_dict(examples, val_size=0.05, test_size=0.0)
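The result is a standard DatasetDict, so the usual datasets API applies if you want to persist it:

# Save the splits locally (or push with dd.push_to_hub(...) when online).
dd.save_to_disk("hf_dataset")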
Local Inference
import sys
sys.path.append("src")
from helpers.local_inference import generate_text
response = generate_text(
    base_model="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    adapter_path="/path/to/adapter",
    prompt="What is machine learning?",
    temperature=0.7,
    max_new_tokens=256,
)
Docker Deployment
Run local training jobs using Docker containers.
Overview
FineFoundry uses Docker for local training from the Training tab:
- Training jobs run inside a Docker container
- A host directory is mounted to /data in the container
- Checkpoints and outputs are written to that directory
Default Trainer Image
docker.io/sbussiso/unsloth-trainer:latest
This image includes:
- PyTorch with CUDA support
- Hugging Face Transformers
- bitsandbytes for 4-bit quantization
- PEFT / LoRA via Unsloth
GPU Access
For GPU training, ensure:
- NVIDIA drivers are installed
- Docker is configured with NVIDIA runtime
- Use GPU is enabled in the Training tab
Running the GUI
The GUI is designed to run on your local machine, not in a container:
# Run the GUI locally
uv run src/main.py
# Training jobs are offloaded to Docker containers
RunPod Setup
Run training jobs on remote GPUs using RunPod.
How It Works
When you select RunPod – Pod as the training target:
- FineFoundry connects using your RunPod API key
- Ensures a Network Volume exists (mounted at /data)
- Ensures a Pod Template exists for your hardware
- Launches pods to run training jobs
- Writes outputs to /data/outputs/... on the network volume
Prerequisites
- RunPod account with billing/credits
- RunPod API key (configure in Settings tab)
- Available GPU type in your desired region
Step 1: Configure API Key
- Open the Settings tab
- Paste your API key in RunPod Settings
- Click Test to verify, then Save
Step 2: Create Network Volume
In the RunPod console:
- Create a Network Volume (size depends on your needs)
- Note the volume identifier
- In FineFoundry, use Ensure Infrastructure to verify
Step 3: Create Pod Template
Create a template that:
- Uses docker.io/sbussiso/unsloth-trainer:latest
- Mounts the Network Volume at /data
Step 4: Launch Training
- Set Training target to RunPod – Pod
- Configure dataset and hyperparameters
- Set Output dir under /data/outputs/...
- Start the training job
Troubleshooting
Common issues and solutions.
Installation Issues
Python version mismatch
FineFoundry requires Python 3.10+. Check your version:
python --version
Dependency conflicts
Delete the virtual environment and let uv recreate it:
rm -rf .venv
uv run src/main.py
Training Issues
CUDA Out of Memory (OOM)
- Reduce batch size
- Increase gradient accumulation
- Use a smaller base model
- Enable packing for short examples
Exit code 137
The container was killed due to memory limits. Reduce batch size or use a machine with more RAM.
Authentication Issues
Hugging Face token not working
- Verify the token has write permissions
- Use the Test button in Settings
- Try setting the HF_TOKEN environment variable
RunPod API key issues
- Verify the key in the RunPod console
- Check that billing/credits are set up
- Use the Test button in Settings
Inference Issues
Adapter validation fails
- Select a completed training run (the adapter path is loaded automatically)
- Verify the folder contains adapter_config.json
- Check for weight files (*.safetensors or *.bin)
Getting Help
Upgrade Notes
Returning to FineFoundry after using an older version? These are the key behavior changes to know before following older tutorials.
Major changes
- Database-first workflows: Scrape sessions, training configs, training runs, logs, and settings live in finefoundry.db.
- Publish is database-session based: The GUI builds datasets from database scrape sessions (Hub push is optional).
- Inference is training-run based: Select a completed training run; FineFoundry loads and validates its adapter automatically.
- Offline Mode gating: Disables Hugging Face Hub actions, Hugging Face dataset sources, and RunPod training; network sources in the Data Sources tab are also disabled.
- Dependency management: The repo uses uv and pyproject.toml; requirements.txt is deprecated.
Full details are in the core docs' Upgrade Notes.