FunctionGemma 270M Finetuning Project

A beginner-friendly, containerized environment for finetuning Google's FunctionGemma 270M model using the Mobile Actions dataset.

This project replicates the official Gemma Cookbook finetuning recipe within a Docker Compose infrastructure, making it reproducible and accessible.

What is FunctionGemma?

FunctionGemma is a compact 270M parameter model designed for function calling - the ability to parse user requests and map them to specific tool/API calls. This project finetunes it on mobile device actions like opening apps, making calls, sending messages, etc.

Why this matters for beginners:

  • Small model size (270M params) means faster training and lower GPU memory requirements
  • Function calling is a practical, in-demand skill for AI applications
  • Containerized setup eliminates "it works on my machine" problems

Prerequisites

Before you begin, ensure your host machine has the following installed:

1. Docker and Docker Compose

Docker Compose v2.x or later is required for GPU resource configuration.

# Check your versions
docker --version          # Should be 20.10+
docker compose version    # Should be v2.x+

Install Docker: https://docs.docker.com/get-docker/

2. NVIDIA GPU Driver

You need a CUDA-capable GPU with compute capability 8.0+ (Ampere architecture or newer).

# Verify your GPU is recognized
nvidia-smi

Compatible GPUs include:

  • NVIDIA A100, A10, A6000 (Data center)
  • NVIDIA RTX 3080/3090/4080/4090 (Consumer)
  • Minimum 16GB VRAM recommended

Install drivers: https://www.nvidia.com/drivers

3. NVIDIA Container Toolkit

This allows Docker containers to access your GPU.

# Verify installation
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If this fails, install the toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

4. HuggingFace Account and Token

You need a HuggingFace account with:

  1. An API token with read access
  2. Gemma license agreement accepted

Getting your token:

  1. Create account at https://huggingface.co
  2. Go to https://huggingface.co/settings/tokens
  3. Create a new token with Read access
  4. Accept the Gemma license at https://huggingface.co/google/functiongemma-270m-it

Quick Start

Get training running in just a few steps:

Step 1: Clone and Enter the Project

git clone <your-repo-url>
cd functiongemma

Step 2: Configure Environment

Copy the example environment file and add your HuggingFace token:

cp .env.example .env

Edit .env and replace the placeholder with your actual token:

HF_TOKEN=hf_your_actual_token_here

Optional: Cache Control Settings

The training script supports optional environment variables to control caching behavior:

# Dataset caching (in .env file)
DATASET_MODE=cache    # Use cached dataset if available (default)
DATASET_MODE=fresh    # Force fresh download

# Model weights caching (in .env file)
WEIGHTS_MODE=cache    # Use cached trained model if available (default)
WEIGHTS_MODE=fresh    # Force retraining

These settings allow you to:

  • Skip re-downloading the dataset if it's already cached
  • Skip re-training if a trained model already exists
  • Force fresh operations when you want to start over
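
For reference, here is a minimal sketch of how a training script can honor these flags. Only the DATASET_MODE and WEIGHTS_MODE variable names come from this project; the rest is illustrative, so check train.py for the actual logic.

import os

# Illustrative only: read the cache-control flags with "cache" as the default,
# matching the behavior described above.
DATASET_MODE = os.environ.get("DATASET_MODE", "cache")   # "cache" or "fresh"
WEIGHTS_MODE = os.environ.get("WEIGHTS_MODE", "cache")   # "cache" or "fresh"

force_dataset_download = DATASET_MODE == "fresh"
force_retraining = WEIGHTS_MODE == "fresh" or not os.path.isdir("./results/final_model")

print(f"Force dataset re-download: {force_dataset_download}")
print(f"Run training from scratch: {force_retraining}")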

Step 3: Build and Run Training

docker compose up --build

This will:

  1. Build the Docker container with all dependencies
  2. Download the FunctionGemma model and Mobile Actions dataset
  3. Start the finetuning process

First run note: Initial model and dataset downloads may take 5-10 minutes depending on your connection.

Step 4: Monitor Progress

Training logs will stream to your terminal. You should see:

  • GPU validation checks passing
  • Model and dataset loading
  • Training progress with decreasing loss values

To run in background mode:

docker compose up --build -d
docker compose logs -f training

Step 5: Find Your Results

After training completes, your finetuned model will be in the ./results directory:

./results/
  checkpoint-xxx/     # Intermediate checkpoints
  final_model/        # Final finetuned model directory
    config.json       # Model configuration
    model.safetensors # Model weights
    tokenizer.json    # Tokenizer configuration

Project Structure

functiongemma/
├── docker-compose.yml    # Container orchestration with GPU passthrough
├── Dockerfile            # Training environment build instructions
├── train.py              # Main finetuning script with all training logic
├── requirements.txt      # Pinned Python dependencies for reproducibility
├── .env.example          # Template for environment variables
├── README.md             # This documentation file
├── results/              # Training outputs (created at runtime)
└── cache/                # HuggingFace cache (created at runtime)

What Each File Does

File                  Purpose
docker-compose.yml    Defines the training service with GPU access, volume mounts, and environment variables
Dockerfile            Builds the container with Python, PyTorch, CUDA, and ML libraries
train.py              Loads the model, processes the dataset, configures training, and runs the training loop
requirements.txt      Lists exact versions of all Python packages for reproducible builds
.env.example          Shows required environment variables (copy to .env and fill in)

Understanding the Training Configuration

This section explains the key configuration choices and why they matter. Understanding these settings will help you debug issues and adapt the configuration for other models.

Model Loading (Critical Settings)

model = AutoModelForCausalLM.from_pretrained(
    'google/functiongemma-270m-it',
    device_map='auto',
    torch_dtype=torch.bfloat16,        # Required for stable training
    attn_implementation='eager'         # Required to avoid NaN gradients
)

Why attn_implementation='eager' is MANDATORY

The Problem: FunctionGemma produces NaN (Not a Number) values when using optimized attention implementations like Flash Attention 2 or SDPA (Scaled Dot-Product Attention). NaN values corrupt the entire training run - once they appear, your model is effectively broken.

What causes this? These optimized attention mechanisms use memory-efficient kernels that make assumptions about tensor layouts and numerical precision. Some model architectures (like FunctionGemma's) don't align perfectly with these assumptions, leading to numerical instability during the attention computation.

The Solution: attn_implementation='eager' forces PyTorch to use the standard, mathematically explicit attention calculation. It's slower (roughly a 10-20% performance hit), but it computes the same attention operation as the optimized kernels while staying numerically stable.

How to verify: If you see loss: nan in your training logs, this is almost always the cause. Switch to 'eager' mode immediately.
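
If you prefer to fail fast instead of scanning logs, a small guard like the following (not part of train.py, just an illustration) stops the run the moment the loss becomes non-finite:

import torch

# Illustration only: abort as soon as the loss stops being finite, before NaN
# gradients propagate into the weights.
def assert_finite_loss(loss: torch.Tensor, step: int) -> None:
    if not torch.isfinite(loss).all():
        raise RuntimeError(
            f"Non-finite loss at step {step}. Check attn_implementation='eager' "
            "and torch_dtype=torch.bfloat16 in train.py."
        )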

Why torch_dtype=torch.bfloat16 is MANDATORY

The Problem: Using float16 (FP16) with FunctionGemma produces erratic, unstable model outputs. The model may appear to train but generates nonsensical text.

What causes this? FP16 has limited dynamic range (max ~65,504). When gradients or activations exceed this range, they overflow to infinity or underflow to zero. BFloat16 has the same range as FP32 (max ~3.4×10³⁸) but with reduced precision - a much better tradeoff for neural network training.

Why not FP32? Full 32-bit precision would also be numerically stable, but it roughly doubles the memory needed for weights, gradients, activations, and optimizer state, leaving little headroom on a 16GB GPU. BFloat16 gives you FP32's range with FP16's memory footprint.

Hardware requirement: BFloat16 requires compute capability 8.0+ (Ampere GPUs). Older GPUs like RTX 2080 don't support it natively.
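
You can confirm this before launching a long run with a couple of standard PyTorch calls:

import torch

# Standard PyTorch checks: bfloat16 training needs compute capability 8.0+.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU visible to PyTorch.")

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
if (major, minor) < (8, 0):
    raise SystemExit("GPU is older than Ampere; native bfloat16 is not supported.")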

Why device_map='auto'

This automatically places model layers on available GPU(s). For single-GPU setups, it just puts everything on your GPU. For multi-GPU, it distributes layers across devices. This is simpler than manual device assignment and handles edge cases automatically.
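
If you want to confirm where the layers ended up, recent Transformers/Accelerate versions record the placement on the model after loading with a device_map; this standalone snippet (an assumption about your environment, not project code) prints it:

import torch
from transformers import AutoModelForCausalLM

# Assumption: a recent Transformers/Accelerate install. After loading with
# device_map='auto', the chosen placement is stored on the model.
model = AutoModelForCausalLM.from_pretrained(
    'google/functiongemma-270m-it',
    device_map='auto',
    torch_dtype=torch.bfloat16,
    attn_implementation='eager',
)
print(model.hf_device_map)   # e.g. {'': 0} on a single-GPU machine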

Training Hyperparameters

training_args = SFTConfig(
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    lr_scheduler_type='cosine',
    gradient_checkpointing=True,
    bf16=True,
)

These hyperparameters come from the official Gemma Cookbook recipe. Here's why each matters:

Why gradient_checkpointing=True is CRITICAL for 16GB GPUs

The Problem: During training, PyTorch saves all intermediate activations (outputs from each layer) for the backward pass. For a 270M parameter model, this can easily consume 12+ GB of VRAM on top of the model weights.

The Solution: Gradient checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass. This trades compute time (~20-30% slower) for memory savings (~40-60% reduction).

When to disable: Only if you have abundant VRAM (32GB+) and want maximum training speed.

Why gradient_accumulation_steps=8

The Problem: Large batch sizes improve training stability but require more VRAM. A batch size of 32 might not fit in memory.

The Solution: Process 4 samples at a time (per_device_train_batch_size=4), but accumulate gradients over 8 such mini-batches before updating weights. This gives you an effective batch size of 32 (4 × 8) while only needing VRAM for 4 samples.

The math: effective_batch_size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus
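
With this project's values on a single GPU, that works out to:

# Worked example using the recipe's values (single GPU):
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)   # 32 samples contribute to each optimizer update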

Why learning_rate=1e-5 (0.00001)

Finetuning uses much smaller learning rates than training from scratch:

  • From scratch: ~1e-3 to 1e-4 (the model knows nothing, big steps are fine)
  • Finetuning: ~1e-5 to 1e-6 (the model already knows a lot, small adjustments only)

A learning rate of 1e-5 is conservative - it will converge reliably but may take more epochs. If training is too slow, try 2e-5 or 3e-5.

Why lr_scheduler_type='cosine'

The learning rate decreases along a cosine curve - slowly at first, fastest in the middle of training, then flattening out near the end. This helps the model "lock in" its learning as training progresses. The alternative (linear decay) works too, but cosine often produces slightly better results.

Why bf16=True

Enables bfloat16 mixed precision training throughout the training loop (not just model loading). This matches our torch_dtype=torch.bfloat16 setting and ensures consistency. Using mismatched dtypes can cause subtle numerical issues.

Why num_train_epochs=2

The mobile-actions dataset has ~9,650 examples. Two epochs means seeing each example twice - enough for the model to learn the patterns without overfitting (memorizing instead of generalizing). For smaller datasets, you might use 3-5 epochs.
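
To see how these pieces fit together, here is a hedged sketch of the typical TRL wiring. It builds on the snippets above: model and training_args are the objects defined earlier, and train_dataset stands in for the loaded mobile-actions split; the exact calls in train.py may differ.

from trl import SFTTrainer

# Sketch, not the literal contents of train.py: wire the config into TRL's
# SFTTrainer and save the final weights where Step 5 expects them.
trainer = SFTTrainer(
    model=model,                 # loaded as shown in "Model Loading" above
    args=training_args,          # the SFTConfig shown above
    train_dataset=train_dataset, # placeholder for the mobile-actions split
)
trainer.train()
trainer.save_model('./results/final_model')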

Troubleshooting

Common Issues

"CUDA out of memory"

Your GPU doesn't have enough VRAM. Try:

  1. Reduce per_device_train_batch_size in train.py
  2. Ensure gradient_checkpointing=True is set
  3. Close other GPU-using applications

"HF_TOKEN environment variable required"

You haven't set up your HuggingFace token:

  1. Copy .env.example to .env
  2. Add your token: HF_TOKEN=hf_xxxxx
  3. Ensure you've accepted the Gemma license

"GPU compute capability < 8.0"

Your GPU is too old. FunctionGemma training requires:

  • Ampere architecture (RTX 30xx series) or newer
  • Compute capability 8.0+

"nvidia-smi not found" inside container

NVIDIA Container Toolkit isn't properly installed:

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If this fails, reinstall the Container Toolkit.

"GPU not found" or "No CUDA GPUs are available"

This means PyTorch cannot detect your GPU:

  1. Verify your GPU is recognized by the host:

    nvidia-smi

    If this fails, your NVIDIA drivers are not installed or GPU is not connected properly.

  2. Verify Docker can access the GPU:

    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

    If this fails, install or reinstall the NVIDIA Container Toolkit.

  3. Check Docker Compose version:

    docker compose version

    GPU passthrough requires Docker Compose v2.x+. If you're on v1.x, upgrade Docker.

  4. Restart Docker daemon after driver installation:

    sudo systemctl restart docker

Training shows NaN loss values

This shouldn't happen with correct configuration, but if it does:

  1. Verify attn_implementation='eager' in train.py
  2. Verify torch_dtype=torch.bfloat16 in train.py
  3. Try reducing learning rate

Getting Help

  1. Check the Gemma Cookbook for reference implementation
  2. Visit HuggingFace Forums for ML-specific questions
  3. Open an issue in this repository for project-specific problems

Advanced Usage

Running Interactively

To enter the container and run commands manually:

docker compose run --rm training bash
python train.py

Modifying Training Parameters

Edit train.py to change:

  • Number of epochs
  • Learning rate
  • Batch size
  • Output directory

Resuming Interrupted Training

Training automatically saves checkpoints. If interrupted:

  1. Checkpoints are in ./results/checkpoint-xxx/
  2. Modify train.py to resume from a checkpoint using trainer.train(resume_from_checkpoint=True), as sketched below
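
A hedged sketch of that change (trainer is the SFTTrainer instance created in train.py; the explicit checkpoint path is only an example):

# Resume from the most recent checkpoint-xxx directory under output_dir:
trainer.train(resume_from_checkpoint=True)

# Or resume from a specific checkpoint (path shown is just an example):
# trainer.train(resume_from_checkpoint='./results/checkpoint-500')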

Using Your Finetuned Model

After training, load your model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('./results/final_model')
tokenizer = AutoTokenizer.from_pretrained('./results/final_model')

# Use for inference
inputs = tokenizer("Open the camera app", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
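
Note that instruction-tuned Gemma variants are normally prompted through the tokenizer's chat template rather than a raw string. A hedged sketch, assuming the saved tokenizer keeps FunctionGemma's chat template (inspect tokenizer.chat_template to confirm):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch assuming the exported tokenizer still carries the chat template.
model = AutoModelForCausalLM.from_pretrained('./results/final_model')
tokenizer = AutoTokenizer.from_pretrained('./results/final_model')

messages = [{"role": "user", "content": "Open the camera app"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))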

License

This project is for educational purposes. Model usage is subject to the Gemma License.
