FunctionGemma 270M Finetuning Project

A beginner-friendly, containerized environment for finetuning Google's FunctionGemma 270M model using the Mobile Actions dataset.

This project replicates the official Gemma Cookbook finetuning recipe within a Docker Compose infrastructure, making it reproducible and accessible.

What is FunctionGemma?

FunctionGemma is a compact 270M parameter model designed for function calling - the ability to parse user requests and map them to specific tool/API calls. This project finetunes it on mobile device actions like opening apps, making calls, sending messages, etc.

Why this matters for beginners:

  • Small model size (270M params) means faster training and lower GPU memory requirements
  • Function calling is a practical, in-demand skill for AI applications
  • Containerized setup eliminates "it works on my machine" problems

Prerequisites

Before you begin, ensure your host machine has the following installed:

1. Docker and Docker Compose

Docker Compose v2.x or later is required for GPU resource configuration.

# Check your versions
docker --version          # Should be 20.10+
docker compose version    # Should be v2.x+

Install Docker: https://docs.docker.com/get-docker/

2. NVIDIA GPU Driver

You need a CUDA-capable GPU with compute capability 8.0+ (Ampere architecture or newer).

# Verify your GPU is recognized
nvidia-smi

Compatible GPUs include:

  • NVIDIA A100, A10, A6000 (Data center)
  • NVIDIA RTX 3080/3090/4080/4090 (Consumer)
  • Minimum 16GB VRAM recommended

Install drivers: https://www.nvidia.com/drivers

3. NVIDIA Container Toolkit

This allows Docker containers to access your GPU.

# Verify installation
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If this fails, install the toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

4. HuggingFace Account and Token

You need a HuggingFace account with:

  1. An API token with read access
  2. Gemma license agreement accepted

Getting your token:

  1. Create account at https://huggingface.co
  2. Go to https://huggingface.co/settings/tokens
  3. Create a new token with Read access
  4. Accept the Gemma license at https://huggingface.co/google/functiongemma-270m-it

Quick Start

Get training running in just a few steps:

Step 1: Clone and Enter the Project

git clone <your-repo-url>
cd functiongemma

Step 2: Configure Environment

Copy the example environment file and add your HuggingFace token:

cp .env.example .env

Edit .env and replace the placeholder with your actual token:

HF_TOKEN=hf_your_actual_token_here

Optional: Cache Control Settings

The training script supports optional environment variables to control caching behavior:

# Dataset caching (in .env file)
DATASET_MODE=cache    # Use cached dataset if available (default)
DATASET_MODE=fresh    # Force fresh download

# Model weights caching (in .env file)
WEIGHTS_MODE=cache    # Use cached trained model if available (default)
WEIGHTS_MODE=fresh    # Force retraining

These settings allow you to:

  • Skip re-downloading the dataset if it's already cached
  • Skip re-training if a trained model already exists
  • Force fresh operations when you want to start over
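
For reference, here is a minimal sketch of how a training script can honor these flags. Only the DATASET_MODE and WEIGHTS_MODE variable names come from this project; the rest is illustrative, so check train.py for the actual logic.

import os

# Illustrative only: read the cache-control flags with "cache" as the default,
# matching the behavior described above.
DATASET_MODE = os.environ.get("DATASET_MODE", "cache")   # "cache" or "fresh"
WEIGHTS_MODE = os.environ.get("WEIGHTS_MODE", "cache")   # "cache" or "fresh"

force_dataset_download = DATASET_MODE == "fresh"
force_retraining = WEIGHTS_MODE == "fresh" or not os.path.isdir("./results/final_model")

print(f"Force dataset re-download: {force_dataset_download}")
print(f"Run training from scratch: {force_retraining}")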

Step 3: Build and Run Training

docker compose up --build

This will:

  1. Build the Docker container with all dependencies
  2. Download the FunctionGemma model and Mobile Actions dataset
  3. Start the finetuning process

First run note: Initial model and dataset downloads may take 5-10 minutes depending on your connection.

Step 4: Monitor Progress

Training logs will stream to your terminal. You should see:

  • GPU validation checks passing
  • Model and dataset loading
  • Training progress with decreasing loss values

To run in background mode:

docker compose up --build -d
docker compose logs -f training

Step 5: Find Your Results

After training completes, your finetuned model will be in the ./results directory:

./results/
  checkpoint-xxx/     # Intermediate checkpoints
  final_model/        # Final finetuned model directory
    config.json       # Model configuration
    model.safetensors # Model weights
    tokenizer.json    # Tokenizer configuration

Project Structure

functiongemma/
├── docker-compose.yml    # Container orchestration with GPU passthrough
├── Dockerfile            # Training environment build instructions
├── train.py              # Main finetuning script with all training logic
├── requirements.txt      # Pinned Python dependencies for reproducibility
├── .env.example          # Template for environment variables
├── README.md             # This documentation file
├── results/              # Training outputs (created at runtime)
└── cache/                # HuggingFace cache (created at runtime)

What Each File Does

File                  Purpose
docker-compose.yml    Defines the training service with GPU access, volume mounts, and environment variables
Dockerfile            Builds the container with Python, PyTorch, CUDA, and ML libraries
train.py              Loads the model, processes the dataset, configures training, and runs the training loop
requirements.txt      Lists exact versions of all Python packages for reproducible builds
.env.example          Shows required environment variables (copy to .env and fill in)

Understanding the Training Configuration

This section explains the key configuration choices and why they matter. Understanding these settings will help you debug issues and adapt the configuration for other models.

Model Loading (Critical Settings)

model = AutoModelForCausalLM.from_pretrained(
    'google/functiongemma-270m-it',
    device_map='auto',
    torch_dtype=torch.bfloat16,        # Required for stable training
    attn_implementation='eager'         # Required to avoid NaN gradients
)

Why attn_implementation='eager' is MANDATORY

The Problem: FunctionGemma produces NaN (Not a Number) values when using optimized attention implementations like Flash Attention 2 or SDPA (Scaled Dot-Product Attention). NaN values corrupt the entire training run - once they appear, your model is effectively broken.

What causes this? These optimized attention mechanisms use memory-efficient kernels that make assumptions about tensor layouts and numerical precision. Some model architectures (like FunctionGemma's) don't align perfectly with these assumptions, leading to numerical instability during the attention computation.

The Solution: attn_implementation='eager' forces PyTorch to use the standard, mathematically explicit attention calculation. It's slower (roughly a 10-20% performance hit), but it computes the same attention operation as the optimized kernels while staying numerically stable.

How to verify: If you see loss: nan in your training logs, this is almost always the cause. Switch to 'eager' mode immediately.
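
If you prefer to fail fast instead of scanning logs, a small guard like the following (not part of train.py, just an illustration) stops the run the moment the loss becomes non-finite:

import torch

# Illustration only: abort as soon as the loss stops being finite, before NaN
# gradients propagate into the weights.
def assert_finite_loss(loss: torch.Tensor, step: int) -> None:
    if not torch.isfinite(loss).all():
        raise RuntimeError(
            f"Non-finite loss at step {step}. Check attn_implementation='eager' "
            "and torch_dtype=torch.bfloat16 in train.py."
        )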

Why torch_dtype=torch.bfloat16 is MANDATORY

The Problem: Using float16 (FP16) with FunctionGemma produces erratic, unstable model outputs. The model may appear to train but generates nonsensical text.

What causes this? FP16 has limited dynamic range (max ~65,504). When gradients or activations exceed this range, they overflow to infinity or underflow to zero. BFloat16 has the same range as FP32 (max ~3.4×10³⁸) but with reduced precision - a much better tradeoff for neural network training.

Why not FP32? Full 32-bit precision would also be numerically stable, but it roughly doubles the memory needed for weights, gradients, activations, and optimizer state, leaving little headroom on a 16GB GPU. BFloat16 gives you FP32's range with FP16's memory footprint.

Hardware requirement: BFloat16 requires compute capability 8.0+ (Ampere GPUs). Older GPUs like RTX 2080 don't support it natively.
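
You can confirm this before launching a long run with a couple of standard PyTorch calls:

import torch

# Standard PyTorch checks: bfloat16 training needs compute capability 8.0+.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA GPU visible to PyTorch.")

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
if (major, minor) < (8, 0):
    raise SystemExit("GPU is older than Ampere; native bfloat16 is not supported.")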

Why device_map='auto'

This automatically places model layers on available GPU(s). For single-GPU setups, it just puts everything on your GPU. For multi-GPU, it distributes layers across devices. This is simpler than manual device assignment and handles edge cases automatically.
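
If you want to confirm where the layers ended up, recent Transformers/Accelerate versions record the placement on the model after loading with a device_map; this standalone snippet (an assumption about your environment, not project code) prints it:

import torch
from transformers import AutoModelForCausalLM

# Assumption: a recent Transformers/Accelerate install. After loading with
# device_map='auto', the chosen placement is stored on the model.
model = AutoModelForCausalLM.from_pretrained(
    'google/functiongemma-270m-it',
    device_map='auto',
    torch_dtype=torch.bfloat16,
    attn_implementation='eager',
)
print(model.hf_device_map)   # e.g. {'': 0} on a single-GPU machine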

Training Hyperparameters

training_args = SFTConfig(
    num_train_epochs=2,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    lr_scheduler_type='cosine',
    gradient_checkpointing=True,
    bf16=True,
)

These hyperparameters come from the official Gemma Cookbook recipe. Here's why each matters:

Why gradient_checkpointing=True is CRITICAL for 16GB GPUs

The Problem: During training, PyTorch saves all intermediate activations (outputs from each layer) for the backward pass. For a 270M parameter model, this can easily consume 12+ GB of VRAM on top of the model weights.

The Solution: Gradient checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass. This trades compute time (~20-30% slower) for memory savings (~40-60% reduction).

When to disable: Only if you have abundant VRAM (32GB+) and want maximum training speed.

Why gradient_accumulation_steps=8

The Problem: Large batch sizes improve training stability but require more VRAM. A batch size of 32 might not fit in memory.

The Solution: Process 4 samples at a time (per_device_train_batch_size=4), but accumulate gradients over 8 such mini-batches before updating weights. This gives you an effective batch size of 32 (4 × 8) while only needing VRAM for 4 samples.

The math: effective_batch_size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus
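
With this project's values on a single GPU, that works out to:

# Worked example using the recipe's values (single GPU):
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_gpus = 1

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch_size)   # 32 samples contribute to each optimizer update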

Why learning_rate=1e-5 (0.00001)

Finetuning uses much smaller learning rates than training from scratch:

  • From scratch: ~1e-3 to 1e-4 (the model knows nothing, big steps are fine)
  • Finetuning: ~1e-5 to 1e-6 (the model already knows a lot, small adjustments only)

A learning rate of 1e-5 is conservative - it will converge reliably but may take more epochs. If training is too slow, try 2e-5 or 3e-5.

Why lr_scheduler_type='cosine'

The learning rate decreases along a cosine curve - slowly at first, fastest in the middle of training, then flattening out near the end. This helps the model "lock in" its learning as training progresses. The alternative (linear decay) works too, but cosine often produces slightly better results.

Why bf16=True

Enables bfloat16 mixed precision training throughout the training loop (not just model loading). This matches our torch_dtype=torch.bfloat16 setting and ensures consistency. Using mismatched dtypes can cause subtle numerical issues.

Why num_train_epochs=2

The mobile-actions dataset has ~9,650 examples. Two epochs means seeing each example twice - enough for the model to learn the patterns without overfitting (memorizing instead of generalizing). For smaller datasets, you might use 3-5 epochs.
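
To see how these pieces fit together, here is a hedged sketch of the typical TRL wiring. It builds on the snippets above: model and training_args are the objects defined earlier, and train_dataset stands in for the loaded mobile-actions split; the exact calls in train.py may differ.

from trl import SFTTrainer

# Sketch, not the literal contents of train.py: wire the config into TRL's
# SFTTrainer and save the final weights where Step 5 expects them.
trainer = SFTTrainer(
    model=model,                 # loaded as shown in "Model Loading" above
    args=training_args,          # the SFTConfig shown above
    train_dataset=train_dataset, # placeholder for the mobile-actions split
)
trainer.train()
trainer.save_model('./results/final_model')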

Troubleshooting

Common Issues

"CUDA out of memory"

Your GPU doesn't have enough VRAM. Try:

  1. Reduce per_device_train_batch_size in train.py
  2. Ensure gradient_checkpointing=True is set
  3. Close other GPU-using applications

"HF_TOKEN environment variable required"

You haven't set up your HuggingFace token:

  1. Copy .env.example to .env
  2. Add your token: HF_TOKEN=hf_xxxxx
  3. Ensure you've accepted the Gemma license

"GPU compute capability < 8.0"

Your GPU is too old. FunctionGemma training requires:

  • Ampere architecture (RTX 30xx series) or newer
  • Compute capability 8.0+

"nvidia-smi not found" inside container

NVIDIA Container Toolkit isn't properly installed:

# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If this fails, reinstall the Container Toolkit.

"GPU not found" or "No CUDA GPUs are available"

This means PyTorch cannot detect your GPU:

  1. Verify your GPU is recognized by the host:

    nvidia-smi

    If this fails, your NVIDIA drivers are not installed or GPU is not connected properly.

  2. Verify Docker can access the GPU:

    docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

    If this fails, install or reinstall the NVIDIA Container Toolkit.

  3. Check Docker Compose version:

    docker compose version

    GPU passthrough requires Docker Compose v2.x+. If you're on v1.x, upgrade Docker.

  4. Restart Docker daemon after driver installation:

    sudo systemctl restart docker

Training shows NaN loss values

This shouldn't happen with correct configuration, but if it does:

  1. Verify attn_implementation='eager' in train.py
  2. Verify torch_dtype=torch.bfloat16 in train.py
  3. Try reducing learning rate

Getting Help

  1. Check the Gemma Cookbook for reference implementation
  2. Visit HuggingFace Forums for ML-specific questions
  3. Open an issue in this repository for project-specific problems

Advanced Usage

Running Interactively

To enter the container and run commands manually:

docker compose run --rm training bash
python train.py

Modifying Training Parameters

Edit train.py to change:

  • Number of epochs
  • Learning rate
  • Batch size
  • Output directory

Resuming Interrupted Training

Training automatically saves checkpoints. If interrupted:

  1. Checkpoints are in ./results/checkpoint-xxx/
  2. Modify train.py to resume from a checkpoint using trainer.train(resume_from_checkpoint=True), as sketched below
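
A hedged sketch of that change (trainer is the SFTTrainer instance created in train.py; the explicit checkpoint path is only an example):

# Resume from the most recent checkpoint-xxx directory under output_dir:
trainer.train(resume_from_checkpoint=True)

# Or resume from a specific checkpoint (path shown is just an example):
# trainer.train(resume_from_checkpoint='./results/checkpoint-500')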

Using Your Finetuned Model

After training, load your model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('./results/final_model')
tokenizer = AutoTokenizer.from_pretrained('./results/final_model')

# Use for inference
inputs = tokenizer("Open the camera app", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))
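
Note that instruction-tuned Gemma variants are normally prompted through the tokenizer's chat template rather than a raw string. A hedged sketch, assuming the saved tokenizer keeps FunctionGemma's chat template (inspect tokenizer.chat_template to confirm):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Sketch assuming the exported tokenizer still carries the chat template.
model = AutoModelForCausalLM.from_pretrained('./results/final_model')
tokenizer = AutoTokenizer.from_pretrained('./results/final_model')

messages = [{"role": "user", "content": "Open the camera app"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))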

License

This project is for educational purposes. Model usage is subject to the Gemma License.
