A beginner-friendly, containerized environment for finetuning Google's FunctionGemma 270M model using the Mobile Actions dataset.
This project replicates the official Gemma Cookbook finetuning recipe within a Docker Compose infrastructure, making it reproducible and accessible.
FunctionGemma is a compact 270M parameter model designed for function calling - the ability to parse user requests and map them to specific tool/API calls. This project finetunes it on mobile device actions like opening apps, making calls, sending messages, etc.
Why this matters for beginners:
- Small model size (270M params) means faster training and lower GPU memory requirements
- Function calling is a practical, in-demand skill for AI applications
- Containerized setup eliminates "it works on my machine" problems
Before you begin, ensure your host machine has the following installed:
Docker Compose v2.x or later is required for GPU resource configuration.
# Check your versions
docker --version # Should be 20.10+
docker compose version # Should be v2.x+

Install Docker: https://docs.docker.com/get-docker/
You need a CUDA-capable GPU with compute capability 8.0+ (Ampere architecture or newer).
# Verify your GPU is recognized
nvidia-smi

Compatible GPUs include:
- NVIDIA A100, A10, A6000 (Data center)
- NVIDIA RTX 3080/3090/4080/4090 (Consumer)
- Minimum 16GB VRAM recommended
Install drivers: https://www.nvidia.com/drivers
This allows Docker containers to access your GPU.
# Verify installation
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If this fails, install the toolkit: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
You need a HuggingFace account with:
- An API token with read access
- Gemma license agreement accepted
Getting your token:
- Create account at https://huggingface.co
- Go to https://huggingface.co/settings/tokens
- Create a new token with Read access
- Accept the Gemma license at https://huggingface.co/google/functiongemma-270m-it
Get training running in just a few steps:
git clone <your-repo-url>
cd functiongemma

Copy the example environment file and add your HuggingFace token:
cp .env.example .env

Edit .env and replace the placeholder with your actual token:
HF_TOKEN=hf_your_actual_token_here

The training script supports optional environment variables to control caching behavior:
# Dataset caching (in .env file)
DATASET_MODE=cache # Use cached dataset if available (default)
DATASET_MODE=fresh # Force fresh download
# Model weights caching (in .env file)
WEIGHTS_MODE=cache # Use cached trained model if available (default)
WEIGHTS_MODE=fresh # Force retraining

These settings (sketched in code after the list below) allow you to:
- Skip re-downloading the dataset if it's already cached
- Skip re-training if a trained model already exists
- Force fresh operations when you want to start over
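How train.py interprets these modes is internal to the script; the following is a rough, hypothetical sketch of what such a check can look like. The helper name and cache paths below are made up for illustration and may not match the real implementation:

```python
import os

def use_cached(mode_var: str, cached_path: str) -> bool:
    """Hypothetical helper: reuse a cached artifact only when the mode is 'cache'
    (the default) and something already exists on disk at cached_path."""
    mode = os.environ.get(mode_var, "cache").strip().lower()
    return mode == "cache" and os.path.exists(cached_path)

# Illustrative usage with placeholder paths:
if use_cached("DATASET_MODE", "./cache/datasets"):
    print("Reusing cached dataset")
else:
    print("Downloading the dataset fresh")
```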
docker compose up --build

This will:
- Build the Docker container with all dependencies
- Download the FunctionGemma model and Mobile Actions dataset
- Start the finetuning process
First run note: Initial model and dataset downloads may take 5-10 minutes depending on your connection.
Training logs will stream to your terminal. You should see:
- GPU validation checks passing
- Model and dataset loading
- Training progress with decreasing loss values
To run in background mode:
docker compose up --build -d
docker compose logs -f training

After training completes, your finetuned model will be in the ./results directory:
./results/
├── checkpoint-xxx/        # Intermediate checkpoints
└── final_model/           # Final finetuned model directory
    ├── config.json        # Model configuration
    ├── model.safetensors  # Model weights
    └── tokenizer.json     # Tokenizer configuration
functiongemma/
├── docker-compose.yml # Container orchestration with GPU passthrough
├── Dockerfile # Training environment build instructions
├── train.py # Main finetuning script with all training logic
├── requirements.txt # Pinned Python dependencies for reproducibility
├── .env.example # Template for environment variables
├── README.md # This documentation file
├── results/ # Training outputs (created at runtime)
└── cache/ # HuggingFace cache (created at runtime)
| File | Purpose |
|---|---|
| `docker-compose.yml` | Defines the training service with GPU access, volume mounts, and environment variables |
| `Dockerfile` | Builds the container with Python, PyTorch, CUDA, and ML libraries |
| `train.py` | Loads the model, processes the dataset, configures training, and runs the training loop |
| `requirements.txt` | Lists exact versions of all Python packages for reproducible builds |
| `.env.example` | Shows required environment variables (copy to `.env` and fill in) |
This section explains the key configuration choices and why they matter. Understanding these settings will help you debug issues and adapt the configuration for other models.
model = AutoModelForCausalLM.from_pretrained(
'google/functiongemma-270m-it',
device_map='auto',
torch_dtype=torch.bfloat16, # Required for stable training
attn_implementation='eager' # Required to avoid NaN gradients
)

The Problem: FunctionGemma produces NaN (Not a Number) values when using optimized attention implementations like Flash Attention 2 or SDPA (Scaled Dot-Product Attention). NaN values corrupt the entire training run - once they appear, your model is effectively broken.
What causes this? These optimized attention mechanisms use memory-efficient kernels that make assumptions about tensor layouts and numerical precision. Some model architectures (like FunctionGemma's) don't align perfectly with these assumptions, leading to numerical instability during the attention computation.
The Solution: attn_implementation='eager' forces PyTorch to use the standard, mathematically explicit attention calculation. It's slower (~10-20% performance hit) but mathematically identical and numerically stable.
How to verify: If you see loss: nan in your training logs, this is almost always the cause. Switch to 'eager' mode immediately.
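If you would rather have the run stop on its own than keep burning GPU time once the loss is corrupted, a small optional callback can watch the logged loss. This is not part of the stock train.py; it is a sketch that uses the standard transformers TrainerCallback hook and can be passed to the trainer via `callbacks=[NanLossGuard()]`:

```python
import math
from transformers import TrainerCallback

class NanLossGuard(TrainerCallback):
    """Stop training as soon as the logged loss becomes NaN."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and math.isnan(loss):
            print("NaN loss detected - check attn_implementation and dtype settings.")
            control.should_training_stop = True
        return control
```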
The Problem: Using float16 (FP16) with FunctionGemma produces erratic, unstable model outputs. The model may appear to train but generates nonsensical text.
What causes this? FP16 has limited dynamic range (max ~65,504). When gradients or activations exceed this range, they overflow to infinity or underflow to zero. BFloat16 has the same range as FP32 (max ~3.4×10³⁸) but with reduced precision - a much better tradeoff for neural network training.
Why not FP32? Full 32-bit precision would work but doubles memory usage, meaning you couldn't fit the model on a 16GB GPU. BFloat16 gives you FP32's range with FP16's memory footprint.
Hardware requirement: BFloat16 requires compute capability 8.0+ (Ampere GPUs). Older GPUs like RTX 2080 don't support it natively.
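You can see the range difference for yourself with a two-line PyTorch check (purely illustrative, unrelated to the training code):

```python
import torch

x = torch.tensor(70000.0)        # larger than float16's max of ~65,504

print(x.to(torch.float16))       # tensor(inf, dtype=torch.float16)    -> overflow
print(x.to(torch.bfloat16))      # tensor(70144., dtype=torch.bfloat16) -> in range, coarser precision
```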
Setting device_map='auto' automatically places model layers on the available GPU(s). For a single-GPU setup, it simply puts everything on your GPU; for multi-GPU, it distributes layers across devices. This is simpler than manual device assignment and handles edge cases automatically.
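If you want to confirm where the layers ended up, the placement computed by accelerate is recorded on the loaded model. A quick check (on a typical single-GPU machine the map is simply `{'': 0}`):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'google/functiongemma-270m-it',
    device_map='auto',
    torch_dtype=torch.bfloat16,
    attn_implementation='eager',
)

# accelerate records the final layer placement here
print(model.hf_device_map)   # e.g. {'': 0} when everything fits on one GPU
```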
training_args = SFTConfig(
num_train_epochs=2,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
learning_rate=1e-5,
lr_scheduler_type='cosine',
gradient_checkpointing=True,
bf16=True,
)

These hyperparameters come from the official Gemma Cookbook recipe. Here's why each matters:
The Problem: During training, PyTorch saves all intermediate activations (outputs from each layer) for the backward pass. For a 270M parameter model, this can easily consume 12+ GB of VRAM on top of the model weights.
The Solution: Gradient checkpointing discards intermediate activations during the forward pass and recomputes them during the backward pass. This trades compute time (~20-30% slower) for memory savings (~40-60% reduction).
When to disable: Only if you have abundant VRAM (32GB+) and want maximum training speed.
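To measure the savings on your own hardware, PyTorch's allocator statistics give a quick read on peak VRAM. This is a generic diagnostic you can wrap around a short training run, not something train.py does out of the box:

```python
import torch

torch.cuda.reset_peak_memory_stats()

# ... run a few training steps here (e.g. a short trainer.train() on a small subset) ...

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory during training: {peak_gb:.2f} GB")
```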
The Problem: Large batch sizes improve training stability but require more VRAM. A batch size of 32 might not fit in memory.
The Solution: Process 4 samples at a time (per_device_train_batch_size=4), but accumulate gradients over 8 such mini-batches before updating weights. This gives you an effective batch size of 32 (4 × 8) while only needing VRAM for 4 samples.
The math: effective_batch_size = per_device_train_batch_size × gradient_accumulation_steps × num_gpus
Finetuning uses much smaller learning rates than training from scratch:
- From scratch: ~1e-3 to 1e-4 (the model knows nothing, big steps are fine)
- Finetuning: ~1e-5 to 1e-6 (the model already knows a lot, small adjustments only)
A learning rate of 1e-5 is conservative - it will converge reliably but may take more epochs. If training is too slow, try 2e-5 or 3e-5.
The learning rate decreases following a cosine curve - fast at first, then gradually slower. This helps the model "lock in" its learning as training progresses. The alternative (linear decay) works too, but cosine often produces slightly better results.
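To get a feel for the shape of the decay, you can build the same kind of schedule standalone with transformers' scheduler helper and print the learning rate at a few steps. This is a toy illustration with an arbitrary 600-step horizon; SFTTrainer constructs its own scheduler internally from lr_scheduler_type:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Toy parameter/optimizer just to visualize the schedule shape (not the real training loop)
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=600
)

for step in range(600):
    optimizer.step()
    scheduler.step()
    if step % 150 == 0:
        print(f"step {step:3d}  lr {scheduler.get_last_lr()[0]:.2e}")
```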
bf16=True enables bfloat16 mixed precision throughout the training loop (not just at model loading). This matches the torch_dtype=torch.bfloat16 setting above and keeps the dtypes consistent; mismatched dtypes can cause subtle numerical issues.
The mobile-actions dataset has ~9,650 examples. Two epochs means seeing each example twice - enough for the model to learn the patterns without overfitting (memorizing instead of generalizing). For smaller datasets, you might use 3-5 epochs.
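Put together with the batch settings above, you can estimate how many optimizer updates a run takes (rough numbers, assuming a single GPU and the ~9,650-example dataset):

```python
dataset_size = 9650                  # approximate Mobile Actions example count
effective_batch_size = 4 * 8 * 1     # batch size x grad accumulation x num GPUs
steps_per_epoch = dataset_size // effective_batch_size
total_steps = steps_per_epoch * 2    # num_train_epochs=2

print(steps_per_epoch, total_steps)  # roughly 301 updates per epoch, ~602 in total
```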
Your GPU doesn't have enough VRAM. Try:
- Reduce `per_device_train_batch_size` in `train.py`
- Ensure `gradient_checkpointing=True` is set
- Close other GPU-using applications
You haven't set up your HuggingFace token:
- Copy `.env.example` to `.env`
- Add your token: `HF_TOKEN=hf_xxxxx`
- Ensure you've accepted the Gemma license
Your GPU is too old. FunctionGemma training requires:
- Ampere architecture (RTX 30xx series) or newer
- Compute capability 8.0+
NVIDIA Container Toolkit isn't properly installed:
# Test GPU access
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If this fails, reinstall the Container Toolkit.
This means PyTorch cannot detect your GPU:
- Verify your GPU is recognized by the host:

  nvidia-smi

  If this fails, your NVIDIA drivers are not installed or the GPU is not connected properly.

- Verify Docker can access the GPU:

  docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

  If this fails, install or reinstall the NVIDIA Container Toolkit.

- Check your Docker Compose version:

  docker compose version

  GPU passthrough requires Docker Compose v2.x+. If you're on v1.x, upgrade Docker.

- Restart the Docker daemon after driver installation:

  sudo systemctl restart docker
This shouldn't happen with correct configuration, but if it does:
- Verify `attn_implementation='eager'` in `train.py`
- Verify `torch_dtype=torch.bfloat16` in `train.py`
- Try reducing the learning rate
- Check the Gemma Cookbook for reference implementation
- Visit HuggingFace Forums for ML-specific questions
- Open an issue in this repository for project-specific problems
To enter the container and run commands manually:
docker compose run --rm training bash
python train.py

Edit train.py to change the following (an example with illustrative values follows this list):
- Number of epochs
- Learning rate
- Batch size
- Output directory
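As a concrete sketch, the corresponding fields in the SFTConfig shown earlier might be adjusted like this. The values below are purely illustrative, not recommendations, and the output_dir path is an assumption about how the project maps results to ./results:

```python
from trl import SFTConfig

training_args = SFTConfig(
    output_dir='./results',          # where checkpoints and the final model are written
    num_train_epochs=3,              # more passes over the dataset
    per_device_train_batch_size=2,   # lower this if you hit out-of-memory errors
    gradient_accumulation_steps=16,  # keeps the effective batch size at 32
    learning_rate=2e-5,              # slightly more aggressive than the default 1e-5
    lr_scheduler_type='cosine',
    gradient_checkpointing=True,
    bf16=True,
)
```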
Training automatically saves checkpoints. If interrupted:
- Checkpoints are in `./results/checkpoint-xxx/`
- Modify `train.py` to resume from a checkpoint using `trainer.train(resume_from_checkpoint=True)` (see the sketch below)
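A minimal way to do that, assuming train.py ends with a plain trainer.train() call:

```python
# In train.py, change the final call to resume from the most recent checkpoint
# in the output directory:
trainer.train(resume_from_checkpoint=True)

# Or resume from a specific checkpoint (the path below is only an example):
# trainer.train(resume_from_checkpoint="./results/checkpoint-300")
```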
After training, load your model:
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained('./results/final_model')
tokenizer = AutoTokenizer.from_pretrained('./results/final_model')
# Use for inference
inputs = tokenizer("Open the camera app", return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0]))

- FunctionGemma Model Card
- Mobile Actions Dataset
- Gemma Cookbook Finetuning Notebook
- TRL SFTTrainer Documentation
This project is for educational purposes. Model usage is subject to the Gemma License.