50 changes: 1 addition & 49 deletions README.md
@@ -5,53 +5,5 @@
- Model Behavior-Level Simulation
- Hardware-Performance Simulation

**🔖 For tutorials and examples, please refer to [this site](https://aicrosssim.github.io/NewComputeBench/)**.
**🔖 For milestones, tutorials and examples, please refer to [this site](https://aicrosssim.github.io/NewComputeBench/)**.

## Model Training

### LLMs

We adopt Llama-3 architecture and aim to support the following features:

- Pretraining
- Generation (inference)
- Parameter-efficient fine-tuning
- `🚧 TODO` `🐌 LowPriority`: Supervised-fine-tuning
- Evaluation

#### PreTraining

The LLM pretraining is built on top of [torchtitan](https://github.com/pytorch/torchtitan).

- Model architecture: [`Llama3`](/src/torchtitan/models/llama/model.py)
- Model configs: [`60M`, `200M`, `400M`, `1.1B`](src/aixsim_models/llm/model_flavors.py)
- Datasets: [`HuggingFaceFW/fineweb`](/src/aixsim_models/llm/pretrain_data.py)
- HuggingFace checkpoints: [AICrossSim](https://huggingface.co/AICrossSim)

#### Generation

We recommend using the HuggingFace Transformers library for generation tasks.
We provide a script to convert the torchtitan checkpoint to a HuggingFace checkpoint (See [this file](/experiments/llm-digital/pretrain/README.md)).


#### Parameter-Efficient Fine-tuning
- For models larger than 1.1B, we fine-tune pretrained checkpoints.
- LoRA fine-tuning data
- LoRA fine-tuning scripts

## Model Behavior Simulation

- [Random bitflip](/experiments/llm-bitflip/)
- Post-training bitflip transform
- Bitflip-aware pretraining
- Optical compute
- [Roberta on GLUE](/experiments/roberta-optical-transformer/)
- CLM `🚧 WIP`

- Spiking neural networks `🚧 TODO`
- In-memory compute `🚧 TODO`

## Hardware-Performance Simulation

`🚧 TODO`
222 changes: 222 additions & 0 deletions docs/02-model-behaviour-level-simulation/clm-bitflip-lora-finetune.md
@@ -0,0 +1,222 @@
# Bitflip-Aware LoRA Fine-Tuning

This tutorial walks through how to run bitflip-aware LoRA fine-tuning on a pretrained LLM (e.g., `unsloth/Llama-3.1-8B`) using our custom training script.

## Overview

Bitflip-aware LoRA fine-tuning combines two ideas:

1. **Random Bitflip Simulation** — During the forward pass, random bit flips are injected into both activations and weights of every linear layer (except `lm_head`). This emulates hardware-level bit errors that occur in approximate or unreliable compute substrates.
2. **Low-Rank Adaptation (LoRA)** — Instead of fine-tuning all parameters, we attach small low-rank matrices (`lora_A`, `lora_B`) to each linear layer and only train those. The original pretrained weights are frozen.

By fine-tuning with bitflip noise injected during training, the LoRA adapters learn to compensate for hardware-induced errors, making the model more resilient at inference time.
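The parameter savings from LoRA can be sketched with a quick calculation (illustrative only; the exact trainable fraction depends on which layers are adapted and their shapes):

```python
def lora_param_fraction(d_in: int, d_out: int, r: int) -> float:
    """Fraction of a d_in x d_out linear layer's parameters that LoRA trains.

    LoRA adds A (r x d_in) and B (d_out x r), so the trainable count is
    r * (d_in + d_out) against d_in * d_out frozen weights.
    """
    return r * (d_in + d_out) / (d_in * d_out)

# For a square 4096 x 4096 projection with rank 32:
print(lora_param_fraction(4096, 4096, 32))  # 0.015625, i.e. ~1.6%
```

Rectangular MLP projections train an even smaller fraction, which is how whole-model figures of roughly 1% trainable parameters arise.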

### How It Works

Each `nn.Linear` layer in the model is replaced by a [`BitFlipLinearLora`](https://github.com/AICrossSim/NewComputeBench/blob/master/src/aixsim_models/bitflip/fine_tune/bitflip_lora.py) layer. The forward pass of `BitFlipLinearLora` performs the following:

```
Y = bitflip(X) @ bitflip(W + B @ A * scaling)^T + bias
```

where:

- `X` is the input activation (with optional bitflip noise).
- `W` is the frozen pretrained weight.
- `A` (`lora_A`) and `B` (`lora_B`) are the trainable low-rank matrices.
- `scaling = lora_alpha / r` controls the magnitude of the LoRA update.
- `bitflip(·)` applies random bit flips to the sign-exponent and mantissa bits of the FP32 representation, controlled by per-component probabilities.
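To make `bitflip(·)` concrete, here is a minimal pure-Python illustration of flipping a single bit in an FP32 value's IEEE-754 encoding (the real kernel is a stochastic, per-element Triton implementation in mase-triton; only the bit layout below is standard):

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the IEEE-754 float32 encoding of x.

    Bit 31 is the sign, bits 30-23 the exponent, bits 22-0 the mantissa --
    the same sign-exponent / mantissa split that `*_p_exp` and `*_p_frac`
    control independently.
    """
    (i,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", i ^ (1 << bit)))
    return y

print(flip_bit(1.0, 22))  # 1.5 -- top mantissa bit adds 2^-1
print(flip_bit(1.0, 30))  # inf -- flipping a high exponent bit is catastrophic
```

The exponent example shows why the `*_zero_out_t` thresholds exist: a single exponent-bit flip can blow a value up to infinity, so outliers and NaNs are zeroed out after injection.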

The model transformation is handled by the [`transform_llama`](https://github.com/AICrossSim/NewComputeBench/blob/master/src/aixsim_models/bitflip/fine_tune/bitflip_llama.py) function, which iterates over all `nn.Linear` modules in the model (excluding `lm_head`) and replaces them with `BitFlipLinearLora`.
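The traversal pattern can be sketched with a dict-based stand-in (hypothetical; the real `transform_llama` walks `nn.Module` children and checks `isinstance(child, nn.Linear)`):

```python
def transform(tree: dict, wrap) -> dict:
    """Recursively replace 'Linear' leaves with wrap(name), skipping lm_head.

    A toy stand-in for the nn.Module traversal in transform_llama: nested
    dicts play the role of submodules, string leaves the role of layers.
    """
    out = {}
    for name, child in tree.items():
        if isinstance(child, dict):
            out[name] = transform(child, wrap)
        elif child == "Linear" and name != "lm_head":
            out[name] = wrap(name)
        else:
            out[name] = child
    return out

model = {
    "layers": {"0": {"q_proj": "Linear", "gate_proj": "Linear"}},
    "lm_head": "Linear",
}
print(transform(model, lambda name: "BitFlipLinearLora"))
```

Note that `lm_head` survives untouched while every other linear layer is wrapped.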

### Entry Points

| File | Description |
|------|-------------|
| [`experiments/llm-bitflip/lora_finetune/run_clm_no_trainer.py`](https://github.com/AICrossSim/NewComputeBench/blob/master/experiments/llm-bitflip/lora_finetune/run_clm_no_trainer.py) | Main training script (HuggingFace Accelerate-based, no Trainer) |
| [`experiments/llm-bitflip/lora_finetune/fine-tune-bitflip-clm.sh`](https://github.com/AICrossSim/NewComputeBench/blob/master/experiments/llm-bitflip/lora_finetune/fine-tune-bitflip-clm.sh) | Shell wrapper that computes training steps and launches the run |
| [`experiments/llm-bitflip/lora_finetune/transform_cfg.toml`](https://github.com/AICrossSim/NewComputeBench/blob/master/experiments/llm-bitflip/lora_finetune/transform_cfg.toml) | Bitflip + LoRA configuration file |

## Step-by-Step Guide

!!! info "Environment Setup"

If you have not set up environments, please follow the guidelines in [Environment Setup](../env-setup.md).

### 1. Configure the Bitflip & LoRA Transform

The transform configuration is defined in a TOML file. Here is the default configuration at [`experiments/llm-bitflip/lora_finetune/transform_cfg.toml`](https://github.com/AICrossSim/NewComputeBench/blob/master/experiments/llm-bitflip/lora_finetune/transform_cfg.toml):

```toml
use_lora = true

[fc]
w_p_exp = 1.52587890625e-05
w_p_frac = 1.52587890625e-05
w_zero_out_t = 1.25
x_p_exp = 1.52587890625e-05
x_p_frac = 1.52587890625e-05
x_zero_out_t = 30.0

[lora]
r = 32
lora_alpha = 32
```

**Configuration parameters:**

| Section | Parameter | Description |
|---------|-----------|-------------|
| (top-level) | `use_lora` | Enable LoRA adaptation (`true`/`false`). When `false`, all parameters are trained. |
| `[fc]` | `w_p_exp` | Bitflip probability for the sign-exponent bits of the **weight**. |
| `[fc]` | `w_p_frac` | Bitflip probability for the mantissa bits of the **weight**. |
| `[fc]` | `w_zero_out_t` | Threshold for zeroing out weight outliers / NaN values. |
| `[fc]` | `x_p_exp` | Bitflip probability for the sign-exponent bits of the **activation**. |
| `[fc]` | `x_p_frac` | Bitflip probability for the mantissa bits of the **activation**. |
| `[fc]` | `x_zero_out_t` | Threshold for zeroing out activation outliers / NaN values. |
| `[lora]` | `r` | LoRA rank. |
| `[lora]` | `lora_alpha` | LoRA scaling factor (effective scaling = `lora_alpha / r`). |

!!! note "Bitflip probability"
The bitflip probability must be a power of 0.5 (e.g., `0.5^16 ≈ 1.526e-05`). The kernel automatically snaps to the nearest valid value. Due to limitations of the Philox PRNG, the minimum supported probability is `0.5^24 ≈ 5.96e-08`. See the [mase-triton docs](../02-model-behaviour-level-simulation/mase-triton.md) for more details.
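Snapping an arbitrary probability to the nearest power of 0.5 can be sketched as follows (the actual snapping happens inside the kernel, and whether it rounds to nearest or truncates is an assumption here; the clamp at `0.5^24` mirrors the Philox limit above):

```python
import math

def snap_bitflip_prob(p: float, max_exp: int = 24) -> float:
    """Snap p to the nearest power of 0.5, clamped to [0.5**max_exp, 0.5]."""
    if p <= 0:
        return 0.0
    exp = round(math.log2(1.0 / p))
    exp = min(max(exp, 1), max_exp)
    return 0.5 ** exp

print(snap_bitflip_prob(1.5e-5))  # 1.52587890625e-05 (= 0.5**16)
print(snap_bitflip_prob(1e-12))   # clamped to 0.5**24 ~= 5.96e-08
```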

### 2. Understand the Training Budget

The shell script [`fine-tune-bitflip-clm.sh`](https://github.com/AICrossSim/NewComputeBench/blob/master/experiments/llm-bitflip/lora_finetune/fine-tune-bitflip-clm.sh) automatically calculates the number of training steps based on a budget of **1% of the model's parameter count in tokens**. For `unsloth/Llama-3.1-8B` (8B parameters):

```
fine-tune tokens = 8,000,000,000 / 100 = 80,000,000 tokens
tokens per step = num_gpus × per_device_batch_size × block_size
max_train_steps = fine-tune tokens / tokens per step
```

For example, with 8 GPUs, batch size 1, and block size 2048:

```
tokens per step = 8 × 1 × 2048 = 16,384
max_train_steps = 80,000,000 / 16,384 ≈ 4,883 steps
```
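The arithmetic above can be wrapped in a small helper (a sketch of the shell script's integer math; rounding up to a whole step is an assumption, consistent with the 4883 steps used in the documented run):

```python
import math

def max_train_steps(n_params: int, num_gpus: int, batch_size: int,
                    block_size: int, budget_frac: float = 0.01) -> int:
    """Optimizer steps needed to consume a token budget of budget_frac * n_params."""
    budget_tokens = n_params * budget_frac
    tokens_per_step = num_gpus * batch_size * block_size
    return math.ceil(budget_tokens / tokens_per_step)

# 8B params, 8 GPUs, per-device batch size 1, block size 2048:
print(max_train_steps(8_000_000_000, 8, 1, 2048))  # 4883
```

Halving the GPU count doubles the step count while keeping the total token budget fixed, which is exactly what the shell script recomputes when you change `num_processes`.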

### 3. Launch the Fine-Tuning

```bash
cd experiments/llm-bitflip/lora_finetune
```

The script accepts positional arguments to override defaults:

```bash
./fine-tune-bitflip-clm.sh [num_processes] [model_name_or_path] [per_device_train_batch_size] [learning_rate] [weight_decay] [gradient_accumulation_steps] [block_size]
```

**Example: Fine-tune Llama-3.1-8B on 8 GPUs with default settings**

```bash
./fine-tune-bitflip-clm.sh 8 unsloth/Llama-3.1-8B 1 1e-5 0.01 2 2048
```

This is equivalent to running the underlying command directly:

```bash
uv run accelerate launch --num_processes=8 \
run_clm_no_trainer.py \
--model_name_or_path unsloth/Llama-3.1-8B \
--dataset_name Cheng98/fineweb-edu-1.25B \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--learning_rate 1e-5 \
--weight_decay 0.01 \
--num_train_epochs 1 \
--gradient_accumulation_steps 2 \
--lr_scheduler_type linear \
--output_dir ./output/Llama-3.1-8B-bitflip-lora \
--preprocessing_num_workers 32 \
--trust_remote_code \
--with_tracking \
--report_to wandb \
--transform_cfg ./transform_cfg.toml \
--block_size 2048 \
--log_train_loss_steps 50 \
--max_train_steps 4883 \
--wandb_tags unsloth/Llama-3.1-8B,lr1e-5,steps4883
```

**Key arguments:**

| Argument | Description |
|----------|-------------|
| `--model_name_or_path` | HuggingFace model identifier or local path. |
| `--dataset_name` | Training dataset. We use a 1.25B-token subset of [FineWeb-Edu](https://huggingface.co/datasets/Cheng98/fineweb-edu-1.25B). |
| `--transform_cfg` | Path to the TOML config for bitflip + LoRA. |
| `--block_size` | Context length for training samples. |
| `--log_train_loss_steps` | Log training loss to W&B every N steps. |
| `--max_train_steps` | Total number of optimizer steps (auto-calculated by the shell script). |

!!! tip "Adjusting GPU count"
The first argument to `fine-tune-bitflip-clm.sh` controls `--num_processes` for `accelerate launch`. The script automatically recalculates `max_train_steps` to maintain the same total token budget regardless of the number of GPUs.

### 4. Monitor Training

If you have W&B set up (`wandb login`), training loss and validation perplexity are logged automatically to the W&B project `Bitflip-CLM-Fine-tune`.

- **Training loss** is logged every 50 steps (configurable via `--log_train_loss_steps`).
- **Validation perplexity** is evaluated at the end of each epoch on the first 64 batches of the validation set.

### 5. Output

After training completes, the fine-tuned model (with LoRA weights merged into the base model) and tokenizer are saved to the output directory:

```
./output/Llama-3.1-8B-bitflip-lora/
├── config.json
├── model.safetensors
├── tokenizer.json
├── tokenizer_config.json
└── all_results.json # Final perplexity
```

## Results

!!! warning "Results Pending"
The following results are placeholders and will be updated once experiments complete.

### Training Curves

![Bitflip LoRA Fine-Tuning Curves](../images/bitflip/7b-lora-trainloss.png){ width=720px }


| Metric | Value |
|--------|-------|
| Final Training Loss | *2.50* |
| Final Validation Perplexity | *11.01* |
| Total Training Steps | *4883* |

### Comparison with Baselines

We evaluate the model under three conditions:

| Bitflipped | Fine-tuned | Bitflip Config | Fine-tune Config | Train PPL |
|-------|---------------|------------------| ---------| ----|
| ✘ | ✘ | N/A | N/A | *7.91* |
| ✔ | ✘ | `w/x_p_exp=1.53e-5, w/x_p_frac=1.53e-5`| N/A | *1008.95* |
| ✔ | ✔ | `w/x_p_exp=1.53e-5, w/x_p_frac=1.53e-5` | LoRA rank=32 | *11.01* |

From the table above, we can see that *LoRA fine-tuning effectively mitigates the impact of bitflip noise, reducing perplexity from 1008.95 to 11.01* for Llama-3.1-8B.

We expect that with more trainable parameters (e.g., a larger LoRA rank, or full fine-tuning), the model could compensate for the noise even better.

### Resources

| Resource | Link |
|----------|------|
| W&B Logs | *https://wandb.ai/cz98/Bitflip-CLM-Fine-tune* |
| Training Config | [`transform_cfg.toml`](https://github.com/AICrossSim/NewComputeBench/blob/master/experiments/llm-bitflip/lora_finetune/transform_cfg.toml) |

## Appendix: Evaluation Scripts

The comparison table above was generated with two evaluation-only wrappers that reuse `run_clm_no_trainer.py` but bypass any optimizer steps. Both scripts share the signature `./script.sh [num_processes] [model_name_or_path] [per_device_batch_size] [block_size] [eval_max_steps]` so you can sweep models or batch sizes without editing Python code.

| Script | Purpose | Notes |
|--------|---------|-------|
| [`experiments/llm-bitflip/lora_finetune/eval-bitflip-no-finetune.sh`](https://github.com/AICrossSim/NewComputeBench/blob/master/experiments/llm-bitflip/lora_finetune/eval-bitflip-no-finetune.sh) | Measures perplexity when random bitflips are injected during inference. | This is the bitflipped (✔), not fine-tuned (✘) entry. |
| [`experiments/llm-bitflip/lora_finetune/eval-no-biflip-no-finetune.sh`](https://github.com/AICrossSim/NewComputeBench/blob/master/experiments/llm-bitflip/lora_finetune/eval-no-biflip-no-finetune.sh) | Serves as the clean baseline (no injected bitflips, no fine-tuning) so we can isolate the effect of noise. | This is the bitflip-free (✘), not fine-tuned (✘) entry. |
Binary file added docs/images/bitflip/7b-lora-trainloss.png
10 changes: 9 additions & 1 deletion docs/index.md
@@ -12,16 +12,24 @@
- [x] Filter out promising new compute paradigms by running small & medium scale experiments (Roberta on GLUE)
- [ ] Scale up the promising new compute paradigms to large-scale language models
- [ ] Fine-tuning/pretraining of CLM models (60M - 1.1B)
- [x] Random bitflip
- [x] Optical compute
- [ ] Spiking neural networks
- [ ] In-memory compute
- [ ] Parameter-efficient fine-tuning of larger LLMs (e.g., Llama-3.1-8B)
- [x] Random bitflip (promising results)
- [x] Optical compute (failed to converge)


## What's New

- 🚧**4th Oct, 2025 Milestone**: Fine-tuning/pretraining of alternative compute paradigms on CLMs.
- **4th Feb, 2026 Milestone**: We have successfully fine-tuned Llama-3.1-8B with random bitflip noise injected in forward passes, with promising results: LoRA adapters with only 1.2% trainable parameters effectively mitigate the effect of the noise, reducing perplexity from 1008.95 to 11.01 (the original clean perplexity is 7.91).

| Item | Description |
| ---- | ----------- |
| Llama-3.1-8B with random bitflip noise | [Tutorial](./02-model-behaviour-level-simulation/clm-bitflip-lora-finetune.md) |

- **4th Oct, 2025 Milestone**: Fine-tuning/pretraining of alternative compute paradigms on CLMs.

| Item | Description |
| ---- | ----------- |
53 changes: 53 additions & 0 deletions experiments/llm-bitflip/lora_finetune/eval-bitflip-no-finetune.sh
@@ -0,0 +1,53 @@
#!/bin/bash

# Evaluation-only run on train/validation splits with bitflip LoRA transform (no trainable params).
# Usage: ./eval-bitflip-no-finetune.sh [num_processes] [model_name_or_path] [per_device_batch_size] [block_size] [eval_max_steps]

SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)
RUN_SCRIPT="${SCRIPT_DIR}/run_clm_no_trainer.py"
TRANSFORM_CFG="${SCRIPT_DIR}/transform_cfg.toml"

NUM_PROCESSES=${1:-8}
MODEL_NAME_OR_PATH=${2:-"unsloth/Llama-3.1-8B"}
PER_DEVICE_BATCH_SIZE=${3:-1}
BLOCK_SIZE=${4:-2048}
EVAL_MAX_STEPS=${5:-64}

OUTPUT_DIR="${SCRIPT_DIR}/output/$(basename ${MODEL_NAME_OR_PATH})-bitflip-lora-eval"
WANDB_TAGS="${MODEL_NAME_OR_PATH},bitflip,eval"

echo "============================================"
echo "Evaluation Only (Bitflip LoRA):"
echo "============================================"
echo "Model: ${MODEL_NAME_OR_PATH}"
echo "Number of Processes: ${NUM_PROCESSES}"
echo "Per Device Batch Size: ${PER_DEVICE_BATCH_SIZE}"
echo "Block Size: ${BLOCK_SIZE}"
if [ "${EVAL_MAX_STEPS}" -gt 0 ]; then
echo "Eval Max Steps per split: ${EVAL_MAX_STEPS}"
else
echo "Eval Max Steps per split: full dataset"
fi
echo "Output Directory: ${OUTPUT_DIR}"
echo "Wandb Tags: ${WANDB_TAGS}"
echo "============================================"

uv run accelerate launch --num_processes=${NUM_PROCESSES} \
"${RUN_SCRIPT}" \
--model_name_or_path ${MODEL_NAME_OR_PATH} \
--dataset_name Cheng98/fineweb-edu-1.25B \
--per_device_train_batch_size ${PER_DEVICE_BATCH_SIZE} \
--per_device_eval_batch_size ${PER_DEVICE_BATCH_SIZE} \
--num_train_epochs 1 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type linear \
--output_dir ${OUTPUT_DIR} \
--preprocessing_num_workers 32 \
--trust_remote_code \
--with_tracking \
--report_to wandb \
--transform_cfg "${TRANSFORM_CFG}" \
--block_size ${BLOCK_SIZE} \
--eval_only \
--eval_max_steps ${EVAL_MAX_STEPS} \
--wandb_tags ${WANDB_TAGS}