Linked LoRA skill frames for small continual-learning experiments.
SkillStack New is an experimental memory architecture for continual learning. It stores each learned skill as a separate linked frame instead of repeatedly overwriting a single adapter. Each frame keeps a saved LoRA adapter, task metadata, a parent pointer, and evaluation metrics.
The current prototype uses Llama-family QLoRA sequence-classification adapters on small text-classification tasks.
Download the current SkillStack paper PDF.
Standard sequential fine-tuning can overwrite earlier skills. SkillStack New uses a stack of skill frames:
frame 0: IMDB adapter
frame 1: Yelp adapter, parent=frame 0
frame 2: Rotten adapter, parent=frame 1
...
At inference time, a router selects the frame that should answer the input. The main research question is whether old frames preserve old skills better than the latest sequential adapter.
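The frame layout described above can be sketched as a small linked structure. This is only an illustration under assumptions: the class names, field names, adapter paths, and metric values below are hypothetical placeholders, not the prototype's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SkillFrame:
    """One saved skill: adapter location, task metadata, parent link, metrics."""
    frame_id: int
    task_name: str
    adapter_path: str                 # hypothetical directory holding the saved LoRA adapter
    parent_id: Optional[int] = None   # None marks the root frame
    metrics: Dict[str, float] = field(default_factory=dict)


class SkillStack:
    """Append-only stack; each new frame points at the previous one."""

    def __init__(self) -> None:
        self.frames: List[SkillFrame] = []

    def push(self, task_name: str, adapter_path: str, metrics: Dict[str, float]) -> SkillFrame:
        parent_id = self.frames[-1].frame_id if self.frames else None
        frame = SkillFrame(len(self.frames), task_name, adapter_path, parent_id, dict(metrics))
        self.frames.append(frame)
        return frame

    def chain_from_latest(self) -> List[SkillFrame]:
        """Frames in traversal order: latest first, then each parent in turn."""
        chain = []
        current = self.frames[-1] if self.frames else None
        while current is not None:
            chain.append(current)
            current = None if current.parent_id is None else self.frames[current.parent_id]
        return chain


stack = SkillStack()
stack.push("imdb", "frames/0_imdb", {"eval_acc": 0.0})      # placeholder metrics
stack.push("yelp", "frames/1_yelp", {"eval_acc": 0.0})
stack.push("rotten", "frames/2_rotten", {"eval_acc": 0.0})
print([f.task_name for f in stack.chain_from_latest()])  # ['rotten', 'yelp', 'imdb']
```

Keeping the stack append-only means an old frame is never modified when a new task is learned, which is the property the forgetting comparison relies on.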
The newest router variant, hybrid_chain, uses the same strong word+character
SVM scores as the external router, but walks the linked stack from the latest
frame to its parents before falling back to the best global score.
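One plausible reading of this traversal, as a minimal sketch: walk the chain from the latest frame, accept the first frame whose router score clears a threshold, and otherwise fall back to the globally best score. The acceptance rule, the function name, and the return format are assumptions made for illustration; the default threshold mirrors the --chain-threshold 0.0 flag used in the run command.

```python
from typing import Dict, List, Optional, Tuple


def route_hybrid_chain(
    scores: Dict[str, float],
    chain: List[str],
    threshold: float = 0.0,
) -> Tuple[str, Optional[int], bool]:
    """Pick a frame by walking the chain, else fall back to the top global score.

    scores: router score per frame name (for example, one-vs-rest SVM margins).
    chain:  frame names ordered from the latest frame to the root.
    Returns (chosen_frame, traversal_depth_or_None, used_fallback).
    """
    for depth, name in enumerate(chain, start=1):
        if scores[name] > threshold:   # assumed acceptance rule
            return name, depth, False
    best = max(scores, key=scores.get)
    return best, None, True


chain = ["rotten", "yelp", "imdb"]
print(route_hybrid_chain({"imdb": 0.7, "yelp": -0.4, "rotten": -0.9}, chain))
# ('imdb', 3, False)
print(route_hybrid_chain({"imdb": -0.2, "yelp": -0.1, "rotten": -0.6}, chain))
# ('yelp', None, True)
```

Under this reading, the "fallback" and "depth" figures in the results tables would correspond to how often the third return value is true and to the mean of the second.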
The strongest current check is a stabilized 8-task Llama-family QLoRA run:
IMDB -> Yelp -> Rotten Tomatoes -> SST-2 -> Amazon -> AGNews -> Yahoo -> DBpedia
Each task used 48 examples per class for LoRA training, 64 examples per class
for evaluation, 3 epochs, lr=2e-4, and a reset classification head between
frames. The router used extra task-identification examples.
| Model | Router | Seed | Latest adapter forgetting | SkillStack forgetting | SkillStack task-aware avg | SkillStack routed overall | Router task acc | Stack traversal |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-1B | hybrid_chain | 42 | 34.1pp avg | ~0pp | 87.6% | 85.4% | 82.6% | fallback 21.0%, depth 3.53 |
| Llama-3.2-1B | hybrid_chain | 123 | 23.6pp avg | ~0pp | 87.8% | 87.0% | 85.5% | fallback 21.6%, depth 3.55 |
| Llama-3.2-3B | hybrid_chain | 42 | 24.5pp avg | ~0pp | 86.4% | 84.4% | 82.6% | fallback 21.0%, depth 3.53 |
| Llama-3.2-3B | hybrid_chain | 123 | 35.2pp avg | ~0pp | 90.5% | 89.1% | 85.5% | fallback 21.6%, depth 3.55 |
Plainly: the latest sequential adapter forgot many earlier tasks, while retrieving the saved SkillStack frame kept SkillStack forgetting near zero. With routing, SkillStack kept most of that saved-frame performance while traversing the linked stack instead of using only the final adapter.
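The forgetting columns can be read through a common definition: for each earlier task, the accuracy measured right after that task was learned minus the accuracy measured at the end of the sequence, averaged over tasks. The sketch below assumes that definition, which may differ in detail from the experiment script, and uses made-up placeholder accuracies.

```python
from typing import Dict, List


def average_forgetting(acc_after_training: Dict[str, float],
                       acc_final: Dict[str, float],
                       task_order: List[str]) -> float:
    """Mean accuracy drop (percentage points) over every task except the last one learned."""
    earlier = task_order[:-1]
    drops = [acc_after_training[t] - acc_final[t] for t in earlier]
    return sum(drops) / len(drops)


order = ["task_a", "task_b", "task_c"]
after_training = {"task_a": 90.0, "task_b": 80.0, "task_c": 85.0}   # placeholder values
final_sequential = {"task_a": 60.0, "task_b": 70.0, "task_c": 85.0}

# Latest sequential adapter: earlier tasks are re-evaluated with the final weights.
print(average_forgetting(after_training, final_sequential, order))  # 20.0

# Saved frames: each task is evaluated with its own unchanged adapter.
print(average_forgetting(after_training, after_training, order))    # 0.0
```

The second call shows why retrieving an unmodified saved frame gives zero forgetting by construction when the correct frame is selected; the routed numbers then measure how much is lost to routing mistakes.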
A harder 12-task stress test adds QNLI entailment, MRPC paraphrase, QQP duplicate-question detection, and BoolQ yes/no QA:
IMDB -> Yelp -> Rotten -> SST-2 -> Amazon -> QNLI -> MRPC -> QQP -> BoolQ -> AGNews -> Yahoo -> DBpedia
This test used 64 examples per class for LoRA training, 96 examples per class
for evaluation, 256 routing examples per task, 3 epochs, lr=2e-4, and a reset
classification head between frames. BoolQ remained the weakest task, but the
stack still preserved learned frames while the final sequential adapter forgot
heavily.
| Model | Router | Seed | Latest adapter forgetting | SkillStack forgetting | SkillStack task-aware avg | SkillStack routed overall | Router task acc | Stack traversal |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-1B | hybrid_chain | 42 | 36.5pp avg | ~0pp | 80.3% | 79.3% | 88.7% | fallback 16.8%, depth 5.40 |
| Llama-3.2-1B | hybrid_chain | 123 | 28.4pp avg | ~0pp | 80.6% | 79.9% | 89.8% | fallback 16.1%, depth 5.39 |
| Llama-3.2-3B | hybrid_chain | 42 | 30.9pp avg | ~0pp | 84.2% | 82.7% | 88.7% | hard12 stress run |
A fixed multi-head joint baseline was also run on the two 1B hard12 seeds. It averaged 76.2%, while task-aware SkillStack averaged 80.5% and routed SkillStack averaged 79.6%.
An initial hard12 EWC baseline over two 1B seeds averaged 57.9% final accuracy with 22.6pp average forgetting. On the same seeds, routed SkillStack averaged 79.6%.
A separate generated-answer CausalLM LoRA probe used mixed instruction-style
tasks (gen_mixed5: IMDB, Amazon, QQP, BoolQ, AGNews). Over two seeds, the
latest sequential generative adapter averaged 60.3%, while routed SkillStack
averaged 75.8% with 0.0pp SkillStack forgetting and 96.7% router task accuracy.
QQP and BoolQ remain weak, so this is early evidence rather than a final
generative benchmark.
The newest compression probe tests whether SkillStack must keep every saved frame. A mixed 8-frame stack was trained with five sentiment skills plus QQP, BoolQ, and AGNews:
IMDB -> Amazon -> Yelp -> Rotten -> SST-2 -> QQP -> BoolQ -> AGNews
After training, all sentiment traffic was compressed into one sentiment anchor
frame (SST-2), while QQP, BoolQ, and AGNews remained separate.
| Seed | Full frames | Full routed | Compressed frames | Compressed routed | Adapter memory | Accuracy change |
|---|---|---|---|---|---|---|
| 42 | 8 | 82.4% | 4 | 84.4% | -50.0% | +2.0pp |
| 123 | 8 | 80.8% | 4 | 82.2% | -50.0% | +1.5pp |
This suggests that related sentiment frames can be compressed into a single representative frame in this setting. A boundary stress test then removed one of the remaining cross-family frames:
| Removed after compression | Seed 42 overall loss | Seed 123 overall loss | Main task hit |
|---|---|---|---|
| No sentiment frame (SST-2 removed) | 0.9pp | 9.7pp | SST-2 -3.1pp / -7.0pp |
| QQP removed | 0.4pp | 2.0pp | QQP -18.8pp / -27.3pp |
| BoolQ removed | -0.4pp | -1.5pp | BoolQ -12.5pp / -1.6pp |
| AGNews removed | 2.7pp | 5.0pp | AGNews -37.5pp / -52.3pp |
The key interpretation is that SkillStack has redundancy inside related skill families, but cross-family compression exposes sharp task-specific failures.
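The within-family compression step can be pictured as a routing remap: every sentiment task is served by the SST-2 anchor frame, and the other frames are kept. The sketch below is a post-hoc illustration with hypothetical function names, and it assumes all adapters are the same size, so memory scales with the frame count.

```python
from typing import Dict, List

SENTIMENT_TASKS = ["imdb", "amazon", "yelp", "rotten", "sst2"]
OTHER_TASKS = ["qqp", "boolq", "agnews"]


def compress_family(frames: List[str], family: List[str], anchor: str) -> Dict[str, str]:
    """Map each routed task to the frame that serves it after compression."""
    if anchor not in family:
        raise ValueError("anchor must belong to the family being compressed")
    return {task: (anchor if task in family else task) for task in frames}


def memory_change(n_before: int, n_after: int) -> float:
    """Relative change in adapter count (percent), assuming equally sized adapters."""
    return (n_after - n_before) / n_before * 100.0


frames = SENTIMENT_TASKS + OTHER_TASKS
remap = compress_family(frames, SENTIMENT_TASKS, anchor="sst2")
kept = sorted(set(remap.values()))
print(kept)                                               # ['agnews', 'boolq', 'qqp', 'sst2']
print(f"{memory_change(len(frames), len(kept)):+.1f}%")   # -50.0%
```

Going from 8 frames to 4 reproduces the -50.0% adapter-memory figure in the compression table.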
The newest skill-generalization probe tests whether routed SkillStack can reuse saved frames on mixed unseen examples, not only preserve old benchmark scores:
| Seed | Latest adapter avg | Routed SkillStack avg | Best single frame | Routed gain |
|---|---|---|---|---|
| 42 | 46.4% | 89.8% | 87.0% | +43.5pp |
| 123 | 75.0% | 91.1% | 88.3% | +16.1pp |
| 7 | 39.8% | 93.5% | 89.3% | +53.6pp |
| Mean | 53.7% | 91.5% | 88.2% | +37.7pp |
An initial EWC baseline was also run on the stable 1B hard8 seed-42 setting. This EWC implementation regularizes trainable LoRA parameters with a diagonal Fisher estimate and excludes the classification head:
| Method | Avg accuracy | Avg forgetting | Notes |
|---|---|---|---|
| Latest sequential adapter | 52.3% | 32.4pp | final adapter only |
| EWC baseline | 57.9% | 27.5pp | diagonal Fisher on LoRA params |
| SkillStack task-aware | 84.8% | 0.0pp | correct saved frame selected |
| SkillStack routed | 83.2% | n/a | hybrid_chain router |
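For reference, the standard diagonal-Fisher EWC objective, restricted to the LoRA parameters as described above, has the following form. The symbols are assumptions introduced here: $\theta^{*}$ denotes the LoRA parameters saved after the previous task, $F_i$ the diagonal Fisher estimate for parameter $i$, and $\lambda$ the penalty strength. The actual value of $\lambda$ and the way Fisher estimates are accumulated across tasks are not stated in this README.

$$
\mathcal{L}(\theta) = \mathcal{L}_{\text{task}}(\theta) + \frac{\lambda}{2} \sum_{i \in \text{LoRA}} F_i \left(\theta_i - \theta_i^{*}\right)^2
$$

The sum runs only over trainable LoRA parameters; the classification head is left out of the penalty, matching the description of this baseline.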
- Saved frames preserve old skills with near-zero forgetting in the tested settings.
- On the skill-generalization probe, routed SkillStack averaged 91.5% across three seeds, compared with 53.7% for the final sequential adapter.
- On three hard8 seeds, SkillStack is close to or above an all-data multi-head joint baseline.
- The same hard8 pattern reproduced on a larger Llama-3.2-3B model. Across four recorded seeds, task-aware SkillStack averaged 84.4% vs 77.8% for the fixed multi-head joint baseline; routed SkillStack averaged 82.7%.
- In the harder hard12 stress test, routed SkillStack reached 79.3-79.9% on Llama-3.2-1B and 82.7% on Llama-3.2-3B while the final sequential adapter stayed near 44-53% average accuracy.
- In the mixed8 compression probe, collapsing five sentiment frames into one SST-2 anchor reduced adapter memory by 50% and slightly improved routed accuracy in two seeds. Removing cross-family frames such as QQP or AGNews caused large per-task failures, suggesting that compression should happen within related skill families rather than across all skills.
- The new hybrid_chain router matches the external hybrid_svm control on a stable 1B hard8 setting while actually traversing the linked stack, and has two strong 3B validation runs.
- Independent adapters trained from scratch performed much worse in the same low-data setting.
- The best current routers are word+character TF-IDF LinearSVM variants: external hybrid_svm for the simplest baseline and hybrid_chain for stack-aware traversal.
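A word+character TF-IDF router of the kind named in the last point can be sketched with scikit-learn as follows. This is a toy illustration, not the project's router: the example texts are made-up placeholders, and the n-gram ranges and default LinearSVC settings are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# Toy task-identification examples (placeholders, not data from the runs).
texts = [
    "the movie was wonderful and the acting superb",
    "a dull film with a terrible plot",
    "the restaurant served great food and the staff were friendly",
    "worst service ever, the pizza arrived cold",
    "stocks fell sharply after the central bank announcement",
    "the team won the championship game in overtime",
]
tasks = ["imdb", "imdb", "yelp", "yelp", "agnews", "agnews"]

router = make_pipeline(
    make_union(
        TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),     # word features
        TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # character features
    ),
    LinearSVC(),
)
router.fit(texts, tasks)

query = ["the film had a great plot"]
scores = router.decision_function(query)   # shape (1, n_tasks): one margin per task
print(dict(zip(router.classes_, scores[0].round(3))))
print(router.predict(query)[0])
```

The per-task margins from decision_function are the kind of scores a stack-aware router could consume while walking the parent chain, whereas the external baseline would simply take the top-scoring task.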
- This is not a universal continual-learning solution.
- Results are still small-data Kaggle/Colab runs: mostly Llama-3.2-1B, plus Llama-3.2-3B validation runs.
- The router is not learned end-to-end with the LLM.
- Earlier stack-native routers were weaker; hybrid_chain is promising, but still needs more seeds, larger-model repeats, and stronger baselines.
- The hard12 result is a stress test, not a fully tuned benchmark; BoolQ remains weak and needs better formatting, longer context, or more data.
- A domain-confusion router probe shows that hybrid_chain is brittle when task inputs are deliberately rewritten in another task's surface format: normal router accuracy averaged 89.3%, but domain-confusion accuracy averaged 47.5%. A format-augmented router recovered to 89.4% on the same probe, while keeping normal routing at 89.2%. On held-out rewrite templates, the same augmented router reached 86.6%, suggesting the fix is not only memorizing one exact template family.
- A separate adversarial-keyword probe showed the original router dropping from 89.6% normal accuracy to 67.4% when misleading keywords were appended. A keyword-augmented router averaged 90.3% on same-template adversarial examples with only 0.4% keyword-trap rate, and 88.6% on held-out keyword templates.
- A hard5 order-sensitivity check found that reverse task order produced higher saved-frame/routed accuracy over two seeds (89.2% routed vs 84.3% forward), but router task accuracy stayed near 99% in both orders. This suggests the observed sensitivity is mostly frame-learning quality, not chain-router failure.
- Replay and joint baselines need more tuning before broad claims.
- The current EWC comparison is only an initial LoRA-parameter EWC baseline, not a fully tuned full-model EWC study.
- The compression result is currently a post-hoc routing/adapter-memory probe: it shows that some saved frames can be removed after training, but it does not yet implement an automatic online merge/prune algorithm.
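The format-augmented router mentioned in the domain-confusion item above can be pictured as a data-augmentation step: each router training example is rendered in several tasks' surface formats while keeping its true task label, so the router cannot rely on format alone. The templates and function name below are hypothetical; the probe's real rewrite templates are not listed in this README.

```python
from typing import Dict, List, Tuple

# Hypothetical surface templates, one per task.
TEMPLATES: Dict[str, str] = {
    "imdb": "Movie review: {text}",
    "qqp": "Question 1: {text} Question 2: {text}",
    "agnews": "News headline: {text}",
}


def augment_with_formats(examples: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Render every example in every task's template while keeping its true task label."""
    augmented = []
    for text, task in examples:
        for template in TEMPLATES.values():
            augmented.append((template.format(text=text), task))
    return augmented


train = [
    ("the plot was thin but the cast was great", "imdb"),
    ("central bank raises interest rates", "agnews"),
]
augmented = augment_with_formats(train)
print(len(augmented))   # 6
print(augmented[2])     # ('News headline: the plot was thin but the cast was great', 'imdb')
```

Evaluating on templates that were held out of TEMPLATES during training is what separates learning the skill identity from memorizing one template family.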
- scripts/colab_multiframe_router_test.py: Main Colab/Kaggle experiment script.
- scripts/colab_skillstack_test.py: Smaller two-task prototype.
- scripts/colab_skill_generalization_probe.py: Tests whether saved frames help on mixed unseen examples and transfer-style probes.
- scripts/colab_answer_quality_memory_probe.py: Checks whether SkillStack-style memory context helps or hurts generated answers.
- scripts/colab_generative_skillstack_test.py: Trains CausalLM LoRA skill frames and tests generated-answer forgetting.
- scripts/colab_domain_confusion_router_probe.py: Stress-tests whether the router follows semantic skill identity or misleading input format.
- scripts/summarize_results.py: Summarizes result files and prints router confusion diagnostics.
- docs/colab.md: Copy-paste notebook instructions.
- docs/prepublish_tests.md: Full experiment log and extra test commands.
- docs/project_pitch.md: Short project pitch for GitHub, competitions, and outreach.
- docs/testing_roadmap.md: Practical next-test plan and tests to stop repeating.
- docs/validation_plan.md: Stronger validation plan for larger models, more seeds, and tougher baselines.
- docs/hard12_results.md: Recorded hard12 stress-test and multi-head showdown results.
- docs/generative_skillstack_results.md: First CausalLM LoRA SkillStack generated-answer forgetting result.
- docs/domain_confusion_probe.md: Domain-confusion router stress-test instructions.
- docs/order_sensitivity_test.md: Forward-vs-reverse task-order test for parent-chain routing stability.
- docs/frame_redundancy_test.md: Sentiment-frame ablation test for redundancy and substitution behavior.
- docs/frame_compression_test.md: Mixed-skill compression and global compression-boundary results.
- PUBLISHING.md: Exact upload checklist and public-claim wording.
- paper.pdf: Current compiled technical report for quick reading.
- paper/: Overleaf-ready technical report draft and future-work notes.
- results/: Hand-recorded JSON summaries from the runs so far.
Install dependencies in Colab or Kaggle:
!pip install -q -U "transformers>=4.43.0" datasets accelerate peft bitsandbytes scikit-learn

Write the main script:

%%writefile colab_multiframe_router_test.py

Paste scripts/colab_multiframe_router_test.py, then run:

!CUDA_VISIBLE_DEVICES=0 HF_TOKEN=$HF_TOKEN python colab_multiframe_router_test.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --preset hard8 \
  --device cuda \
  --samples-per-class 48 \
  --router-samples-per-class 256 \
  --eval-samples-per-class 64 \
  --max-length 160 \
  --epochs-per-task 3 \
  --lr 2e-4 \
  --router hybrid_chain \
  --chain-threshold 0.0 \
  --reset-head-between-frames \
  --seed 42 \
  --output-dir skillstack_new_hard8_hybrid_chain_stable_seed42

For Kaggle, store your Hugging Face token as a secret named HF_TOKEN. If using meta-llama/Llama-3.2-1B-Instruct, your Hugging Face account must have access to the gated Meta model.
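One way to make the token available to the command above is a small notebook cell that reads the Kaggle secret and falls back to an existing environment variable elsewhere. This is a sketch, not part of the repository: the helper name load_hf_token is hypothetical, and it relies on the kaggle_secrets module that Kaggle notebooks provide.

```python
import os


def load_hf_token(secret_name: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token from Kaggle secrets, else from the environment."""
    try:
        # kaggle_secrets is only importable inside Kaggle notebooks.
        from kaggle_secrets import UserSecretsClient
        token = UserSecretsClient().get_secret(secret_name)
    except Exception:
        # Not on Kaggle, or the secret is not attached to this notebook.
        token = os.environ.get(secret_name, "")
    if not token:
        raise RuntimeError(f"{secret_name} is not set")
    return token


# In a notebook cell, before launching the experiment:
# os.environ["HF_TOKEN"] = load_hf_token()
```

Exporting the value into os.environ lets the $HF_TOKEN reference in the run command expand to the stored secret.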
--preset hard8
--samples-per-class 48
--router-samples-per-class 256
--eval-samples-per-class 64
--max-length 160
--epochs-per-task 3
--lr 2e-4
--router hybrid_chain
--chain-threshold 0.0
--reset-head-between-frames
--router-samples-per-class 512 gave only a tiny improvement over 256, so 256
is a reasonable default.
- Update the technical report with hard12, router stress, generative, and frame-compression results.
- Repeat the mixed8 compression probe on another model size.
- Add stronger replay/task-conditioned baselines.
- Package a clean demo that loads saved frames and routes examples.
See docs/testing_roadmap.md for the current testing priority list.
See docs/validation_plan.md for stronger tests before competitions or serious
outreach.
Research prototype. Good enough for a GitHub project and further experiments; not yet a production method or a broad scientific claim.