
g1g4b1t/skillstack-new


SkillStack New

Linked LoRA skill frames for small continual-learning experiments.

SkillStack New is an experimental memory architecture for continual learning. It stores each learned skill as a separate linked frame instead of overwriting one adapter forever. Each frame keeps a saved LoRA adapter, task metadata, a parent pointer, and evaluation metrics.

The current prototype uses Llama-family QLoRA sequence-classification adapters on small text-classification tasks.

Paper

Download the current SkillStack paper PDF (paper.pdf in this repository).

Idea

Standard sequential fine-tuning can overwrite earlier skills. SkillStack New uses a stack of skill frames:

frame 0: IMDB adapter
frame 1: Yelp adapter, parent=frame 0
frame 2: Rotten adapter, parent=frame 1
...

At inference time, a router selects the frame that should answer the input. The main research question is whether old frames preserve old skills better than the latest sequential adapter.
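As a minimal sketch of the linked-frame idea, the structure could look like the following. The class name `SkillFrame`, the helpers `push_frame` and `chain`, and the `frames/...` adapter paths are all hypothetical names chosen for illustration; they are not taken from the project scripts.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SkillFrame:
    """One saved skill: adapter location, task metadata, parent link, metrics."""
    frame_id: int
    task: str
    adapter_path: str                      # hypothetical path to the saved LoRA adapter
    parent: Optional["SkillFrame"] = None  # previous frame in the stack, None for frame 0
    metrics: Dict[str, float] = field(default_factory=dict)


def push_frame(top: Optional[SkillFrame], task: str, adapter_path: str) -> SkillFrame:
    """Create a new frame whose parent is the current top of the stack."""
    next_id = 0 if top is None else top.frame_id + 1
    return SkillFrame(frame_id=next_id, task=task, adapter_path=adapter_path, parent=top)


def chain(top: Optional[SkillFrame]) -> List[SkillFrame]:
    """Return frames from the latest one back to frame 0 by following parent pointers."""
    frames = []
    node = top
    while node is not None:
        frames.append(node)
        node = node.parent
    return frames


top = None
for task in ["imdb", "yelp", "rotten"]:
    top = push_frame(top, task, f"frames/{task}")

print([f"{f.frame_id}:{f.task}" for f in chain(top)])  # ['2:rotten', '1:yelp', '0:imdb']
```

Earlier frames are never modified when a new one is pushed, which is the property the forgetting comparison relies on.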

The newest router variant, hybrid_chain, uses the same strong word+character SVM scores as the external router, but walks the linked stack from the latest frame to its parents before falling back to the best global score.
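The traversal described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: it assumes per-frame router scores are already computed (for example, one-vs-rest SVM margins) and that a frame is accepted as soon as its score exceeds a threshold. The function name `route_hybrid_chain` and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional, Tuple


def route_hybrid_chain(
    scores: Dict[str, float],
    parents: Dict[str, Optional[str]],
    latest: str,
    threshold: float = 0.0,
) -> Tuple[str, int, bool]:
    """Walk latest -> parents and return (frame, depth, used_fallback).

    Assumption: a frame is accepted as soon as its own router score
    exceeds `threshold`; if no frame on the chain qualifies, the frame
    with the best global score is used instead.
    """
    depth = 0
    node: Optional[str] = latest
    while node is not None:
        depth += 1
        if scores[node] > threshold:
            return node, depth, False
        node = parents[node]
    best = max(scores, key=scores.get)
    return best, depth, True


parents = {"imdb": None, "yelp": "imdb", "rotten": "yelp"}

# A frame deep in the chain accepts the input.
print(route_hybrid_chain({"imdb": 0.7, "yelp": -0.4, "rotten": -0.9}, parents, "rotten"))
# ('imdb', 3, False)

# No frame clears the threshold, so the best global score wins.
print(route_hybrid_chain({"imdb": -0.2, "yelp": -0.5, "rotten": -0.1}, parents, "rotten"))
# ('rotten', 3, True)
```

The returned depth and fallback flag correspond to the kind of "depth" and "fallback" statistics reported in the Stack traversal column of the results tables.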

Key Result

The strongest current check is a stabilized 8-task Llama-family QLoRA run:

IMDB -> Yelp -> Rotten Tomatoes -> SST-2 -> Amazon -> AGNews -> Yahoo -> DBpedia

Each task used 48 examples per class for LoRA training, 64 examples per class for evaluation, 3 epochs, lr=2e-4, and a reset classification head between frames. The router was trained on additional task-identification examples (256 per class in the Quick Run command below).

| Model | Router | Seed | Latest adapter forgetting | SkillStack forgetting | SkillStack task-aware avg | SkillStack routed overall | Router task acc | Stack traversal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B | hybrid_chain | 42 | 34.1pp avg | ~0pp | 87.6% | 85.4% | 82.6% | fallback 21.0%, depth 3.53 |
| Llama-3.2-1B | hybrid_chain | 123 | 23.6pp avg | ~0pp | 87.8% | 87.0% | 85.5% | fallback 21.6%, depth 3.55 |
| Llama-3.2-3B | hybrid_chain | 42 | 24.5pp avg | ~0pp | 86.4% | 84.4% | 82.6% | fallback 21.0%, depth 3.53 |
| Llama-3.2-3B | hybrid_chain | 123 | 35.2pp avg | ~0pp | 90.5% | 89.1% | 85.5% | fallback 21.6%, depth 3.55 |

Plainly: the latest sequential adapter forgot many earlier tasks, while retrieving the saved SkillStack frame kept SkillStack forgetting near zero. With routing, SkillStack kept most of that saved-frame performance while traversing the linked stack instead of using only the final adapter.
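The forgetting columns can be read with a small worked example. The sketch below assumes the common definition of forgetting: a task's accuracy right after it was learned minus its accuracy at the end of the sequence, in percentage points (pp), averaged over the tasks passed in. The README does not spell out the exact formula used in the scripts, and the numbers here are made-up toy values, not results from the runs.

```python
from typing import Dict


def average_forgetting(acc_after_learning: Dict[str, float],
                       acc_at_end: Dict[str, float]) -> float:
    """Mean accuracy drop in percentage points over the given tasks."""
    drops = [acc_after_learning[t] - acc_at_end[t] for t in acc_after_learning]
    return sum(drops) / len(drops)


# Toy accuracies in percent (illustrative only).
after_learning = {"imdb": 90.0, "yelp": 92.0, "rotten": 86.0}

# Latest sequential adapter: earlier tasks degrade after later training.
latest_adapter = {"imdb": 60.0, "yelp": 70.0, "rotten": 86.0}

# Saved frames: each task is answered by the adapter stored when it was learned.
saved_frames = dict(after_learning)

print(round(average_forgetting(after_learning, latest_adapter), 1))  # 17.3
print(average_forgetting(after_learning, saved_frames))              # 0.0
```

Under this definition, retrieving an unchanged saved frame gives zero forgetting by construction, so the routed accuracy is the more informative number: it also includes router mistakes.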

A harder 12-task stress test adds QNLI entailment, MRPC paraphrase, QQP duplicate-question detection, and BoolQ yes/no QA:

IMDB -> Yelp -> Rotten -> SST-2 -> Amazon -> QNLI -> MRPC -> QQP -> BoolQ -> AGNews -> Yahoo -> DBpedia

This test used 64 examples per class for LoRA training, 96 examples per class for evaluation, 256 routing examples per task, 3 epochs, lr=2e-4, and a reset classification head between frames. BoolQ remained the weakest task, but the stack still preserved learned frames while the final sequential adapter forgot heavily.

| Model | Router | Seed | Latest adapter forgetting | SkillStack forgetting | SkillStack task-aware avg | SkillStack routed overall | Router task acc | Stack traversal |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.2-1B | hybrid_chain | 42 | 36.5pp avg | ~0pp | 80.3% | 79.3% | 88.7% | fallback 16.8%, depth 5.40 |
| Llama-3.2-1B | hybrid_chain | 123 | 28.4pp avg | ~0pp | 80.6% | 79.9% | 89.8% | fallback 16.1%, depth 5.39 |
| Llama-3.2-3B | hybrid_chain | 42 | 30.9pp avg | ~0pp | 84.2% | 82.7% | 88.7% | hard12 stress run |

A fixed multi-head joint baseline was also run on the two 1B hard12 seeds. It averaged 76.2%, while task-aware SkillStack averaged 80.5% and routed SkillStack averaged 79.6%.

An initial hard12 EWC baseline over two 1B seeds averaged 57.9% final accuracy with 22.6pp average forgetting. On the same seeds, routed SkillStack averaged 79.6%.

A separate generated-answer CausalLM LoRA probe used mixed instruction-style tasks (gen_mixed5: IMDB, Amazon, QQP, BoolQ, AGNews). Over two seeds, the latest sequential generative adapter averaged 60.3%, while routed SkillStack averaged 75.8% with 0.0pp SkillStack forgetting and 96.7% router task accuracy. QQP and BoolQ remain weak, so this is early evidence rather than a final generative benchmark.

The newest compression probe tests whether SkillStack must keep every saved frame. A mixed 8-frame stack was trained with five sentiment skills plus QQP, BoolQ, and AGNews:

IMDB -> Amazon -> Yelp -> Rotten -> SST-2 -> QQP -> BoolQ -> AGNews

After training, all sentiment traffic was compressed into one sentiment anchor frame (SST-2), while QQP, BoolQ, and AGNews remained separate.

| Seed | Full frames | Full routed | Compressed frames | Compressed routed | Adapter memory | Accuracy change |
| --- | --- | --- | --- | --- | --- | --- |
| 42 | 8 | 82.4% | 4 | 84.4% | -50.0% | +2.0pp |
| 123 | 8 | 80.8% | 4 | 82.2% | -50.0% | +1.5pp |

This suggests that related sentiment frames can be compressed into a single representative frame in this setting. A boundary stress test then removed one of the four remaining frames at a time (the SST-2 sentiment anchor, QQP, BoolQ, or AGNews) and recorded the loss in overall accuracy:

| Removed after compression | Seed 42 overall loss | Seed 123 overall loss | Main task hit |
| --- | --- | --- | --- |
| no sentiment frame (SST-2 removed) | 0.9pp | 9.7pp | SST-2 -3.1pp / -7.0pp |
| QQP removed | 0.4pp | 2.0pp | QQP -18.8pp / -27.3pp |
| BoolQ removed | -0.4pp | -1.5pp | BoolQ -12.5pp / -1.6pp |
| AGNews removed | 2.7pp | 5.0pp | AGNews -37.5pp / -52.3pp |

The key interpretation is that SkillStack has redundancy inside related skill families, but cross-family compression exposes sharp task-specific failures.
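The within-family compression step described above can be sketched as a routing remap. This is a simplified illustration: `compress_routes` is a hypothetical helper, the lowercase task names are placeholders, and the memory figure assumes every adapter has the same size.

```python
from typing import Dict, List


def compress_routes(frames: List[str], family: List[str], anchor: str) -> Dict[str, str]:
    """Map each task to the frame that serves it after compression."""
    if anchor not in family:
        raise ValueError("anchor must belong to the compressed family")
    return {task: (anchor if task in family else task) for task in frames}


frames = ["imdb", "amazon", "yelp", "rotten", "sst2", "qqp", "boolq", "agnews"]
sentiment = ["imdb", "amazon", "yelp", "rotten", "sst2"]

routes = compress_routes(frames, sentiment, anchor="sst2")
kept = sorted(set(routes.values()))

# Assumption: equal-sized adapters, so memory scales with the frame count.
memory_change = (len(kept) - len(frames)) / len(frames) * 100

print(routes["imdb"])   # sst2
print(kept)             # ['agnews', 'boolq', 'qqp', 'sst2']
print(memory_change)    # -50.0
```

Going from 8 frames to 4 gives the -50.0% adapter-memory figure in the compression table; the boundary test then corresponds to deleting one more entry from `kept`.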

The newest skill-generalization probe tests whether routed SkillStack can reuse saved frames on mixed unseen examples, not only preserve old benchmark scores:

| Seed | Latest adapter avg | Routed SkillStack avg | Best single frame | Routed gain |
| --- | --- | --- | --- | --- |
| 42 | 46.4% | 89.8% | 87.0% | +43.5pp |
| 123 | 75.0% | 91.1% | 88.3% | +16.1pp |
| 7 | 39.8% | 93.5% | 89.3% | +53.6pp |
| Mean | 53.7% | 91.5% | 88.2% | +37.7pp |

An initial EWC baseline was also run on the stable 1B hard8 seed-42 setting. This EWC implementation regularizes trainable LoRA parameters with a diagonal Fisher estimate and excludes the classification head:

| Method | Avg accuracy | Avg forgetting | Notes |
| --- | --- | --- | --- |
| Latest sequential adapter | 52.3% | 32.4pp | final adapter only |
| EWC baseline | 57.9% | 27.5pp | diagonal Fisher on LoRA params |
| SkillStack task-aware | 84.8% | 0.0pp | correct saved frame selected |
| SkillStack routed | 83.2% | n/a | hybrid_chain router |
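For readers unfamiliar with EWC, the penalty with a diagonal Fisher estimate has the standard form (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2, where theta* are the parameters saved after the previous task. The sketch below shows that formula restricted to LoRA parameters in plain Python. It is not the project's implementation: the parameter names, the `"lora_"` name filter, and the value of `lam` are assumptions made for illustration.

```python
from typing import Dict, List


def ewc_penalty(params: Dict[str, List[float]],
                anchor: Dict[str, List[float]],
                fisher: Dict[str, List[float]],
                lam: float) -> float:
    """(lam / 2) * sum_i F_i * (theta_i - theta_star_i)^2 over LoRA params only."""
    total = 0.0
    for name, values in params.items():
        # Assumption: LoRA weights carry "lora_" in their name (as in PEFT's
        # lora_A / lora_B); everything else, including the classification
        # head, is left unregularized.
        if "lora_" not in name:
            continue
        for theta, theta_star, f in zip(values, anchor[name], fisher[name]):
            total += f * (theta - theta_star) ** 2
    return 0.5 * lam * total


# Toy values: one LoRA tensor and one classification-head tensor.
params = {"layer0.lora_A": [1.0, 2.0], "score.weight": [5.0]}
anchor = {"layer0.lora_A": [0.0, 2.0], "score.weight": [0.0]}
fisher = {"layer0.lora_A": [0.5, 4.0], "score.weight": [9.0]}

print(ewc_penalty(params, anchor, fisher, lam=2.0))  # 0.5
```

The head tensor moves a lot in this toy example but contributes nothing to the penalty, which mirrors the "excludes the classification head" choice described above.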

What Looks Promising

  • Saved frames preserve old skills with near-zero forgetting in the tested settings.
  • On the skill-generalization probe, routed SkillStack averaged 91.5% across three seeds, compared with 53.7% for the final sequential adapter.
  • On three hard8 seeds, SkillStack is close to or above an all-data multi-head joint baseline.
  • The same hard8 pattern reproduced on a larger Llama-3.2-3B model. Across four recorded seeds, task-aware SkillStack averaged 84.4% vs 77.8% for the fixed multi-head joint baseline; routed SkillStack averaged 82.7%.
  • In the harder hard12 stress test, routed SkillStack reached 79.3% to 79.9% on Llama-3.2-1B and 82.7% on Llama-3.2-3B, while the final sequential adapter stayed near 44% to 53% average accuracy.
  • In the mixed8 compression probe, collapsing five sentiment frames into one SST-2 anchor reduced adapter memory by 50% and slightly improved routed accuracy in two seeds. Removing cross-family frames such as QQP or AGNews caused large per-task failures, suggesting that compression should happen within related skill families rather than across all skills.
  • The new hybrid_chain router matches the external hybrid_svm control on a stable 1B hard8 setting while actually traversing the linked stack, and has two strong 3B validation runs.
  • Independent adapters trained from scratch performed much worse in the same low-data setting.
  • The best current routers are word+character TF-IDF LinearSVM variants: external hybrid_svm for the simplest baseline and hybrid_chain for stack-aware traversal.

What Is Not Proven Yet

  • This is not a universal continual-learning solution.
  • Results are still small-data Kaggle/Colab runs: mostly Llama-3.2-1B, plus Llama-3.2-3B validation runs.
  • The router is not learned end-to-end with the LLM.
  • Earlier stack-native routers were weaker; hybrid_chain is promising, but still needs more seeds, larger-model repeats, and stronger baselines.
  • The hard12 result is a stress test, not a fully tuned benchmark; BoolQ remains weak and needs better formatting, longer context, or more data.
  • A domain-confusion router probe shows that hybrid_chain is brittle when task inputs are deliberately rewritten in another task's surface format: normal router accuracy averaged 89.3%, but domain-confusion accuracy averaged 47.5%. A format-augmented router recovered to 89.4% on the same probe, while keeping normal routing at 89.2%. On held-out rewrite templates, the same augmented router reached 86.6%, suggesting the fix is not only memorizing one exact template family.
  • A separate adversarial-keyword probe showed the original router dropping from 89.6% normal accuracy to 67.4% when misleading keywords were appended. A keyword-augmented router averaged 90.3% on same-template adversarial examples with only 0.4% keyword-trap rate, and 88.6% on held-out keyword templates.
  • A hard5 order-sensitivity check found that reverse task order produced higher saved-frame/routed accuracy over two seeds (89.2% routed vs 84.3% forward), but router task accuracy stayed near 99% in both orders. This suggests the observed sensitivity is mostly frame-learning quality, not chain-router failure.
  • Replay and joint baselines need more tuning before broad claims.
  • The current EWC comparison is only an initial LoRA-parameter EWC baseline, not a fully tuned full-model EWC study.
  • The compression result is currently a post-hoc routing/adapter-memory probe: it shows that some saved frames can be removed after training, but it does not yet implement an automatic online merge/prune algorithm.
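The format-augmented router mentioned in the domain-confusion item above could be trained on data built roughly like this. This is only a guess at the general idea: the templates below are hypothetical placeholders, the project's real rewrite templates are not shown in this README, and `augment_router_data` is an illustrative name.

```python
from typing import Dict, List, Tuple

# Hypothetical surface formats; the real rewrite templates are not listed here.
TEMPLATES: Dict[str, str] = {
    "imdb": "Movie review: {text}",
    "agnews": "News headline: {text}",
    "qqp": "Question pair: {text}",
}


def augment_router_data(examples: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Render every (text, task) example in every surface format, keeping its task label."""
    out = []
    for text, task in examples:
        for template in TEMPLATES.values():
            out.append((template.format(text=text), task))
    return out


data = augment_router_data([("a slow but rewarding film", "imdb")])
for text, label in data:
    print(label, "|", text)
```

Because the label stays fixed while the surface format varies, a router trained on such data has less reason to rely on formatting cues alone. Evaluating on templates that were held out of `TEMPLATES` would correspond to the held-out rewrite check described above.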

Files

  • scripts/colab_multiframe_router_test.py
    Main Colab/Kaggle experiment script.

  • scripts/colab_skillstack_test.py
    Smaller two-task prototype.

  • scripts/colab_skill_generalization_probe.py
    Tests whether saved frames help on mixed unseen examples and transfer-style probes.

  • scripts/colab_answer_quality_memory_probe.py
    Checks whether SkillStack-style memory context helps or hurts generated answers.

  • scripts/colab_generative_skillstack_test.py
    Trains CausalLM LoRA skill frames and tests generated-answer forgetting.

  • scripts/colab_domain_confusion_router_probe.py
    Stress-tests whether the router follows semantic skill identity or misleading input format.

  • scripts/summarize_results.py
    Summarizes result files and prints router confusion diagnostics.

  • docs/colab.md
    Copy-paste notebook instructions.

  • docs/prepublish_tests.md
    Full experiment log and extra test commands.

  • docs/project_pitch.md
    Short project pitch for GitHub, competitions, and outreach.

  • docs/testing_roadmap.md
    Practical next-test plan and tests to stop repeating.

  • docs/validation_plan.md
    Stronger validation plan for larger models, more seeds, and tougher baselines.

  • docs/hard12_results.md
    Recorded hard12 stress-test and multi-head showdown results.

  • docs/generative_skillstack_results.md
    First CausalLM LoRA SkillStack generated-answer forgetting result.

  • docs/domain_confusion_probe.md
    Domain-confusion router stress-test instructions.

  • docs/order_sensitivity_test.md
    Forward-vs-reverse task-order test for parent-chain routing stability.

  • docs/frame_redundancy_test.md
    Sentiment-frame ablation test for redundancy and substitution behavior.

  • docs/frame_compression_test.md
    Mixed-skill compression and global compression-boundary results.

  • PUBLISHING.md
    Exact upload checklist and public-claim wording.

  • paper.pdf
    Current compiled technical report for quick reading.

  • paper/
    Overleaf-ready technical report draft and future-work notes.

  • results/
    Hand-recorded JSON summaries from the runs so far.

Quick Run

Install dependencies in Colab or Kaggle:

!pip install -q -U "transformers>=4.43.0" datasets accelerate peft bitsandbytes scikit-learn

Write the main script:

%%writefile colab_multiframe_router_test.py

Paste scripts/colab_multiframe_router_test.py, then run:

!CUDA_VISIBLE_DEVICES=0 HF_TOKEN=$HF_TOKEN python colab_multiframe_router_test.py \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --preset hard8 \
  --device cuda \
  --samples-per-class 48 \
  --router-samples-per-class 256 \
  --eval-samples-per-class 64 \
  --max-length 160 \
  --epochs-per-task 3 \
  --lr 2e-4 \
  --router hybrid_chain \
  --chain-threshold 0.0 \
  --reset-head-between-frames \
  --seed 42 \
  --output-dir skillstack_new_hard8_hybrid_chain_stable_seed42

For Kaggle, store your Hugging Face token as a secret named HF_TOKEN. If using meta-llama/Llama-3.2-1B-Instruct, your Hugging Face account must have access to the gated Meta model.

Current Best Defaults

--preset hard8
--samples-per-class 48
--router-samples-per-class 256
--eval-samples-per-class 64
--max-length 160
--epochs-per-task 3
--lr 2e-4
--router hybrid_chain
--chain-threshold 0.0
--reset-head-between-frames

--router-samples-per-class 512 gave only a tiny improvement over 256, so 256 is a reasonable default.
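To repeat the run over several seeds with these defaults, a small helper can assemble the command lines. This is a convenience sketch, not part of the repository: `DEFAULTS` and `build_command` are hypothetical names, the flags are copied from the Quick Run command, and the output-directory pattern simply follows the seed-42 example.

```python
import shlex
from typing import Dict, List, Optional

# Flags copied from the defaults above; None marks a boolean flag with no value.
DEFAULTS: Dict[str, Optional[str]] = {
    "--preset": "hard8",
    "--samples-per-class": "48",
    "--router-samples-per-class": "256",
    "--eval-samples-per-class": "64",
    "--max-length": "160",
    "--epochs-per-task": "3",
    "--lr": "2e-4",
    "--router": "hybrid_chain",
    "--chain-threshold": "0.0",
    "--reset-head-between-frames": None,
}


def build_command(model: str, seed: int) -> List[str]:
    """Assemble the argument list for one run of the main experiment script."""
    cmd = ["python", "colab_multiframe_router_test.py",
           "--model", model, "--device", "cuda"]
    for flag, value in DEFAULTS.items():
        cmd.append(flag)
        if value is not None:
            cmd.append(value)
    cmd += ["--seed", str(seed),
            "--output-dir", f"skillstack_new_hard8_hybrid_chain_stable_seed{seed}"]
    return cmd


for seed in (42, 123):
    print(shlex.join(build_command("meta-llama/Llama-3.2-1B-Instruct", seed)))
```

Each printed line can be pasted into a notebook cell with a leading `!`, or the list can be passed to `subprocess.run` directly.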

Next Steps

  1. Update the technical report with hard12, router stress, generative, and frame-compression results.
  2. Repeat the mixed8 compression probe on another model size.
  3. Add stronger replay/task-conditioned baselines.
  4. Package a clean demo that loads saved frames and routes examples.

See docs/testing_roadmap.md for the current testing priority list. See docs/validation_plan.md for stronger tests before competitions or serious outreach.

Status

Research prototype. Good enough for a GitHub project and further experiments; not yet a production method or a broad scientific claim.
