Russian Version (Русская версия)
> [!IMPORTANT]
> **Key Idea:** This method provides a definitive solution to Catastrophic Forgetting. The architecture guarantees that training on new tasks mathematically cannot degrade neural pathways responsible for old knowledge. This makes it an ideal foundation for Continual Learning.
GraphAI is a Go-based implementation of the Dynamic Topology Graph - Masked Attention (DTG-MA) layer, integrated with real-world Large Language Models (LLMs) via the Cybertron library. This project demonstrates a Continual Learning architecture designed to adapt to new tasks without catastrophic forgetting.
Key architectural features:
- **Dynamic Topology:** The graph structure expands dynamically as new tasks are introduced (`AddEdge`).
- **Topology-Aware Attention:** Implements the masked attention formula $Softmax(\frac{QK^T}{\sqrt{d}} + M_{task})V$, where $M_{task}$ applies strict $-\infty$ masking to enforce task-specific topology.
- **Zero-Forgetting:** Old-task parameters are explicitly frozen (the `Frozen` flag in `DTGEdge`) during new-task training.
- **Real LLM Integration:** Uses state-of-the-art pre-trained embeddings (e.g., BERT, MiniLM) as the input foundation.
- **True DTG-MA Logic:**
  - **Edge Metadata:** Every weight matrix is wrapped in a `DTGEdge` struct tracking its Task ID and frozen state.
  - **Strict Masking:** Uses `-Inf` masking to rigorously block attention pathways, preventing interference between tasks.
  - **Task Management:** Explicit `TaskID`-based routing in the `Forward` pass.
- **Pure Go Ecosystem:**
  - Built on `Gorgonia` for computation graphs.
  - Integration with generic Go tensors.
  - No Python dependencies for the core logic.
Three main classes of solutions exist in Continual Learning, but each has significant drawbacks:
- **Elastic Weight Consolidation (EWC)**
  - Method: Uses the Fisher Information Matrix to identify and penalize changes to "important" weights.
  - Drawback: Computationally expensive (calculating the Fisher Matrix) and only provides a "soft" constraint (forgetting can still happen).
- **Learning without Forgetting (LwF)**
  - Method: Uses Knowledge Distillation, where the old model teaches the new one.
  - Drawback: Requires maintaining the old model and running inference on it during training, doubling the compute load.
- **Parameter Isolation**
  - Method: Assigns separate sub-networks or adapters for each task.
  - Drawback: Often leads to linear parameter growth without knowledge reuse.
GraphAI solves these issues by combining Dynamic Topology with Masked Attention:
- **Efficient (vs EWC):** No expensive Fisher Matrix calculations. Knowledge protection is architectural (`-Inf` mask + freezing), with near-zero overhead.
- **Fast (vs LwF):** No need for Knowledge Distillation or keeping old models in memory.
- **Guaranteed Isolation:** Unlike soft constraints, the Masked Attention mechanism mathematically guarantees Zero-Forgetting.
- **Flexible:** The graph structure allows for potential knowledge reuse (unlike strict isolation) while maintaining separation where needed.
- Go 1.25+ (or set the environment variable `ASSUME_NO_MOVING_GC_UNSAFE_RISK_IT_WITH=go1.25` for Gorgonia compatibility).
- Initialize the module:

  ```shell
  go mod init graphai
  go mod tidy
  ```

- Ensure dependencies are downloaded:

  ```shell
  go get gorgonia.org/gorgonia
  go get github.com/nlpodyssey/cybertron
  ```
Runs the training loop using sentence-transformers/all-MiniLM-L6-v2 (fast, small model).
```shell
export ASSUME_NO_MOVING_GC_UNSAFE_RISK_IT_WITH=go1.25
go run main.go layer.go real_llm.go head.go
```

This demo runs two modes back-to-back:
- Baseline (naive fine-tune): sequential training without freezing (and with a single shared classifier head).
- DTG-MA: freezes old edges + uses task-scoped routing (Task0 predictions do not change after training Task1).
Run (defaults to MiniLM embeddings):
```shell
export ASSUME_NO_MOVING_GC_UNSAFE_RISK_IT_WITH=go1.25
go run -tags continual_real_demo continual_real_demo.go layer.go head.go real_llm.go
```

Optional configuration:
```shell
# reproducibility
export DTG_RUNS=3
export DTG_SEED=42

# embedding model selection
# export DTG_MODEL=bert-base-uncased

# training hyperparameters
# export DTG_EPOCHS0=500
# export DTG_EPOCHS1=450
# export DTG_LR=0.0015
```

Runs the training loop using bert-base-uncased (standard 768-dim model) with a larger dataset.
```shell
export ASSUME_NO_MOVING_GC_UNSAFE_RISK_IT_WITH=go1.25
go run run_large.go layer.go real_llm.go head.go
```

Note: the first run will download the model weights (~440 MB).
```shell
export ASSUME_NO_MOVING_GC_UNSAFE_RISK_IT_WITH=go1.25
export DTG_MODEL=bert-base-uncased
export DTG_RUNS=1
export DTG_SEED=42
export DTG_EPOCHS0=500
export DTG_EPOCHS1=450
export DTG_LR=0.001
go run -tags continual_real_demo continual_real_demo.go layer.go head.go real_llm.go
```

Contains the core `HybridGraphLayer` implementation.
- `DTGEdge`: Struct representing a learnable connection (`Weight`, `TaskID`, `Frozen`).
- `Forward(input, taskID)`: Computes masked attention. It selects the mask corresponding to `taskID` and applies it additively to the scaled dot-product scores before Softmax.
- `FreezeOldTasks(currentTaskID)`: Iterates through edges and sets `Frozen=true` for any edge belonging to previous tasks.
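As a minimal sketch of how the edge bookkeeping fits together (illustrative `float64` fields standing in for Gorgonia nodes; these are not the repository's actual definitions):

```go
package main

import "fmt"

// DTGEdge sketches the edge metadata described above: a learnable
// weight matrix tagged with the task that created it and a frozen flag.
// (Field types are illustrative; the real struct wraps Gorgonia nodes.)
type DTGEdge struct {
	Weight [][]float64
	TaskID int
	Frozen bool
}

// HybridGraphLayer holds the dynamically growing set of edges.
type HybridGraphLayer struct {
	Edges []*DTGEdge
}

// FreezeOldTasks marks every edge belonging to a task older than
// currentTaskID as frozen, so subsequent training steps skip those
// weights entirely.
func (l *HybridGraphLayer) FreezeOldTasks(currentTaskID int) {
	for _, e := range l.Edges {
		if e.TaskID < currentTaskID {
			e.Frozen = true
		}
	}
}

func main() {
	layer := &HybridGraphLayer{Edges: []*DTGEdge{
		{TaskID: 0}, {TaskID: 0}, {TaskID: 1},
	}}
	// Before training Task 1, freeze everything from earlier tasks.
	layer.FreezeOldTasks(1)
	for _, e := range layer.Edges {
		fmt.Printf("task=%d frozen=%v\n", e.TaskID, e.Frozen)
	}
}
```

The freeze is a one-way architectural gate: once set, the optimizer never touches those weights again, which is the mechanical half of the Zero-Forgetting guarantee (the `-Inf` mask is the other half).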
Additional helper used by the continual demos:
- `ForwardTaskScoped(input, taskID)`: Computes projections using only edges belonging to `taskID` (useful for strict task isolation in demos).
A wrapper around the Cybertron library.
- `NewRealLLM(modelName)`: Loads a specific HuggingFace model.
- `GetEmbeddings(texts)`: Converts a slice of strings into a `trainRaw` tensor suitable for the graph.
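A sketch of the call shape this wrapper exposes. The stub types below only mirror the surface described above so the example is self-contained and runnable; the real wrapper runs Cybertron model inference, and the embedding dimension depends on the chosen model (e.g., 384 for MiniLM, 768 for BERT).

```go
package main

import "fmt"

// RealLLM is a stand-in for the Cybertron-backed wrapper described
// above; only the call shape is real, the behavior is stubbed.
type RealLLM struct {
	model string
}

// NewRealLLM mirrors the wrapper's constructor: it takes a HuggingFace
// model name. (The real version downloads and loads the model.)
func NewRealLLM(modelName string) (*RealLLM, error) {
	return &RealLLM{model: modelName}, nil
}

// GetEmbeddings returns one embedding vector per input string.
// Stubbed here as zero vectors of dimension 3; the real model returns
// dense sentence embeddings.
func (m *RealLLM) GetEmbeddings(texts []string) ([][]float64, error) {
	out := make([][]float64, len(texts))
	for i := range texts {
		out[i] = make([]float64, 3)
	}
	return out, nil
}

func main() {
	llm, err := NewRealLLM("sentence-transformers/all-MiniLM-L6-v2")
	if err != nil {
		panic(err)
	}
	emb, err := llm.GetEmbeddings([]string{"hello", "world"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(emb), "embeddings of dim", len(emb[0]))
}
```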
To fine-tune the training process, modify the solver configuration in `main.go` or `run_large.go`:
```go
// Adjust Learning Rate for convergence
solver := gorgonia.NewAdamSolver(gorgonia.WithLearnRate(0.001))
```

To switch models, pass the model name to `NewRealLLM`:
```go
llm, err := NewRealLLM("bert-large-uncased")
```

MIT License.