Attn residuals+m hc #6

Open
adi-kmt wants to merge 8 commits into IsNoobgrammer:main from adi-kmt:attn_residuals+mHC
@adi-kmt (Contributor) commented May 6, 2026

Add configurable residual routing experiments

Implemented a set of optional residual-flow mechanisms inspired by mHC and
attention residuals. All new paths are config-gated and default to baseline
Transformer behavior when disabled.

Changes:

  • Added residual write gates around attention and MLP/MoE branch outputs.

    • New config:
      • residual_gate_type: none | scalar | token | channel
      • residual_gate_init
    • Applies:
      • hidden_states = residual + gate * branch_output
    • Keeps the identity residual path intact.
    • Supports per-layer scalar gates, per-token gates, and per-token/channel gates.
    • Initializes gates near 1.0 to preserve baseline behavior.
  • Added causal residual-depth convolution.

    • New config:
      • residual_mixer_type: none | causal_conv | dynamic_causal_conv
      • residual_conv_kernel_size
      • residual_conv_init
    • Mixes previous residual states plus the current layer output over model depth.
    • Fully causal over depth: a layer only sees earlier layer states and its own output.
    • Does not convolve over sequence tokens, so token-level causality is preserved.
    • causal_conv uses a learned static depth kernel per layer.
    • dynamic_causal_conv uses token-conditioned depth kernels from the current state.
    • Initializes with most mass on the current layer output to stay close to normal residual flow.
  • Added mHC-style parallel residual streams.

    • New config:
      • residual_num_streams
      • residual_stream_gate_type: scalar | token
      • residual_stream_init: copy | zero
      • residual_stream_read_init
      • residual_stream_write_init
    • residual_num_streams=1 disables the feature.
    • Each layer reads a gated mixture of residual streams, runs the normal decoder
      block, then writes the layer update back into the streams with learned gates.
    • Supports scalar stream gates or per-token stream gates.
    • Keeps attention and MoE internals unchanged.
  • Added residual-flow diagnostics.

    • model.model.residual_gate_stats()
    • model.model.residual_mixer_stats()
    • model.model.residual_stream_stats()
    • Tracks gate means, open/closed fractions, residual depth current/previous
      mass, and stream read/write behavior.
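
A minimal sketch of the write-gate rule listed above (`hidden_states = residual + gate * branch_output`), using plain Python lists in place of tensors. The sigmoid parameterization, the `logit` helper, the value 0.99, and the open/closed thresholds in `gate_stats` are assumptions for illustration; the description does not say how `residual_gate_init` maps to the gate or how the diagnostics define "open" and "closed".

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def logit(p: float) -> float:
    # Inverse of sigmoid: turns a desired initial gate value into a raw parameter.
    return math.log(p / (1.0 - p))


def gated_residual(residual, branch_output, gate):
    # hidden_states = residual + gate * branch_output
    # Only the branch write is scaled; the identity path is left untouched.
    return [r + gate * b for r, b in zip(residual, branch_output)]


def gate_stats(gates, open_thresh=0.9, closed_thresh=0.1):
    # Hypothetical diagnostics: mean gate value and open/closed fractions.
    n = len(gates)
    return {
        "mean": sum(gates) / n,
        "open_frac": sum(g > open_thresh for g in gates) / n,
        "closed_frac": sum(g < closed_thresh for g in gates) / n,
    }


gate_init = 0.99                  # assumed value for residual_gate_init
gate = sigmoid(logit(gate_init))  # scalar gate: one value for the whole layer

residual = [1.0, -2.0, 0.5]
branch_output = [0.1, 0.2, -0.3]
hidden = gated_residual(residual, branch_output, gate)

# With the gate near 1.0 the result stays close to the ungated baseline.
baseline = [r + b for r, b in zip(residual, branch_output)]
max_gap = max(abs(h - b) for h, b in zip(hidden, baseline))
stats = gate_stats([gate, 0.5, 0.05])
```

A token or channel gate would follow the same rule with one gate value per token or per token/channel pair instead of a single scalar.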

Summary by CodeRabbit

  • New Features

    • Experimental residual gating, causal residual convolution, and multi‑stream residuals added
    • New experimental attention variants and many exp.* toggles for advanced tuning
  • Documentation

    • README and configuration guide expanded with an Experimental package, examples, and presets (Conservative, Balanced, Aggressive, Long Context)
    • Quick Start updated to surface experimental config under exp
  • Refactor

    • Experimental attention utilities reorganized into a dedicated experimental namespace
  • Tests

    • Extensive tests added to cover experimental residuals, attention variants, and generation behavior

@coderabbitai (Bot) commented May 6, 2026

Caution

Review failed

Failed to post review comments

Note

.coderabbit.yaml has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)
Validation error: Unrecognized key(s) in object: 'tools', 'review_guidelines'

Walkthrough

Adds an experimental feature framework under src/exp/, moves several attention helpers into that namespace, extends BiBoConfig with an exp dict plus migration/validation, and integrates gated residuals, causal residual mixers, and multi-stream residual streaming into layers, the model, docs, and tests.

Changes

Experimental Features Framework & Model Integration

| Layer / File(s) | Summary |
|---|---|
| Data Shape / Defaults: src/exp/configuration.py, src/configuration_bibo.py | Introduce EXPERIMENTAL_DEFAULTS, EXPERIMENTAL_CONFIG_KEYS; add exp handling and legacy kwargs migration (pop_legacy_experimental_kwargs); add apply_experimental_config and validate_experimental_config; BiBoConfig gains exp param, new flags (norm_topk_prob, output_router_logits), and to_dict() that nests experimental keys under "exp". |
| Core Implementation, Residual Primitives: src/exp/residual.py | Add BiBoResidualGate, BiBoCausalResidualConv, and BiBoMultiStreamResidual with gating/read/write APIs, causal convolution and dynamic-weight modes, stream read/write semantics, and per-component stats methods. |
| Core Implementation, Attention Variants Surface: src/exp/attn/__init__.py, src/modeling/attn/base.py, src/modeling/attn/standard.py | Create src.exp.attn public surface (re-export recurrent, sliding, ssmax); update modeling.attn imports to reference experimental attention helpers via absolute paths. |
| Wiring, Layer Integration: src/modeling/layers.py | BiBoDecoderLayer now constructs attn_residual_gate, mlp_residual_gate, and residual_mixer; forward signature adds residual_history; attention and MLP outputs are gated and mixed with residual history; added residual_gate_stats() and residual_mixer_stats(). |
| Wiring, Model Integration: src/modeling/models.py | BiBoModel adds residual_stream_mixers ModuleList, _init_residual_streams(), residual_history lifecycle in forward(), per-layer read/write via mixers, final read-back into hidden states, and aggregate stats accessors; BiBoForCausalLM now inherits GenerationMixin. |
| Public API / Packaging: src/exp/__init__.py, src/modeling_bibo.py, src/modeling/attn/__init__.py | Document and expose src.exp; relocate three attention helper imports to src.exp.attn; adjust modeling.attn package exports. |
| Embed API change: src/modeling/embed.py | BiBoRotaryEmbedding.forward signature becomes position_ids=None, seq_len=None; apply_rotary_pos_emb handles 2D cos/sin by promoting to batch dim; add _rotate_half alias to exports. |
| Docs / Tests / Examples: README.md, docs/configuration_guide.md, tests/* | README and configuration guide add exp examples, presets, and experimental docs; tests updated to pass exp dict to BiBoConfig, imports changed to src.exp.attn, and many new tests added for residual gates, causal conv, multi-stream residuals, and serialization under exp. |
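
The config handling summarized in the table (defaults, legacy kwargs migration, validation, nesting under "exp") could look roughly like the sketch below. Only the names `EXPERIMENTAL_DEFAULTS`, `pop_legacy_experimental_kwargs`, and `validate_experimental_config` come from the walkthrough; their bodies, the `ALLOWED_VALUES` table, the `build_exp_config` helper, and the rule that an explicit `exp` dict overrides legacy flat kwargs are assumptions, not the PR's actual code.

```python
# Disabled values taken from the PR description; everything else is illustrative.
EXPERIMENTAL_DEFAULTS = {
    "residual_gate_type": "none",
    "residual_mixer_type": "none",
    "residual_num_streams": 1,  # 1 disables multi-stream residuals
}

ALLOWED_VALUES = {  # hypothetical helper table
    "residual_gate_type": {"none", "scalar", "token", "channel"},
    "residual_mixer_type": {"none", "causal_conv", "dynamic_causal_conv"},
}


def pop_legacy_experimental_kwargs(kwargs):
    # Move flat, old-style experimental keys out of kwargs into their own dict.
    return {k: kwargs.pop(k) for k in list(kwargs) if k in EXPERIMENTAL_DEFAULTS}


def validate_experimental_config(cfg):
    unknown = set(cfg) - set(EXPERIMENTAL_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown experimental keys: {sorted(unknown)}")
    for key, allowed in ALLOWED_VALUES.items():
        if cfg[key] not in allowed:
            raise ValueError(f"{key}={cfg[key]!r} not in {sorted(allowed)}")
    if cfg["residual_num_streams"] < 1:
        raise ValueError("residual_num_streams must be >= 1")


def build_exp_config(exp=None, **kwargs):
    # Precedence (assumed): defaults < legacy flat kwargs < explicit exp dict.
    merged = dict(EXPERIMENTAL_DEFAULTS)
    merged.update(pop_legacy_experimental_kwargs(kwargs))
    merged.update(exp or {})
    validate_experimental_config(merged)
    return merged, kwargs  # kwargs now holds only non-experimental settings


exp_cfg, core_kwargs = build_exp_config(
    exp={"residual_gate_type": "scalar"},
    residual_num_streams=2,  # legacy flat key, migrated under exp
    hidden_size=64,          # ordinary setting, left untouched
)
serialized = {**core_kwargs, "exp": exp_cfg}  # mirrors nesting under "exp"
```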

Sequence Diagram

sequenceDiagram
    participant Client as Caller
    participant Model as BiBoModel.forward()
    participant Init as _init_residual_streams()
    participant Layer as BiBoDecoderLayer.forward()
    participant Gate as BiBoResidualGate
    participant Mixer as BiBoCausalResidualConv
    participant Stream as BiBoMultiStreamResidual

    Client->>Model: invoke forward(hidden_states, ...)
    Model->>Init: initialize residual_streams / residual_history
    Init-->>Model: initial residual_history

    loop per layer
      Model->>Layer: call layer(hidden_states, residual_history)
      Layer->>Gate: attn_residual_gate(gate_input)
      Gate-->>Layer: gated attn output
      Layer->>Gate: mlp_residual_gate(gate_input)
      Gate-->>Layer: gated mlp output
      Layer->>Mixer: residual_mixer(residual_history)
      Mixer-->>Layer: mixed residual contribution
      Layer-->>Model: updated hidden_states, layer stats
      Model->>Stream: write(updated hidden_states)
      Stream-->>Model: streams updated, write stats
      Model->>Model: append/trim residual_history
    end

    Model->>Stream: read from final residual stream
    Stream-->>Model: final residual influence applied
    Model-->>Client: return final hidden_states + aggregated stats
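
The stream read and write steps in the diagram can be sketched with plain Python lists. This is a toy under stated assumptions: `toy_block` stands in for the decoder block and returns the layer update directly, streams use the "copy" init, read gates are uniform (1 / num_streams), and write gates are 1.0. None of these init values are confirmed by the PR; they are chosen so that the multi-stream path reproduces the ordinary residual update.

```python
def read_streams(streams, read_gates):
    # Gated mixture of the residual streams, channel by channel.
    width = len(streams[0])
    return [sum(g * s[c] for g, s in zip(read_gates, streams)) for c in range(width)]


def write_streams(streams, write_gates, update):
    # Add the layer update into every stream, scaled by that stream's write gate.
    return [[v + g * u for v, u in zip(s, update)] for g, s in zip(write_gates, streams)]


def toy_block(x):
    # Stand-in for attention + MLP: returns the update to add to the residual.
    return [0.5 * v + 1.0 for v in x]


def run_multi_stream(embedding, num_layers, num_streams):
    streams = [list(embedding) for _ in range(num_streams)]  # "copy" init
    read_gates = [1.0 / num_streams] * num_streams           # assumed read init
    write_gates = [1.0] * num_streams                        # assumed write init
    for _ in range(num_layers):
        x = read_streams(streams, read_gates)
        streams = write_streams(streams, write_gates, toy_block(x))
    return read_streams(streams, read_gates)  # final read-back


def run_baseline(embedding, num_layers):
    h = list(embedding)
    for _ in range(num_layers):
        h = [a + b for a, b in zip(h, toy_block(h))]  # plain residual update
    return h


multi = run_multi_stream([1.0, -2.0], num_layers=3, num_streams=4)
base = run_baseline([1.0, -2.0], num_layers=3)
gap = max(abs(a - b) for a, b in zip(multi, base))
```

Once the gates move away from these values during training, the streams diverge and each layer reads a different mixture.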

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Poem

🐰

I slipped an exp into the config den,
Gates and streams now mingle in the pen,
Mixers hum, convs recall the past,
Tests march in to prove they last,
Docs sing experimental—let the tinkering begin.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title "Attn residuals+m hc" is vague and uses abbreviations that don't convey the main feature additions. While it hints at attention residuals and mHC, it doesn't clearly summarize the PR's primary changes. | Expand to a clearer title such as: "Add experimental residual gates, causal residual convolution, and multi-stream residuals" or "Introduce configurable residual-flow mechanisms for attention and MLP layers". |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@adi-kmt force-pushed the attn_residuals+mHC branch from adb9eb8 to 885b515 on May 6, 2026 16:25
@IsNoobgrammer (Owner)

Aadi, just a few things here and there.

Can you move ssmax back to main? It was tested and found to be helpful, so we will continue with it. The rest (recurrent, sliding, and the other attention variants) stays in exp until ablations are done.

Also, create a separate modeling/config for BiBo in the exp folder. In the src config/modeling we won't add even imports from exp, so that using src never requires checking whether exp is available. src should be fully isolated from exp; we will slowly push features from exp to src.

Comment thread src/modeling/models.py
if output_hidden_states:
all_hidden_states += (hidden_states,)

residual_history_arg = tuple(residual_history) if residual_history is not None else None
Owner

The residual history is initialized as a tuple and then appended to? Use a list.
Check L292 and L309.
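
One way to read this suggestion, as a sketch only: keep the history as a mutable list inside the model's forward loop, append and trim in place, and hand each layer an immutable tuple snapshot (as the quoted `tuple(residual_history)` line already does). The helper name `push_history` and the `max_len` parameter are hypothetical and do not appear in the PR.

```python
from typing import List, Optional, Tuple


def push_history(history: Optional[List[float]], state: float, max_len: int) -> None:
    # Append in place and drop the oldest entries beyond max_len.
    # A None history means the residual mixer is disabled.
    if history is None:
        return
    history.append(state)
    if len(history) > max_len:
        del history[: len(history) - max_len]


def snapshot(history: Optional[List[float]]) -> Optional[Tuple[float, ...]]:
    # Layers receive a read-only view so they cannot mutate the model's list.
    return tuple(history) if history is not None else None


residual_history: List[float] = []  # a list from the start, never a tuple
seen = []
for layer_output in [0.0, 1.0, 2.0, 3.0, 4.0]:  # stand-ins for hidden states
    seen.append(snapshot(residual_history))      # what the layer would be given
    push_history(residual_history, layer_output, max_len=3)
```

`collections.deque(maxlen=...)` would give the same trimming behavior without the explicit `del`.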

Comment thread src/modeling/layers.py
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
hidden_states = residual + self.mlp_residual_gate(gate_input, hidden_states)
hidden_states = self.residual_mixer(hidden_states, residual_history)
Owner

I am not with you on this one; I won't allow all the residuals to be lost.

At least add a config/switch mechanism here, where it would be:

hidden_states = hidden_states + self.residual_mixer(hidden_states, residual_history) if True else self.residual_mixer(hidden_states, residual_history)

Contributor Author

But shouldn't we allow the model to decide, with full autonomy?

Owner

Yeah, we can check that in trials, but first we need a switch for whether we want the model to decide for itself or we want to preserve the old residual.

Like other things, we can use a per-layer learnable parameter, which would be:

hidden_states = hidden_states * self.residual_scaling + self.residual_mixer(hidden_states, residual_history) if True else self.residual_mixer(hidden_states, residual_history)
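
The switch discussed in this thread could be expressed as an explicit mode rather than an inline `if True else`. The sketch below uses plain lists for the hidden state; the function name `apply_residual_mixer` and the mode names `"replace"`, `"add"`, and `"scaled_add"` are hypothetical, and `residual_scaling` stands for the per-layer learnable parameter mentioned above.

```python
def apply_residual_mixer(hidden, mixed, mode="add", residual_scaling=1.0):
    # hidden: layer output before mixing; mixed: output of the residual mixer.
    if mode == "replace":
        # The mixer output fully replaces the hidden state ("model decides").
        return list(mixed)
    if mode == "add":
        # The old residual is always preserved alongside the mixer output.
        return [h + m for h, m in zip(hidden, mixed)]
    if mode == "scaled_add":
        # Preserved residual, scaled by a per-layer parameter.
        return [residual_scaling * h + m for h, m in zip(hidden, mixed)]
    raise ValueError(f"unknown mode: {mode!r}")


hidden = [1.0, 2.0]
mixed = [0.5, -0.5]
replaced = apply_residual_mixer(hidden, mixed, mode="replace")
added = apply_residual_mixer(hidden, mixed, mode="add")
scaled = apply_residual_mixer(hidden, mixed, mode="scaled_add", residual_scaling=0.5)
```

With `residual_scaling` at 0.0 the scaled mode reduces to "replace", and at 1.0 it reduces to "add", so a single learnable scalar can cover both ends of the discussion.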

Comment thread src/exp/residual.py
}


class BiBoCausalResidualConv(nn.Module):
Owner

Here, add some features that prioritize the current hidden states slightly more than the previous hidden states.
Also, as the description says, the convolution is causal by default, since there are no future layer states to cheat from.
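
One possible way to favor the current state, sketched in plain Python: parameterize the depth kernel as a softmax over per-tap logits and give the tap for the current layer output a positive initial bias. The softmax parameterization, the helper names, and the bias value 4.0 are assumptions; the PR only states that the kernel starts with most mass on the current layer output.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_depth_logits(kernel_size, current_bias):
    # Taps are ordered oldest to newest; the last tap is the current layer output.
    return [0.0] * (kernel_size - 1) + [current_bias]


def depth_mix(history, current, logits):
    # Causal over depth: only earlier layer states plus the current output.
    prev_taps = len(logits) - 1
    prev = list(history)[-prev_taps:] if prev_taps > 0 else []
    states = prev + [current]
    # Early layers have fewer states, so use the trailing taps and renormalize.
    weights = softmax(logits[-len(states):])
    width = len(current)
    return [sum(w * s[c] for w, s in zip(weights, states)) for c in range(width)]


logits = init_depth_logits(kernel_size=3, current_bias=4.0)
weights = softmax(logits)

first_layer = depth_mix([], [1.0, 2.0], logits)  # no history yet
uniform = depth_mix([[0.0, 0.0], [2.0, 2.0]], [4.0, 4.0], [0.0, 0.0, 0.0])
biased = depth_mix([[0.0, 0.0], [2.0, 2.0]], [4.0, 4.0], logits)
```

Because the mix only runs over layer depth for a fixed token position, nothing here touches sequence-level causality.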
