Attn residuals+m hc by adi-kmt · Pull Request #6 · IsNoobgrammer/BiBo

adi-kmt · 2026-05-06T16:12:29Z

Add configurable residual routing experiments

Implemented a set of optional residual-flow mechanisms inspired by mHC and
attention residuals. All new paths are config-gated and default to baseline
Transformer behavior when disabled.

Changes:

Added residual write gates around attention and MLP/MoE branch outputs.
- New config:
  - residual_gate_type: none | scalar | token | channel
  - residual_gate_init
- Applies:
  - hidden_states = residual + gate * branch_output
- Keeps the identity residual path intact.
- Supports per-layer scalar gates, per-token gates, and per-token/channel gates.
- Initializes gates near 1.0 to preserve baseline behavior.
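
The gate formula above can be illustrated with a minimal pure-Python sketch. The function name `gated_residual` and the list-based vectors are hypothetical and chosen for illustration; the PR itself implements gating as a PyTorch module, and its token-conditioned gate variants are not shown here.

```python
def gated_residual(residual, branch_output, gate):
    """Return residual + gate * branch_output, element-wise.

    `gate` may be a single float (scalar gate) or a list with one value
    per channel (channel gate). The residual term is never scaled, so the
    identity path stays intact.
    """
    if isinstance(gate, (int, float)):
        gate = [gate] * len(branch_output)
    return [r + g * b for r, g, b in zip(residual, gate, branch_output)]


residual = [1.0, 2.0, 3.0]
branch = [0.5, -1.0, 2.0]

baseline = gated_residual(residual, branch, 1.0)   # gate 1.0: plain residual + branch
closed = gated_residual(residual, branch, 0.0)     # gate 0.0: residual passes through
per_channel = gated_residual(residual, branch, [1.0, 0.0, 0.5])
```

With the gate at exactly 1.0 the result matches the ungated `residual + branch_output`, which is why initializing near 1.0 keeps the model close to baseline behavior.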
Added causal residual-depth convolution.
- New config:
  - residual_mixer_type: none | causal_conv | dynamic_causal_conv
  - residual_conv_kernel_size
  - residual_conv_init
- Mixes previous residual states plus the current layer output over model depth.
- Fully causal over depth: a layer only sees earlier layer states and its own output.
- Does not convolve over sequence tokens, so token-level causality is preserved.
- causal_conv uses a learned static depth kernel per layer.
- dynamic_causal_conv uses token-conditioned depth kernels from the current state.
- Initializes with most mass on the current layer output to stay close to normal residual flow.
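
A rough sketch of the static depth mixing described above, in pure Python. It assumes the depth kernel is a softmax over learned logits and that the initialization is a positive bias on the current-layer slot; both are assumptions made for illustration, as are the names `init_depth_logits`, `mix_over_depth`, and the value `current_bias=4.0`. The actual semantics of `residual_conv_init` are not shown in this description.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_depth_logits(kernel_size, current_bias=4.0):
    """Depth-kernel logits; the last slot belongs to the current layer output."""
    return [0.0] * (kernel_size - 1) + [current_bias]


def mix_over_depth(history, current, logits):
    """Blend up to len(logits) - 1 earlier layer states with the current one.

    Only states already produced by earlier layers are visible, so the mix
    is causal over depth. Each state is a flat vector for a single token;
    nothing is mixed across sequence positions.
    """
    window = history[-(len(logits) - 1):] if len(logits) > 1 else []
    states = window + [current]
    weights = softmax(logits[-len(states):])  # renormalize when history is short
    width = len(current)
    return [sum(w * s[i] for w, s in zip(weights, states)) for i in range(width)]


logits = init_depth_logits(kernel_size=3)
weights = softmax(logits)
first_layer = mix_over_depth([], [1.0, 2.0], logits)
mixed = mix_over_depth([[0.0, 0.0], [4.0, 4.0]], [1.0, 1.0], logits)
```

In this sketch the first layer has no history, so its output passes through unchanged, and later layers stay close to the current output because most of the softmax mass sits on the last slot.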
Added mHC-style parallel residual streams.
- New config:
  - residual_num_streams
  - residual_stream_gate_type: scalar | token
  - residual_stream_init: copy | zero
  - residual_stream_read_init
  - residual_stream_write_init
- residual_num_streams=1 disables the feature.
- Each layer reads a gated mixture of residual streams, runs the normal decoder
  block, then writes the layer update back into the streams with learned gates.
- Supports scalar stream gates or per-token stream gates.
- Keeps attention and MoE internals unchanged.
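
The read/write cycle described above might look roughly like the following pure-Python sketch with scalar stream gates. The helper names (`init_streams`, `read_streams`, `write_streams`) are hypothetical, only the `copy` initialization is shown, and the decoder block is replaced by a stand-in update.

```python
def init_streams(hidden, num_streams):
    """'copy' initialization: every stream starts as a copy of the input."""
    return [list(hidden) for _ in range(num_streams)]


def read_streams(streams, read_gates):
    """Layer input: gated sum of the residual streams."""
    width = len(streams[0])
    return [sum(g * s[i] for g, s in zip(read_gates, streams)) for i in range(width)]


def write_streams(streams, update, write_gates):
    """Add the layer update to each stream, scaled by that stream's write gate."""
    return [[x + g * u for x, u in zip(s, update)] for s, g in zip(streams, write_gates)]


streams = init_streams([1.0, 2.0], num_streams=2)
layer_input = read_streams(streams, read_gates=[0.5, 0.5])
update = [0.1 * x for x in layer_input]  # stand-in for the decoder block's update
streams = write_streams(streams, update, write_gates=[1.0, 0.0])
```

Here the second stream has a closed write gate, so it keeps its original content while the first stream accumulates the update.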
Added residual-flow diagnostics.
- model.model.residual_gate_stats()
- model.model.residual_mixer_stats()
- model.model.residual_stream_stats()
- Tracks gate means, open/closed fractions, residual depth current/previous
  mass, and stream read/write behavior.
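
For a sense of what "gate means" and "open/closed fractions" could measure, here is a small hypothetical sketch. The function `gate_stats` and the thresholds 0.9 and 0.1 are assumptions made for illustration; the return format of the actual `residual_gate_stats()` is not shown in this description.

```python
def gate_stats(gates, open_threshold=0.9, closed_threshold=0.1):
    """Summarize a flat list of gate values in [0, 1]."""
    n = len(gates)
    return {
        "mean": sum(gates) / n,
        "open_fraction": sum(g >= open_threshold for g in gates) / n,
        "closed_fraction": sum(g <= closed_threshold for g in gates) / n,
    }


stats = gate_stats([1.0, 0.95, 0.5, 0.05])
```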

Summary by CodeRabbit

New Features
- Experimental residual gating, causal residual convolution, and multi‑stream residuals added
- New experimental attention variants and many exp.* toggles for advanced tuning
Documentation
- README and configuration guide expanded with an Experimental package, examples, and presets (Conservative, Balanced, Aggressive, Long Context)
- Quick Start updated to surface experimental config under exp
Refactor
- Experimental attention utilities reorganized into a dedicated experimental namespace
Tests
- Extensive tests added to cover experimental residuals, attention variants, and generation behavior

coderabbitai · 2026-05-06T16:15:54Z

Caution

Review failed

Failed to post review comments

Note

`.coderabbit.yaml` has unrecognized properties

CodeRabbit is using all valid settings from your configuration. Unrecognized properties (listed below) have been ignored and may indicate typos or deprecated fields that can be removed.

⚠️ Parsing warnings (1)

Validation error: Unrecognized key(s) in object: 'tools', 'review_guidelines'

⚙️ Configuration instructions

Please see the configuration documentation for more information.
You can also validate your configuration using the online YAML validator.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Walkthrough

Adds an experimental feature framework under src/exp/, moves several attention helpers into that namespace, extends BiBoConfig with an exp dict plus migration/validation, and integrates gated residuals, causal residual mixers, and multi-stream residual streaming into layers, the model, docs, and tests.

Changes

Experimental Features Framework & Model Integration

Layer / File(s)	Summary
Data Shape / Defaults `src/exp/configuration.py`, `src/configuration_bibo.py`	Introduce `EXPERIMENTAL_DEFAULTS`, `EXPERIMENTAL_CONFIG_KEYS`; add `exp` handling and legacy kwargs migration (`pop_legacy_experimental_kwargs`); add `apply_experimental_config` and `validate_experimental_config`; BiBoConfig gains `exp` param, new flags (`norm_topk_prob`, `output_router_logits`), and `to_dict()` that nests experimental keys under `"exp"`.
Core Implementation — Residual Primitives `src/exp/residual.py`	Add `BiBoResidualGate`, `BiBoCausalResidualConv`, and `BiBoMultiStreamResidual` with gating/read/write APIs, causal convolution and dynamic-weight modes, stream read/write semantics, and per-component stats methods.
Core Implementation — Attention Variants Surface `src/exp/attn/__init__.py`, `src/modeling/attn/base.py`, `src/modeling/attn/standard.py`	Create `src.exp.attn` public surface (re-export `recurrent`, `sliding`, `ssmax`); update `modeling.attn` imports to reference experimental attention helpers via absolute paths.
Wiring — Layer Integration `src/modeling/layers.py`	`BiBoDecoderLayer` now constructs `attn_residual_gate`, `mlp_residual_gate`, and `residual_mixer`; forward signature adds `residual_history`; attention and MLP outputs are gated and mixed with residual history; added `residual_gate_stats()` and `residual_mixer_stats()`.
Wiring — Model Integration `src/modeling/models.py`	`BiBoModel` adds `residual_stream_mixers` ModuleList, `_init_residual_streams()`, residual_history lifecycle in `forward()`, per-layer read/write via mixers, final read-back into hidden states, and aggregate stats accessors; `BiBoForCausalLM` now inherits `GenerationMixin`.
Public API / Packaging `src/exp/__init__.py`, `src/modeling_bibo.py`, `src/modeling/attn/__init__.py`	Document and expose `src.exp`; relocate three attention helper imports to `src.exp.attn`; adjust `modeling.attn` package exports.
Embed API change `src/modeling/embed.py`	`BiBoRotaryEmbedding.forward` signature becomes `position_ids=None, seq_len=None`; `apply_rotary_pos_emb` handles 2D cos/sin by promoting to batch dim; add `_rotate_half` alias to exports.
Docs / Tests / Examples `README.md`, `docs/configuration_guide.md`, `tests/*`	README and configuration guide add `exp` examples, presets, and experimental docs; tests updated to pass `exp` dict to BiBoConfig, imports changed to `src.exp.attn`, and many new tests added for residual gates, causal conv, multi-stream residuals, and serialization under `exp`.

Sequence Diagram

sequenceDiagram
    participant Client as Caller
    participant Model as BiBoModel.forward()
    participant Init as _init_residual_streams()
    participant Layer as BiBoDecoderLayer.forward()
    participant Gate as BiBoResidualGate
    participant Mixer as BiBoCausalResidualConv
    participant Stream as BiBoMultiStreamResidual

    Client->>Model: invoke forward(hidden_states, ...)
    Model->>Init: initialize residual_streams / residual_history
    Init-->>Model: initial residual_history

    loop per layer
      Model->>Layer: call layer(hidden_states, residual_history)
      Layer->>Gate: attn_residual_gate(gate_input)
      Gate-->>Layer: gated attn output
      Layer->>Gate: mlp_residual_gate(gate_input)
      Gate-->>Layer: gated mlp output
      Layer->>Mixer: residual_mixer(residual_history)
      Mixer-->>Layer: mixed residual contribution
      Layer-->>Model: updated hidden_states, layer stats
      Model->>Stream: write(updated hidden_states)
      Stream-->>Model: streams updated, write stats
      Model->>Model: append/trim residual_history
    end

    Model->>Stream: read from final residual stream
    Stream-->>Model: final residual influence applied
    Model-->>Client: return final hidden_states + aggregated stats

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Poem

🐰

I slipped an exp into the config den,
Gates and streams now mingle in the pen,
Mixers hum, convs recall the past,
Tests march in to prove they last,
Docs sing experimental—let the tinkering begin.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 53.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check	❓ Inconclusive	The title "Attn residuals+m hc" is vague and uses abbreviations that don't convey the main feature additions. While it hints at attention residuals and mHC, it doesn't clearly summarize the PR's primary changes.	Expand to a clearer title such as: "Add experimental residual gates, causal residual convolution, and multi-stream residuals" or "Introduce configurable residual-flow mechanisms for attention and MLP layers".

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


IsNoobgrammer · 2026-05-06T18:06:26Z

Aadi, just a few things here and there.

Can you move ssmax back to main? It was tested and found to be helpful, so we will continue with it.
The rest of the attention variants (recurrent, sliding, etc.) will stay in exp until ablations are done.

Also, create a separate modeling/config for BiBo in the exp folder. In the src config/modeling we won't even add imports, so that when using it we don't need to check for the availability of exp. src should be fully isolated from exp; we will slowly push features from exp to src.

IsNoobgrammer · 2026-05-06T18:18:07Z

            if output_hidden_states:
                all_hidden_states += (hidden_states,)

+            residual_history_arg = tuple(residual_history) if residual_history is not None else None


The residual history is initialized as a tuple and then appended to? Use a list.
Check L292 and L309.
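
A minimal sketch of the list-based approach suggested here, assuming the history is appended to and trimmed once per layer. The function `run_layers`, the `max_history` parameter, and the toy layer are hypothetical and not taken from the PR.

```python
def run_layers(layers, hidden, max_history):
    """Run layers in order while keeping a bounded history of their outputs."""
    history = []  # a mutable list: append and trim in place, no tuple rebuilds
    for layer in layers:
        hidden = layer(hidden, history)
        history.append(hidden)
        if len(history) > max_history:
            del history[0]  # drop the oldest state
    return hidden, history


# Toy layer: adds the number of visible earlier states to the hidden value.
toy_layer = lambda hidden, history: hidden + len(history)
final_hidden, final_history = run_layers([toy_layer] * 4, 0, max_history=2)
```

If layers must not mutate the history, one option is to pass a read-only view or copy only at the call boundary rather than storing the history as a tuple.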

IsNoobgrammer · 2026-05-06T18:24:01Z

        hidden_states = self.mlp(hidden_states)
-        hidden_states = residual + hidden_states
+        hidden_states = residual + self.mlp_residual_gate(gate_input, hidden_states)
+        hidden_states = self.residual_mixer(hidden_states, residual_history)


I am not with you on this one; I won't allow all the residuals to be lost.

At least add a config/switch mechanism here, where it will be:

hidden_states = hidden_states + self.residual_mixer(hidden_states, residual_history) if True else self.residual_mixer(hidden_states, residual_history)

adi-kmt · 2026-05-14

But shouldn't we allow the model to decide, i.e. give it full autonomy?

IsNoobgrammer · 2026-05-14

Yeah, we can check that in trials, but first we need a switch for whether we want the model to decide for itself or we want to preserve the old residual.

As with other things, we can use a per-layer learnable param, which would be:

hidden_states = hidden_states * self.residual_scaling + self.residual_mixer(hidden_states, residual_history) if True else self.residual_mixer(hidden_states, residual_history)
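
The switch discussed in this thread could be sketched as follows in pure Python. The names `combine`, `preserve_residual`, and the default `residual_scaling=1.0` are hypothetical; in the real layer the scaling would be a learnable per-layer parameter and the inputs would be tensors.

```python
def combine(hidden, mixed, preserve_residual=True, residual_scaling=1.0):
    """Combine the layer output with the residual mixer's output.

    preserve_residual=True keeps the (optionally scaled) hidden states and
    adds the mixer output on top; False lets the mixer output replace them.
    """
    if preserve_residual:
        return [residual_scaling * h + m for h, m in zip(hidden, mixed)]
    return list(mixed)


hidden = [1.0, 2.0]
mixed = [0.5, 0.5]

kept = combine(hidden, mixed)                           # old residual preserved
replaced = combine(hidden, mixed, preserve_residual=False)
scaled = combine(hidden, mixed, residual_scaling=0.5)   # scaled residual
```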

IsNoobgrammer · 2026-05-06T18:31:02Z

+        }
+
+
+class BiBoCausalResidualConv(nn.Module):


Here, add some feature that prioritizes the current hidden states a little more than the previous hidden states.
Also, as the description says, the convolution is causal by default, since there are no future layer states to cheat from.

adi-kmt added 8 commits May 6, 2026 21:51

Add attn residuals with mHC (but use conv layer), then mHC with gating

9c2d40b

fix bugs

6cbb68f

Move experimental attention and residual features

556208e

Fix residual experiment edge cases

0e015d9

Respect canonical exp config precedence

6bb5d8a

Preserve HF generation support

5037aee

Fix cached decode position handling

642d3d3

Fix rebase compatibility issues

885b515

adi-kmt force-pushed the attn_residuals+mHC branch from adb9eb8 to 885b515 · May 6, 2026 16:25

IsNoobgrammer requested changes May 6, 2026

adi-kmt wants to merge 8 commits into IsNoobgrammer:main from adi-kmt:attn_residuals+mHC
