Attn residuals+m hc #6
| Layer / File(s) | Summary |
|---|---|
| **Data Shape / Defaults**<br>`src/exp/configuration.py`, `src/configuration_bibo.py` | Introduce `EXPERIMENTAL_DEFAULTS`, `EXPERIMENTAL_CONFIG_KEYS`; add `exp` handling and legacy kwargs migration (`pop_legacy_experimental_kwargs`); add `apply_experimental_config` and `validate_experimental_config`; `BiBoConfig` gains `exp` param, new flags (`norm_topk_prob`, `output_router_logits`), and `to_dict()` that nests experimental keys under `"exp"`. |
| **Core Implementation: Residual Primitives**<br>`src/exp/residual.py` | Add `BiBoResidualGate`, `BiBoCausalResidualConv`, and `BiBoMultiStreamResidual` with gating/read/write APIs, causal convolution and dynamic-weight modes, stream read/write semantics, and per-component stats methods. |
| **Core Implementation: Attention Variants Surface**<br>`src/exp/attn/__init__.py`, `src/modeling/attn/base.py`, `src/modeling/attn/standard.py` | Create `src.exp.attn` public surface (re-export `recurrent`, `sliding`, `ssmax`); update `modeling.attn` imports to reference experimental attention helpers via absolute paths. |
| **Wiring: Layer Integration**<br>`src/modeling/layers.py` | `BiBoDecoderLayer` now constructs `attn_residual_gate`, `mlp_residual_gate`, and `residual_mixer`; forward signature adds `residual_history`; attention and MLP outputs are gated and mixed with residual history; added `residual_gate_stats()` and `residual_mixer_stats()`. |
| **Wiring: Model Integration**<br>`src/modeling/models.py` | `BiBoModel` adds `residual_stream_mixers` ModuleList, `_init_residual_streams()`, `residual_history` lifecycle in `forward()`, per-layer read/write via mixers, final read-back into hidden states, and aggregate stats accessors; `BiBoForCausalLM` now inherits `GenerationMixin`. |
| **Public API / Packaging**<br>`src/exp/__init__.py`, `src/modeling_bibo.py`, `src/modeling/attn/__init__.py` | Document and expose `src.exp`; relocate three attention helper imports to `src.exp.attn`; adjust `modeling.attn` package exports. |
| **Embed API change**<br>`src/modeling/embed.py` | `BiBoRotaryEmbedding.forward` signature becomes `position_ids=None, seq_len=None`; `apply_rotary_pos_emb` handles 2D cos/sin by promoting to batch dim; add `_rotate_half` alias to exports. |
| **Docs / Tests / Examples**<br>`README.md`, `docs/configuration_guide.md`, `tests/*` | README and configuration guide add `exp` examples, presets, and experimental docs; tests updated to pass `exp` dict to `BiBoConfig`, imports changed to `src.exp.attn`, and many new tests added for residual gates, causal conv, multi-stream residuals, and serialization under `exp`. |
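
The `exp` handling summarized in the first table row can be pictured with a small, framework-free sketch. The key names `use_residual_gate` and `residual_conv_kernel` are hypothetical, as is the `build_exp` helper; the real `EXPERIMENTAL_DEFAULTS` and function bodies in `src/exp/configuration.py` may look quite different. The sketch only illustrates the described behavior: flat legacy kwargs are migrated, unknown keys are rejected, and serialization nests experimental keys under `"exp"`.

```python
# Hypothetical defaults; the real EXPERIMENTAL_DEFAULTS may contain other keys.
EXPERIMENTAL_DEFAULTS = {"use_residual_gate": False, "residual_conv_kernel": 0}
EXPERIMENTAL_CONFIG_KEYS = frozenset(EXPERIMENTAL_DEFAULTS)


def pop_legacy_experimental_kwargs(kwargs):
    """Remove flat experimental keys from kwargs and return them as a dict."""
    return {k: kwargs.pop(k) for k in list(kwargs) if k in EXPERIMENTAL_CONFIG_KEYS}


def build_exp(exp=None, **kwargs):
    """Merge defaults, legacy flat kwargs, and an explicit exp dict (highest priority)."""
    legacy = pop_legacy_experimental_kwargs(kwargs)
    merged = {**EXPERIMENTAL_DEFAULTS, **legacy, **(exp or {})}
    unknown = set(merged) - EXPERIMENTAL_CONFIG_KEYS
    if unknown:
        raise ValueError(f"Unknown experimental keys: {sorted(unknown)}")
    return merged, kwargs  # kwargs now holds only non-experimental settings


def to_dict(core, exp):
    """Serialize with experimental keys nested under "exp"."""
    return {**core, "exp": dict(exp)}


exp, core = build_exp(exp={"residual_conv_kernel": 3}, use_residual_gate=True, hidden_size=64)
serialized = to_dict(core, exp)
```

Keeping everything experimental under one nested key also makes it easy to strip `exp` entirely when a feature graduates into the main config.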
Sequence Diagram
sequenceDiagram
participant Client as Caller
participant Model as BiBoModel.forward()
participant Init as _init_residual_streams()
participant Layer as BiBoDecoderLayer.forward()
participant Gate as BiBoResidualGate
participant Mixer as BiBoCausalResidualConv
participant Stream as BiBoMultiStreamResidual
Client->>Model: invoke forward(hidden_states, ...)
Model->>Init: initialize residual_streams / residual_history
Init-->>Model: initial residual_history
loop per layer
Model->>Layer: call layer(hidden_states, residual_history)
Layer->>Gate: attn_residual_gate(gate_input)
Gate-->>Layer: gated attn output
Layer->>Gate: mlp_residual_gate(gate_input)
Gate-->>Layer: gated mlp output
Layer->>Mixer: residual_mixer(residual_history)
Mixer-->>Layer: mixed residual contribution
Layer-->>Model: updated hidden_states, layer stats
Model->>Stream: write(updated hidden_states)
Stream-->>Model: streams updated, write stats
Model->>Model: append/trim residual_history
end
Model->>Stream: read from final residual stream
Stream-->>Model: final residual influence applied
Model-->>Client: return final hidden_states + aggregated stats
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~65 minutes
Poem
🐰
I slipped an `exp` into the config den,
Gates and streams now mingle in the pen,
Mixers hum, convs recall the past,
Tests march in to prove they last,
Docs sing experimental—let the tinkering begin.
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.33%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title "Attn residuals+m hc" is vague and uses abbreviations that don't convey the main feature additions. While it hints at attention residuals and mHC, it doesn't clearly summarize the PR's primary changes. | Expand to a clearer title such as: "Add experimental residual gates, causal residual convolution, and multi-stream residuals" or "Introduce configurable residual-flow mechanisms for attention and MLP layers". |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
adb9eb8 to
885b515
Aadi, just a few things here and there. Can you move ssmax back to main? It was tested and found to be helpful, so we will continue with it. Also create a separate modeling/config for BiBo in the exp folder. In the src config/modeling we won't add even imports, so that when using it we don't need to check for the availability of exp. src should be fully isolated from exp; we will slowly push features from exp to src.
    if output_hidden_states:
        all_hidden_states += (hidden_states,)

    residual_history_arg = tuple(residual_history) if residual_history is not None else None
The residual is initialized as a tuple and then appended to? Use a list.
Check L292 and L309.
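
A minimal sketch of the list-based pattern suggested here, assuming the history is bounded by a hypothetical `max_len` and using strings as stand-ins for tensors. `update_history` is an illustrative helper, not a function from the PR. The history is mutated in place as a list, and a tuple snapshot is taken only at the point where a read-only view is handed to a layer.

```python
def update_history(history, hidden_states, max_len):
    """Append the newest layer output and keep only the last max_len entries."""
    history.append(hidden_states)
    if len(history) > max_len:
        del history[: len(history) - max_len]
    return history


residual_history = []  # a list, so append and trim happen in place
for layer_idx in range(5):
    hidden_states = f"h{layer_idx}"  # stand-in for a tensor
    # Hand the layer an immutable snapshot if it must not mutate the history.
    snapshot = tuple(residual_history)
    update_history(residual_history, hidden_states, max_len=3)
```

This avoids rebuilding a tuple on every append while still letting the layer receive an immutable argument.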
    hidden_states = self.mlp(hidden_states)
    hidden_states = residual + hidden_states
    hidden_states = residual + self.mlp_residual_gate(gate_input, hidden_states)
    hidden_states = self.residual_mixer(hidden_states, residual_history)
I am not with you on this one; I won't allow all the residuals to be lost.
At least add a config/switch mechanism here, where it will be
hidden_states = hidden_states + self.residual_mixer(hidden_states, residual_history) if True else self.residual_mixer(hidden_states, residual_history)
But shouldn't we allow the model to decide, like full autonomy?
Yeah, we can check that in trials, but first we need a switch for whether we want the model to decide for itself or we want to preserve the old residual.
Like other things, we can add a per-layer learnable param, which would be
hidden_states = hidden_states * self.residual_scaling + self.residual_mixer(hidden_states, residual_history) if True else self.residual_mixer(hidden_states, residual_history)
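
The three options discussed in this thread (mixer output only, old residual plus mixer output, and a scaled old residual plus mixer output) could sit behind one config switch. The sketch below is framework-free and uses plain lists of floats; `MixerResidualConfig`, its `mode` values, and `combine` are hypothetical names, not part of the PR. In the real layer the operands would be tensors and the scale would presumably be a per-layer learnable parameter such as the `residual_scaling` mentioned above.

```python
from dataclasses import dataclass


@dataclass
class MixerResidualConfig:
    # Hypothetical switch: "replace", "add", or "scaled".
    mode: str = "add"
    init_scale: float = 1.0


def combine(hidden, mixed, cfg, scale=None):
    """Combine the pre-mixer hidden state with the mixer output, elementwise."""
    if cfg.mode == "replace":  # the model decides everything; old residual is dropped
        return list(mixed)
    if cfg.mode == "add":  # old residual is always preserved
        return [h + m for h, m in zip(hidden, mixed)]
    if cfg.mode == "scaled":  # per-layer scale; learnable in a real model
        s = cfg.init_scale if scale is None else scale
        return [s * h + m for h, m in zip(hidden, mixed)]
    raise ValueError(f"unknown mode: {cfg.mode!r}")
```

Defaulting to `"add"` keeps the old residual path intact, so the `"replace"` behavior has to be opted into for a trial run.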
    class BiBoCausalResidualConv(nn.Module):
Here, add some features that will prioritize the current hidden states slightly more than the previous hidden states.
Also, as the description says, the convolution is causal by default, since there are no future layers' states to cheat from.
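
One possible way to express the "slightly prioritize the current state" request is a bias on the current layer's mixing logit. The sketch below is framework-free and uses plain lists of floats; it assumes softmax weights over the depth axis, which may not be how `BiBoCausalResidualConv` actually computes its weights. The names `depth_mix_weights`, `causal_depth_mix`, and `current_bias` are hypothetical. Causality holds by construction, because the history can only contain states from earlier layers.

```python
import math


def depth_mix_weights(num_states, current_bias=1.0, logits=None):
    """Softmax weights over [oldest, ..., current]; the last entry is the current state."""
    logits = [0.0] * num_states if logits is None else list(logits)
    logits[-1] += current_bias  # tilt the mix toward the current hidden state
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_depth_mix(history, current, current_bias=1.0):
    """Mix earlier-layer states with the current one; no later layers exist to read."""
    states = list(history) + [current]
    weights = depth_mix_weights(len(states), current_bias)
    dim = len(current)
    return [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
```

With `current_bias=0.0` all depths are weighted equally, so the bias can be treated as an initialization choice and the logits left free to move during training.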
Add configurable residual routing experiments
Implemented a set of optional residual-flow mechanisms inspired by mHC and
attention residuals. All new paths are config-gated and default to baseline
Transformer behavior when disabled.
Changes:
- Added residual write gates around attention and MLP/MoE branch outputs.
- Added causal residual-depth convolution.
- Added mHC-style parallel residual streams.
  … block, then writes the layer update back into the streams with learned gates.
- Added residual-flow diagnostics.
  … mass, and stream read/write behavior.
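
To illustrate what "config-gated and default to baseline Transformer behavior" can mean for a residual write gate, here is a framework-free sketch using plain lists of floats. It assumes a scalar sigmoid gate, which is a simplification: the real `BiBoResidualGate` is called with a `gate_input` and the branch output, and its internals are not shown in this PR page. The function name and the `gate_logit` parameter are hypothetical.

```python
import math


def gated_residual_update(residual, branch_out, gate_logit=None):
    """Return residual + g * branch_out, with g = sigmoid(gate_logit).

    gate_logit=None stands for the disabled path and reproduces the plain
    residual connection of a baseline Transformer block.
    """
    g = 1.0 if gate_logit is None else 1.0 / (1.0 + math.exp(-gate_logit))
    return [r + g * b for r, b in zip(residual, branch_out)]
```

With the gate disabled the update is exactly `residual + branch_out`, so turning the feature off leaves the baseline computation unchanged.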
Summary by CodeRabbit
New Features
Documentation
Refactor
Tests