
Add operation-based indexing pipeline#798

Open
Morriar wants to merge 9 commits into main from at-index-operations

Conversation

Contributor

@Morriar Morriar commented May 8, 2026

What

Introduces an intermediate representation (IR) between Ruby parsing and graph construction. Instead of the indexer writing directly to a LocalGraph, a new pipeline produces an ordered Vec<Operation> (an explicit list of what each file contributes to the graph), then applies those operations to build the graph.

Current pipeline:

graph LR
    AST --> RubyIndexer --> LocalGraph --> Graph --> Resolution

New pipeline:

graph LR
    AST --> RubyOperationBuilder --> Operations["Vec&lt;Operation&gt;"] --> OperationApplier --> LocalGraph --> Graph --> Resolution

(This is an intermediate step to ensure parity with the current indexer / local graph and reduce the scope of this PR.)

Target pipeline:

graph LR
    AST --> Operations["Vec&lt;Operation&gt;"] --> Graph["Graph (resolution applies operations directly)"]

Once we commit to this direction, we can remove the LocalGraph and feed the operations directly to the resolver to let it create both the definitions and the declarations.

Why

Four motivations that reinforce each other:

DSL plugin API

Plugins need to contribute definitions to the graph. Without operations, plugins mutate LocalGraph internals: fragile, tightly coupled, and bad for parallelism (Ruby plugins calling into the graph would require synchronization).

With operations, a plugin just produces Vec<Operation>, a pure data structure that can be built in parallel and applied later. Operations become the public contract between Rubydex and plugins.

class Post < ApplicationRecord
  has_many :comments
end

An ActiveRecord plugin sees has_many :comments and emits operations. It doesn't touch graph internals, it speaks the same operation protocol as the built-in indexer.

EnterMethod(comments)
ExitScope
EnterMethod(comments=)
ExitScope
EnterMethod(comment_ids)
ExitScope
EnterMethod(comment_ids=)
ExitScope
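As a sketch of that contract, a hypothetical `has_many` expansion might look like the following. Note this is illustrative only: `Operation` here is a simplified stand-in for the PR's real enum (which carries name IDs, offsets, and more), and `expand_has_many`/`singular` are invented helper names, not the plugin API.

```rust
// Hypothetical sketch: `Operation` is a simplified stand-in for the PR's
// real enum; `expand_has_many` and `singular` are invented names.

#[derive(Debug, PartialEq)]
enum Operation {
    EnterMethod { name: String },
    ExitScope,
}

// Naive singularization, just enough for this example.
fn singular(word: &str) -> &str {
    word.strip_suffix('s').unwrap_or(word)
}

// The plugin never touches graph internals: it only returns data.
fn expand_has_many(association: &str) -> Vec<Operation> {
    let stem = singular(association);
    let methods = [
        association.to_string(),   // comments
        format!("{association}="), // comments=
        format!("{stem}_ids"),     // comment_ids
        format!("{stem}_ids="),    // comment_ids=
    ];
    let mut ops = Vec::new();
    for name in methods {
        ops.push(Operation::EnterMethod { name });
        ops.push(Operation::ExitScope);
    }
    ops
}
```

Because the return value is plain data, many plugins can run in parallel and their `Vec<Operation>` results can be applied later without synchronization on the graph.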

Unloading

Each file's contribution is its operation list. Negating the operations (walk in reverse, opposite effect) cleanly unloads the file without graph diffing.

# user.rb produces:
# 1. EnterClass(User)
# 2.   EnterMethod(name)
# 3.   ExitScope
# 4.   EnterMethod(email)
# 5.   ExitScope
# 6. ExitScope

# To unload user.rb: negate in reverse order
# 6. (no-op)
# 5. (no-op)
# 4. RemoveMethod(email)
# 3. (no-op)
# 2. RemoveMethod(name)
# 1. RemoveClass(User)

Because DSL plugins produce the same IR primitives as the built-in indexer, we know how to undo their contributions for free: no plugin-specific cleanup logic, no tracking which definitions in the graph came from which plugin.
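A minimal sketch of the negation walk, assuming simplified types — the PR does not yet define reverse operations, so `Undo` and its variants are illustrative only:

```rust
// Assumed, simplified types: `Undo` and its variants are invented for
// illustration; the PR does not yet define reverse operations.

#[derive(Debug)]
enum Operation {
    EnterClass(String),
    EnterMethod(String),
    ExitScope,
}

#[derive(Debug, PartialEq)]
enum Undo {
    RemoveClass(String),
    RemoveMethod(String),
}

// Walk the file's operations in reverse; scope exits negate to no-ops.
fn unload(ops: &[Operation]) -> Vec<Undo> {
    ops.iter()
        .rev()
        .filter_map(|op| match op {
            Operation::EnterClass(name) => Some(Undo::RemoveClass(name.clone())),
            Operation::EnterMethod(name) => Some(Undo::RemoveMethod(name.clone())),
            Operation::ExitScope => None,
        })
        .collect()
}
```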

Incremental re-indexing

Because operations are a flat list, we can diff the old and new operation lists when a file changes. Only the delta gets applied: remove operations that disappeared, add new ones. Most edits change very few operations, so the diff is small.

# user.rb: rename `name` to `full_name`

Old operations:                    New operations:
  EnterClass(User)                   EnterClass(User)
    EnterMethod(name)          →       EnterMethod(full_name)
    ExitScope                          ExitScope
    EnterMethod(email)                 EnterMethod(email)
    ExitScope                          ExitScope
  ExitScope                          ExitScope

# Diff: remove EnterMethod(name), add EnterMethod(full_name)

No full rebuild, no negating the entire file, just the minimal set of changes.
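The rename example above can be sketched as a set difference over the two lists. This is a deliberately simplified model (operations as strings, duplicates and positions ignored); a real implementation would diff structured operations and account for position:

```rust
use std::collections::HashSet;

// Simplified sketch: operations represented as strings and diffed as sets.
// A real implementation would diff structured operations positionally.
fn diff<'a>(old: &[&'a str], new: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let old_set: HashSet<&str> = old.iter().copied().collect();
    let new_set: HashSet<&str> = new.iter().copied().collect();
    let removed = old.iter().copied().filter(|op| !new_set.contains(op)).collect();
    let added = new.iter().copied().filter(|op| !old_set.contains(op)).collect();
    (removed, added)
}
```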

Order-dependent diagnostics

Operations preserve source order, which the current graph doesn't. Consider:

class Foo
  private_constant :BAR
  BAR = 42
end

Ruby raises NameError because private_constant :BAR runs before BAR exists. Today Rubydex can't catch this because both the constant and the visibility change become unordered definitions in the graph, and resolution wires them together regardless of order. With operations, the ordering is explicit (SetConstantVisibility before DefineConstant), and a diagnostic pass can detect the forward reference.
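Such a diagnostic pass could be a single forward walk over the stream, flagging any visibility change that names a constant not yet defined. The variant shapes below are assumptions (the real operations carry IDs and offsets):

```rust
use std::collections::HashSet;

// Assumed, simplified operation shapes for illustration.
#[derive(Debug)]
enum Op {
    DefineConstant(&'static str),
    SetConstantVisibility(&'static str),
}

// Return the constants referenced by a visibility change before definition;
// these would raise NameError at runtime.
fn forward_visibility_refs(ops: &[Op]) -> Vec<&'static str> {
    let mut defined: HashSet<&str> = HashSet::new();
    let mut diagnostics = Vec::new();
    for op in ops {
        match op {
            Op::DefineConstant(name) => {
                defined.insert(*name);
            }
            Op::SetConstantVisibility(name) if !defined.contains(*name) => {
                diagnostics.push(*name);
            }
            _ => {}
        }
    }
    diagnostics
}
```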

How

Operation enum (operation/mod.rs): ~20 variants covering namespace scopes (EnterClass, EnterModule, EnterSingletonClass, ExitScope), method operations (EnterMethod, DefineAttribute, AliasMethod, SetMethodVisibility, SetDefaultVisibility), constants (DefineConstant, AliasConstant, SetConstantVisibility), variables (DefineInstanceVariable, DefineClassVariable, DefineGlobalVariable), mixins (Mixin), and references (ReferenceConstant, ReferenceMethod). Each variant is its own struct so consumers can take fields by value or by reference.
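The variant-as-struct shape might look roughly like this. Field names and types are assumptions, not the PR's actual definitions, and the review below suggests making fields private behind getters:

```rust
// Illustrative only: the real structs carry IDs, offsets, and more, and the
// review suggests private fields with getters rather than `pub` fields.

#[derive(Debug)]
pub struct EnterClass {
    pub name: String,
    pub superclass: Option<String>,
}

#[derive(Debug)]
pub struct EnterMethod {
    pub name: String,
}

#[derive(Debug)]
pub enum Operation {
    EnterClass(EnterClass),
    EnterMethod(EnterMethod),
    ExitScope,
}

// Consumers can take variant payloads by value or by reference.
fn class_name(op: &Operation) -> Option<&str> {
    match op {
        Operation::EnterClass(c) => Some(c.name.as_str()),
        _ => None,
    }
}
```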

OperationPrinter (operation/printer.rs): Debug formatter that pretty-prints Vec<Operation> with indentation reflecting scope nesting. Used for debugging and testing the builder output.

RubyOperationBuilder (operation/ruby_builder.rs): Walks the Ruby AST and produces Vec<Operation>. Same visitor pattern as RubyIndexer, but emits operations instead of mutating a graph. ~3100 lines, 1:1 correspondence with every RubyIndexer behavior.

OperationApplier (operation/applier.rs): Converts Vec<Operation> into a LocalGraph. Consumes operations by value (zero clones). This is scaffolding that exists only because the current merge pipeline expects a LocalGraph. In the target architecture, resolution consumes operations directly and both OperationApplier and LocalGraph go away.
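The consume-by-value point can be sketched as follows; `LocalGraph` and the operation shapes here are stand-ins invented for illustration, not the PR's real types:

```rust
// Illustrative stand-ins for the real LocalGraph and Operation types.
struct LocalGraph {
    classes: Vec<String>,
    methods: Vec<String>,
}

enum Operation {
    EnterClass { name: String },
    EnterMethod { name: String },
    ExitScope,
}

// Taking `Vec<Operation>` by value means owned data (e.g. each `name`
// String) moves into the graph without a single clone.
fn apply(ops: Vec<Operation>, graph: &mut LocalGraph) {
    for op in ops {
        match op {
            Operation::EnterClass { name } => graph.classes.push(name),
            Operation::EnterMethod { name } => graph.methods.push(name),
            Operation::ExitScope => {}
        }
    }
}
```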

IndexerBackend enum (indexing.rs, main.rs): --indexer ruby_indexer|operation_builder CLI flag to switch backends. Both backends run the same shared test suite (166 test functions × 2 backends).

Parity verification

Both the indexer test suite (166 tests) and the resolution test suite (208 tests) run with both backends using a #[path] module trick — each parent module provides a backend() function via super::backend(), so the same test file exercises both RubyIndexer and OperationBuilder.
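The `#[path]` trick spans multiple files, so it can't be shown verbatim here; this single-file sketch emulates the same idea with a macro — one shared test body compiled once per backend module, each module supplying its own `backend_name()`. All names are invented for illustration:

```rust
// Single-file emulation of the shared-suite idea. In the PR, the sharing is
// done with `#[path]`-included test files, not a macro.

macro_rules! shared_tests {
    () => {
        // Shared "test" body: both backends must report the same counts.
        pub fn definition_report() -> String {
            format!("{}: 2 definitions", backend_name())
        }
    };
}

mod ruby_indexer {
    pub fn backend_name() -> &'static str {
        "RubyIndexer"
    }
    shared_tests!();
}

mod operation_builder {
    pub fn backend_name() -> &'static str {
        "OperationBuilder"
    }
    shared_tests!();
}
```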

Both backends produce identical output on Shopify core:

|                    | RubyIndexer | OperationBuilder |
|--------------------|-------------|------------------|
| Total definitions  | 926,926     | 926,926          |
| Total declarations | 749,360     | 749,360          |

Performance

|               | RubyIndexer | OperationBuilder | Delta |
|---------------|-------------|------------------|-------|
| Indexing time | 3.98s       | 4.02s            | +1.0% |
| Total time    | 29.3s       | 29.9s            | +1.9% |
| Memory (RSS)  | 3,063 MB    | 3,143 MB         | +2.6% |

Essentially identical performance. The intermediate Vec<Operation> per file adds a very small overhead.

@Morriar Morriar requested a review from a team as a code owner May 8, 2026 16:52
Morriar added 9 commits May 8, 2026 12:55
1. Move definition ID computation (namespace_definition_id, method_definition_id) from inline format strings in each definition type into model::ids so they can be reused by the operation applier.
2. Introduce the Operation enum with 18 struct variants (EnterClass, EnterModule, EnterMethod, DefineConstant, etc.) plus ExitScope. Each struct carries its own data; scope context is provided by surrounding Enter/Exit operations.
3. Implement OperationVisitor to produce a human-readable, indented text representation of operations. Used by builder tests to assert on output.
4. Walk the Ruby AST (via ruby_prism) to produce a Vec<Operation>. This is the first phase of the two-phase indexing pipeline: source code to operations.
5. Implement OperationVisitor to convert operations into definitions in a LocalGraph. This is the second phase of the pipeline: operations to indexed definitions.
6. Add LocalGraph::from_parts, which allows constructing a graph from pre-built parts (strings, names, document) so the applier can build graphs without going through the indexer.
7. Introduce IndexerBackend to switch between RubyIndexer and the operation pipeline. Add build_local_graph to dispatch indexing based on the backend.
8. Run the same tests against both RubyIndexer and OperationApplier via a shared ruby_indexer_tests.rs included with #[path]. Each parent provides a backend() function to select which pipeline to use.
9. Wire up the IndexerBackend enum to the CLI so users can choose between the original RubyIndexer and the operation pipeline.
@Morriar Morriar force-pushed the at-index-operations branch from e4e719f to fd915fa Compare May 8, 2026 16:58
Member

@vinistock vinistock left a comment


I really believe moving towards an IR will give us a much nicer way to express all of the richness in Ruby code.

It's challenging to reason about the implementation because we are supporting both mechanisms temporarily and without a VM to go along with the IR, but I would rather pull the future forward and iterate fast on the refactor than hold this back.

Here are some questions/things to explore in case they prove useful:

  1. It would be nice to test out more granular instructions to see what the code looks like. We don't have to do it for all definitions, but maybe experiment with classes by supporting Class.new and Struct.new to see what emerges
  2. What do the reverse instructions look like for incremental resolution? Is it possible to generically reverse all operations (something like [IR::Undo, IR::DefineClass])? Or do we need to create each reverse operation one by one (e.g.: [IR::UndefClass])?
  3. After we have a proper VM that processes the IR directly without the intermediate local graph, how do we account for ordering and parallelism in processing the chunks? For example, what happens if we start interpreting one chunk and hit a constant dependency? I'd assume there's no way to order the chunks correctly because you can have different constant dependencies per instruction (and not per chunk), so is it possible to pause the execution of a chunk and resume after? Or do we need full dependency tracking?
  4. An interesting aspect is that the boundary between compiler and VM can be quite blurry. For example, if we find a private in the code, do we apply the visibility to the IR during compilation or does it become an instruction that only gets processed by the VM? For this particular example, I think the answer is the latter, but I'm curious about what other boundaries will change


/// The receiver of a singleton method definition.
#[derive(Debug)]
#[derive(Debug, Clone)]
Member


Do we need the clone?

Comment on lines +19 to +25
#[must_use]
pub fn namespace_definition_id(uri_id: UriId, offset: &Offset, name_id: NameId) -> DefinitionId {
DefinitionId::from(&format!("{}{}{}", *uri_id, offset.start(), *name_id))
}

#[must_use]
pub fn method_definition_id(
Member


Can we define these as associated functions of DefinitionId instead? Basically, so that we can do this instead:

DefinitionId::for_namespace()
DefinitionId::for_method()
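The suggested shape might look like this, with the ID component types simplified to plain integers for illustration (the real code takes UriId, &Offset, and NameId):

```rust
// Simplified sketch of the suggested refactor: plain integers stand in for
// the real UriId/Offset/NameId types.

#[derive(Debug, PartialEq)]
struct DefinitionId(String);

impl DefinitionId {
    // Same format-string scheme as the free function, now an associated fn.
    fn for_namespace(uri_id: u32, offset_start: usize, name_id: u32) -> Self {
        DefinitionId(format!("{uri_id}{offset_start}{name_id}"))
    }
}
```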

Member


The mod.rs thing is the old Rust naming style. Can we use this instead?

rubydex/src/operation.rs # defines the module
           /operation/whatever.rs

Also, my vote goes to naming it IR instead of operation.

}

#[derive(Debug)]
pub struct EnterClass {
Member


Let's default all of these fields to private and expose with getters.

SetMethodVisibility, Target,
};

struct OperationPrinter<'a> {
Member


Not necessary for this PR, but it might be useful to consider the implementation as an M x N matrix of IR compilers and VMs.

For the compilers, we already have a trait with default behaviours to guide the implementation (Prism's visit trait). We should create a virtual machine trait that defaults to doing nothing for all instructions. It's a small thing, but it makes it easier to discover which instructions you haven't handled yet and you get better LSP support.
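The suggested VM trait could be sketched like this: every instruction handler defaults to a no-op, so an implementation only overrides what it cares about and unhandled instructions are easy to enumerate. All names here are invented for illustration:

```rust
// Invented instruction types for illustration.
struct EnterClass {
    name: String,
}
struct ExitScope;

// A VM trait where every handler defaults to doing nothing.
trait Vm {
    fn on_enter_class(&mut self, _op: &EnterClass) {} // default: ignore
    fn on_exit_scope(&mut self, _op: &ExitScope) {}   // default: ignore
}

// A VM that only counts class entries, relying on defaults for the rest.
struct ClassCounter {
    count: usize,
}

impl Vm for ClassCounter {
    fn on_enter_class(&mut self, _op: &EnterClass) {
        self.count += 1;
    }
    // on_exit_scope keeps the default no-op.
}
```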

@@ -0,0 +1,2915 @@
//! Visit the Ruby AST and produce an ordered list of operations.
//!
//! Walks the parsed AST and produces `Vec<Operation>` that can later be applied
Member


Idea: is it worth encapsulating a list of compiled instructions into a Chunk?

/// A lexical scope (class/module keyword) that produces a new constant resolution scope.
LexicalScope { name_id: NameId, is_module: bool },
/// An owner that doesn't produce a lexical scope (Class.new/Module.new).
Owner { name_id: NameId, is_module: bool },
Member


On the other approach, we walk the nesting backwards to figure out if we're inside a module or not. Is storing the boolean here mostly a performance improvement? Or is there some other aspect?

/// without looking up definitions. Distinct from `operation::Target` which represents
/// the source-level receiver without resolved names.
#[derive(Clone, Copy)]
enum NestingReceiver {
Member


I think it would be nicer to extract this into definitions.rs since the case of having something where the receiver is either self or a constant is so common. We can probably reuse the same enum for method and method alias definitions.

end
",
"
EnterClass(Foo)
Member


I love this style of tests. Very easy to visually verify what the expected IR is.
