
Add operation-based indexing pipeline#798

Open
Morriar wants to merge 9 commits into main from at-index-operations

Conversation

Contributor

@Morriar Morriar commented May 8, 2026

What

Introduces an intermediate representation (IR) between Ruby parsing and graph construction. Instead of the indexer writing directly to a LocalGraph, a new pipeline produces an ordered Vec<Operation> (an explicit list of what each file contributes to the graph), then applies those operations to build the graph.

Current pipeline:

graph LR
    AST --> RubyIndexer --> LocalGraph --> Graph --> Resolution

New pipeline:

graph LR
    AST --> RubyOperationBuilder --> Operations["Vec&lt;Operation&gt;"] --> OperationApplier --> LocalGraph --> Graph --> Resolution

(This is an intermediate step to ensure parity with the current indexer / local graph and reduce the scope of this PR.)

Target pipeline:

graph LR
    AST --> Operations["Vec&lt;Operation&gt;"] --> Graph["Graph (resolution applies operations directly)"]

Once we commit to this direction, we can remove the LocalGraph and feed the operations directly to the resolver to let it create both the definitions and the declarations.

Why

Four motivations that reinforce each other:

DSL plugin API

Plugins need to contribute definitions to the graph. Without operations, plugins mutate LocalGraph internals: fragile, tightly coupled, and bad for parallelism (Ruby plugins calling into the graph would require synchronization).

With operations, a plugin just produces Vec<Operation>, a pure data structure that can be built in parallel and applied later. Operations become the public contract between Rubydex and plugins.

class Post < ApplicationRecord
  has_many :comments
end

An ActiveRecord plugin sees has_many :comments and emits operations. It doesn't touch graph internals, it speaks the same operation protocol as the built-in indexer.

EnterMethod(comments)
ExitScope
EnterMethod(comments=)
ExitScope
EnterMethod(comment_ids)
ExitScope
EnterMethod(comment_ids=)
ExitScope
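As a sketch of that contract, a hypothetical `has_many` expansion might look like the following. Note this is illustrative only: `Operation` here is a simplified stand-in for the PR's real enum (which carries name IDs, offsets, and more), and `expand_has_many`/`singular` are invented helper names, not the plugin API.

```rust
// Hypothetical sketch: `Operation` is a simplified stand-in for the PR's
// real enum; `expand_has_many` and `singular` are invented names.

#[derive(Debug, PartialEq)]
enum Operation {
    EnterMethod { name: String },
    ExitScope,
}

// Naive singularization, just enough for this example.
fn singular(word: &str) -> &str {
    word.strip_suffix('s').unwrap_or(word)
}

// The plugin never touches graph internals: it only returns data.
fn expand_has_many(association: &str) -> Vec<Operation> {
    let stem = singular(association);
    let methods = [
        association.to_string(),   // comments
        format!("{association}="), // comments=
        format!("{stem}_ids"),     // comment_ids
        format!("{stem}_ids="),    // comment_ids=
    ];
    let mut ops = Vec::new();
    for name in methods {
        ops.push(Operation::EnterMethod { name });
        ops.push(Operation::ExitScope);
    }
    ops
}
```

Because the return value is plain data, many plugins can run in parallel and their `Vec<Operation>` results can be applied later without synchronization on the graph.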

Unloading

Each file's contribution is its operation list. Negating the operations (walk in reverse, opposite effect) cleanly unloads the file without graph diffing.

# user.rb produces:
# 1. EnterClass(User)
# 2.   EnterMethod(name)
# 3.   ExitScope
# 4.   EnterMethod(email)
# 5.   ExitScope
# 6. ExitScope

# To unload user.rb: negate in reverse order
# 6. (no-op)
# 5. (no-op)
# 4. RemoveMethod(email)
# 3. (no-op)
# 2. RemoveMethod(name)
# 1. RemoveClass(User)

Because DSL plugins produce the same IR primitives as the built-in indexer, we know how to undo their contributions for free: no plugin-specific cleanup logic, no tracking which definitions in the graph came from which plugin.
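A minimal sketch of the negation walk, assuming simplified types — the PR does not yet define reverse operations, so `Undo` and its variants are illustrative only:

```rust
// Assumed, simplified types: `Undo` and its variants are invented for
// illustration; the PR does not yet define reverse operations.

#[derive(Debug)]
enum Operation {
    EnterClass(String),
    EnterMethod(String),
    ExitScope,
}

#[derive(Debug, PartialEq)]
enum Undo {
    RemoveClass(String),
    RemoveMethod(String),
}

// Walk the file's operations in reverse; scope exits negate to no-ops.
fn unload(ops: &[Operation]) -> Vec<Undo> {
    ops.iter()
        .rev()
        .filter_map(|op| match op {
            Operation::EnterClass(name) => Some(Undo::RemoveClass(name.clone())),
            Operation::EnterMethod(name) => Some(Undo::RemoveMethod(name.clone())),
            Operation::ExitScope => None,
        })
        .collect()
}
```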

Incremental re-indexing

Because operations are a flat list, we can diff the old and new operation lists when a file changes. Only the delta gets applied: remove operations that disappeared, add new ones. Most edits change very few operations, so the diff is small.

# user.rb: rename `name` to `full_name`

Old operations:                    New operations:
  EnterClass(User)                   EnterClass(User)
    EnterMethod(name)          →       EnterMethod(full_name)
    ExitScope                          ExitScope
    EnterMethod(email)                 EnterMethod(email)
    ExitScope                          ExitScope
  ExitScope                          ExitScope

# Diff: remove EnterMethod(name), add EnterMethod(full_name)

No full rebuild, no negating the entire file, just the minimal set of changes.
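The rename example above can be sketched as a set difference over the two lists. This is a deliberately simplified model (operations as strings, duplicates and positions ignored); a real implementation would diff structured operations and account for position:

```rust
use std::collections::HashSet;

// Simplified sketch: operations represented as strings and diffed as sets.
// A real implementation would diff structured operations positionally.
fn diff<'a>(old: &[&'a str], new: &[&'a str]) -> (Vec<&'a str>, Vec<&'a str>) {
    let old_set: HashSet<&str> = old.iter().copied().collect();
    let new_set: HashSet<&str> = new.iter().copied().collect();
    let removed = old.iter().copied().filter(|op| !new_set.contains(op)).collect();
    let added = new.iter().copied().filter(|op| !old_set.contains(op)).collect();
    (removed, added)
}
```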

Order-dependent diagnostics

Operations preserve source order, which the current graph doesn't. Consider:

class Foo
  private_constant :BAR
  BAR = 42
end

Ruby raises NameError because private_constant :BAR runs before BAR exists. Today Rubydex can't catch this because both the constant and the visibility change become unordered definitions in the graph, and resolution wires them together regardless of order. With operations, the ordering is explicit (SetConstantVisibility before DefineConstant), and a diagnostic pass can detect the forward reference.
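Such a diagnostic pass could be a single forward walk over the stream, flagging any visibility change that names a constant not yet defined. The variant shapes below are assumptions (the real operations carry IDs and offsets):

```rust
use std::collections::HashSet;

// Assumed, simplified operation shapes for illustration.
#[derive(Debug)]
enum Op {
    DefineConstant(&'static str),
    SetConstantVisibility(&'static str),
}

// Return the constants referenced by a visibility change before definition;
// these would raise NameError at runtime.
fn forward_visibility_refs(ops: &[Op]) -> Vec<&'static str> {
    let mut defined: HashSet<&str> = HashSet::new();
    let mut diagnostics = Vec::new();
    for op in ops {
        match op {
            Op::DefineConstant(name) => {
                defined.insert(*name);
            }
            Op::SetConstantVisibility(name) if !defined.contains(*name) => {
                diagnostics.push(*name);
            }
            _ => {}
        }
    }
    diagnostics
}
```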

How

Operation enum (operation/mod.rs): ~20 variants covering namespace scopes (EnterClass, EnterModule, EnterSingletonClass, ExitScope), method operations (EnterMethod, DefineAttribute, AliasMethod, SetMethodVisibility, SetDefaultVisibility), constants (DefineConstant, AliasConstant, SetConstantVisibility), variables (DefineInstanceVariable, DefineClassVariable, DefineGlobalVariable), mixins (Mixin), and references (ReferenceConstant, ReferenceMethod). Each variant is its own struct so consumers can take fields by value or by reference.
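The variant-as-struct shape might look roughly like this. Field names and types are assumptions, not the PR's actual definitions, and the review below suggests making fields private behind getters:

```rust
// Illustrative only: the real structs carry IDs, offsets, and more, and the
// review suggests private fields with getters rather than `pub` fields.

#[derive(Debug)]
pub struct EnterClass {
    pub name: String,
    pub superclass: Option<String>,
}

#[derive(Debug)]
pub struct EnterMethod {
    pub name: String,
}

#[derive(Debug)]
pub enum Operation {
    EnterClass(EnterClass),
    EnterMethod(EnterMethod),
    ExitScope,
}

// Consumers can take variant payloads by value or by reference.
fn class_name(op: &Operation) -> Option<&str> {
    match op {
        Operation::EnterClass(c) => Some(c.name.as_str()),
        _ => None,
    }
}
```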

OperationPrinter (operation/printer.rs): Debug formatter that pretty-prints Vec<Operation> with indentation reflecting scope nesting. Used for debugging and testing the builder output.

RubyOperationBuilder (operation/ruby_builder.rs): Walks the Ruby AST and produces Vec<Operation>. Same visitor pattern as RubyIndexer, but emits operations instead of mutating a graph. ~3100 lines, 1:1 correspondence with every RubyIndexer behavior.

OperationApplier (operation/applier.rs): Converts Vec<Operation> into a LocalGraph. Consumes operations by value (zero clones). This is scaffolding that exists only because the current merge pipeline expects a LocalGraph. In the target architecture, resolution consumes operations directly and both OperationApplier and LocalGraph go away.
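The consume-by-value point can be sketched as follows; `LocalGraph` and the operation shapes here are stand-ins invented for illustration, not the PR's real types:

```rust
// Illustrative stand-ins for the real LocalGraph and Operation types.
struct LocalGraph {
    classes: Vec<String>,
    methods: Vec<String>,
}

enum Operation {
    EnterClass { name: String },
    EnterMethod { name: String },
    ExitScope,
}

// Taking `Vec<Operation>` by value means owned data (e.g. each `name`
// String) moves into the graph without a single clone.
fn apply(ops: Vec<Operation>, graph: &mut LocalGraph) {
    for op in ops {
        match op {
            Operation::EnterClass { name } => graph.classes.push(name),
            Operation::EnterMethod { name } => graph.methods.push(name),
            Operation::ExitScope => {}
        }
    }
}
```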

IndexerBackend enum (indexing.rs, main.rs): --indexer ruby_indexer|operation_builder CLI flag to switch backends. Both backends run the same shared test suite (166 test functions × 2 backends).

Parity verification

Both the indexer test suite (166 tests) and the resolution test suite (208 tests) run with both backends using a #[path] module trick — each parent module provides a backend() function via super::backend(), so the same test file exercises both RubyIndexer and OperationBuilder.
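The `#[path]` trick spans multiple files, so it can't be shown verbatim here; this single-file sketch emulates the same idea with a macro — one shared test body compiled once per backend module, each module supplying its own `backend_name()`. All names are invented for illustration:

```rust
// Single-file emulation of the shared-suite idea. In the PR, the sharing is
// done with `#[path]`-included test files, not a macro.

macro_rules! shared_tests {
    () => {
        // Shared "test" body: both backends must report the same counts.
        pub fn definition_report() -> String {
            format!("{}: 2 definitions", backend_name())
        }
    };
}

mod ruby_indexer {
    pub fn backend_name() -> &'static str {
        "RubyIndexer"
    }
    shared_tests!();
}

mod operation_builder {
    pub fn backend_name() -> &'static str {
        "OperationBuilder"
    }
    shared_tests!();
}
```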

Both backends produce identical output on Shopify core:

|                    | RubyIndexer | OperationBuilder |
|--------------------|-------------|------------------|
| Total definitions  | 926,926     | 926,926          |
| Total declarations | 749,360     | 749,360          |

Performance

|               | RubyIndexer | OperationBuilder | Delta |
|---------------|-------------|------------------|-------|
| Indexing time | 3.98s       | 4.02s            | +1.0% |
| Total time    | 29.3s       | 29.9s            | +1.9% |
| Memory (RSS)  | 3,063 MB    | 3,143 MB         | +2.6% |

Essentially identical performance. The intermediate Vec<Operation> per file adds a very small overhead.

@Morriar Morriar requested a review from a team as a code owner May 8, 2026 16:52
Morriar added 9 commits May 8, 2026 12:55
1. Move definition ID computation (namespace_definition_id, method_definition_id) from inline format strings in each definition type into model::ids so they can be reused by the operation applier.
2. Introduce the Operation enum with 18 struct variants (EnterClass, EnterModule, EnterMethod, DefineConstant, etc.) plus ExitScope. Each struct carries its own data; scope context is provided by surrounding Enter/Exit operations.
3. Implement OperationVisitor to produce a human-readable, indented text representation of operations. Used by builder tests to assert on output.
4. Walk the Ruby AST (via ruby_prism) to produce a Vec<Operation>. This is the first phase of the two-phase indexing pipeline: source code to operations.
5. Implement OperationVisitor to convert operations into definitions in a LocalGraph. This is the second phase of the pipeline: operations to indexed definitions.
6. Add LocalGraph::from_parts, which allows constructing a graph from pre-built parts (strings, names, document) so the applier can build graphs without going through the indexer.
7. Introduce IndexerBackend to switch between RubyIndexer and the operation pipeline. Add build_local_graph to dispatch indexing based on the backend.
8. Run the same tests against both RubyIndexer and OperationApplier via a shared ruby_indexer_tests.rs included with #[path]. Each parent provides a backend() function to select which pipeline to use.
9. Wire up the IndexerBackend enum to the CLI so users can choose between the original RubyIndexer and the operation pipeline.
@Morriar Morriar force-pushed the at-index-operations branch from e4e719f to fd915fa Compare May 8, 2026 16:58
Member

@vinistock vinistock left a comment


I really believe moving towards an IR will give us a much nicer way to express all of the richness in Ruby code.

It's challenging to reason about the implementation because we are supporting both mechanisms temporarily and without a VM to go along with the IR, but I would rather pull the future forward and iterate fast on the refactor than hold this back.

Here are some questions/things to explore in case they prove useful:

  1. It would be nice to test out more granular instructions to see what the code looks like. We don't have to do it for all definitions, but maybe experiment with classes by supporting Class.new and Struct.new to see what emerges
  2. What do the reverse instructions look like for incremental resolution? Is it possible to generically reverse all operations (something like [IR::Undo, IR::DefineClass])? Or do we need to create each reverse operation one by one (e.g.: [IR::UndefClass])?
  3. After we have a proper VM that processes the IR directly without the intermediate local graph, how do we account for ordering and parallelism in processing the chunks? For example, what happens if we start interpreting one chunk and hit a constant dependency? I'd assume there's no way to order the chunks correctly because you can have different constant dependencies per instruction (and not per chunk), so is it possible to pause the execution of a chunk and resume after? Or do we need full dependency tracking?
  4. An interesting aspect is that the boundary between compiler and VM can be quite blurry. For example, if we find a private in the code, do we apply the visibility to the IR during compilation or does it become an instruction that only gets processed by the VM? For this particular example, I think the answer is the latter, but I'm curious about what other boundaries will change


/// The receiver of a singleton method definition.
#[derive(Debug)]
#[derive(Debug, Clone)]
Member


Do we need the clone?

Comment on lines +19 to +25
#[must_use]
pub fn namespace_definition_id(uri_id: UriId, offset: &Offset, name_id: NameId) -> DefinitionId {
DefinitionId::from(&format!("{}{}{}", *uri_id, offset.start(), *name_id))
}

#[must_use]
pub fn method_definition_id(
Member


Can we define these as associated functions of DefinitionId instead? Basically, so that we can do this instead:

DefinitionId::for_namespace()
DefinitionId::for_method()
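The suggested shape might look like this, with the ID component types simplified to plain integers for illustration (the real code takes UriId, &Offset, and NameId):

```rust
// Simplified sketch of the suggested refactor: plain integers stand in for
// the real UriId/Offset/NameId types.

#[derive(Debug, PartialEq)]
struct DefinitionId(String);

impl DefinitionId {
    // Same format-string scheme as the free function, now an associated fn.
    fn for_namespace(uri_id: u32, offset_start: usize, name_id: u32) -> Self {
        DefinitionId(format!("{uri_id}{offset_start}{name_id}"))
    }
}
```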

Member


The mod.rs thing is the old Rust naming style. Can we use this instead?

rubydex/src/operation.rs # defines the module
           /operation/whatever.rs

Also, my vote goes to naming it IR instead of operation.

}

#[derive(Debug)]
pub struct EnterClass {
Member


Let's default all of these fields to private and expose with getters.

SetMethodVisibility, Target,
};

struct OperationPrinter<'a> {
Member


Not necessary for this PR, but it might be useful to consider the implementation as an M x N matrix of IR compilers and VMs.

For the compilers, we already have a trait with default behaviours to guide the implementation (Prism's visit trait). We should create a virtual machine trait that defaults to doing nothing for all instructions. It's a small thing, but it makes it easier to discover which instructions you haven't handled yet and you get better LSP support.
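The suggested VM trait could be sketched like this: every instruction handler defaults to a no-op, so an implementation only overrides what it cares about and unhandled instructions are easy to enumerate. All names here are invented for illustration:

```rust
// Invented instruction types for illustration.
struct EnterClass {
    name: String,
}
struct ExitScope;

// A VM trait where every handler defaults to doing nothing.
trait Vm {
    fn on_enter_class(&mut self, _op: &EnterClass) {} // default: ignore
    fn on_exit_scope(&mut self, _op: &ExitScope) {}   // default: ignore
}

// A VM that only counts class entries, relying on defaults for the rest.
struct ClassCounter {
    count: usize,
}

impl Vm for ClassCounter {
    fn on_enter_class(&mut self, _op: &EnterClass) {
        self.count += 1;
    }
    // on_exit_scope keeps the default no-op.
}
```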

@@ -0,0 +1,2915 @@
//! Visit the Ruby AST and produce an ordered list of operations.
//!
//! Walks the parsed AST and produces `Vec<Operation>` that can later be applied
Member


Idea: is it worth encapsulating a list of compiled instructions into a Chunk?

/// A lexical scope (class/module keyword) that produces a new constant resolution scope.
LexicalScope { name_id: NameId, is_module: bool },
/// An owner that doesn't produce a lexical scope (Class.new/Module.new).
Owner { name_id: NameId, is_module: bool },
Member


On the other approach, we walk the nesting backwards to figure out if we're inside a module or not. Is storing the boolean here mostly a performance improvement? Or is there some other aspect?

/// without looking up definitions. Distinct from `operation::Target` which represents
/// the source-level receiver without resolved names.
#[derive(Clone, Copy)]
enum NestingReceiver {
Member


I think it would be nicer to extract this into definitions.rs since the case of having something where the receiver is either self or a constant is so common. We can probably reuse the same enum for method and method alias definitions.

end
",
"
EnterClass(Foo)
Member


I love this style of tests. Very easy to visually verify what the expected IR is.
