Add operation-based indexing pipeline #798
Move definition ID computation (namespace_definition_id, method_definition_id) from inline format strings in each definition type into model::ids so they can be reused by the operation applier.
Introduce the Operation enum with 18 struct variants (EnterClass, EnterModule, EnterMethod, DefineConstant, etc.) plus ExitScope. Each struct carries its own data; scope context is provided by surrounding Enter/Exit operations.
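To make the shape of this concrete, here is a minimal sketch of what an enum of struct variants plus `ExitScope` can look like. The variant fields are assumptions for illustration (the real variants carry ids, offsets, and more), and only a few of the 18 variants are shown:

```rust
// Hypothetical, simplified Operation enum: each variant wraps its own
// struct so consumers can take fields by value or by reference.
#[derive(Debug, Clone, PartialEq)]
enum Operation {
    EnterClass(EnterClass),
    EnterMethod(EnterMethod),
    DefineConstant(DefineConstant),
    // ...more variants in the real enum...
    ExitScope, // closes the innermost Enter* scope
}

#[derive(Debug, Clone, PartialEq)]
struct EnterClass { name: String } // assumed field

#[derive(Debug, Clone, PartialEq)]
struct EnterMethod { name: String }

#[derive(Debug, Clone, PartialEq)]
struct DefineConstant { name: String }
```

Under this sketch, a file like `class Foo; BAR = 1; end` would compile to roughly `[EnterClass, DefineConstant, ExitScope]`, with scope context carried by the surrounding Enter/Exit pair rather than stored on `DefineConstant`.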
Implement OperationVisitor to produce a human-readable, indented text representation of operations. Used by builder tests to assert on output.
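A guess at what such an indented printer could look like, reusing the simplified operation shape from above (names and two-space indentation are assumptions, not the actual implementation):

```rust
// Hypothetical sketch: indentation depth tracks Enter*/ExitScope nesting.
enum Op {
    EnterClass(String),
    DefineConstant(String),
    ExitScope,
}

fn print_ops(ops: &[Op]) -> String {
    let mut out = String::new();
    let mut depth = 0usize;
    for op in ops {
        match op {
            Op::EnterClass(name) => {
                out.push_str(&"  ".repeat(depth));
                out.push_str(&format!("EnterClass({})\n", name));
                depth += 1; // children print indented one level deeper
            }
            Op::DefineConstant(name) => {
                out.push_str(&"  ".repeat(depth));
                out.push_str(&format!("DefineConstant({})\n", name));
            }
            Op::ExitScope => depth = depth.saturating_sub(1),
        }
    }
    out
}
```

Tests can then assert on the rendered text, which is easy to read in a failure diff.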
Walk the Ruby AST (via ruby_prism) to produce a Vec<Operation>. This is the first phase of the two-phase indexing pipeline: source code to operations.
Implement OperationVisitor to convert operations into definitions in a LocalGraph. This is the second phase of the pipeline: operations to indexed definitions. LocalGraph::from_parts allows constructing a graph from pre-built parts (strings, names, document) so the applier can build graphs without going through the indexer.
Introduce IndexerBackend to switch between RubyIndexer and the operation pipeline. Add build_local_graph to dispatch indexing based on the backend.
Both RubyIndexer and OperationApplier now run the same tests via a shared ruby_indexer_tests.rs included with #[path]. Each parent provides a backend() function to select which pipeline to use.
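The `#[path]` sharing trick can be sketched like this (file paths and module names here are illustrative, not the repo's exact layout):

```rust
// hypothetical: parent module for the RubyIndexer backend
fn backend() -> IndexerBackend {
    IndexerBackend::RubyIndexer
}

// Include the shared suite under this parent. Every test in the shared
// file calls `super::backend()` to decide which pipeline to exercise,
// so the same test bodies run once per backend.
#[path = "../ruby_indexer_tests.rs"]
mod shared;
```

The other parent module is identical except that its `backend()` returns the operation-pipeline variant.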
Wire up the IndexerBackend enum to the CLI so users can choose between the original RubyIndexer and the operation pipeline.
e4e719f to fd915fa
vinistock left a comment
I really believe moving towards an IR will give us a much nicer way to express all of the richness in Ruby code.
It's challenging to reason about the implementation because we are supporting both mechanisms temporarily and without a VM to go along with the IR, but I would rather we pull the future forward and iterate fast on the refactor than hold this back.
Here are some questions/things to explore in case they prove useful:
- It would be nice to test out more granular instructions to see what the code looks like. We don't have to do it for all definitions, but maybe experiment with classes by supporting `Class.new` and `Struct.new` to see what emerges.
- What do the reverse instructions look like for incremental resolution? Is it possible to generically reverse all operations (something like `[IR::Undo, IR::DefineClass]`)? Or do we need to create each reverse operation one by one (e.g. `[IR::UndefClass]`)?
- After we have a proper VM that processes the IR directly without the intermediate local graph, how do we account for ordering and parallelism in processing the chunks? For example, what happens if we start interpreting one chunk and hit a constant dependency? I'd assume there's no way to order the chunks correctly because you can have different constant dependencies per instruction (and not per chunk), so is it possible to pause the execution of a chunk and resume after? Or do we need full dependency tracking?
- An interesting aspect is that the boundary between compiler and VM can be quite blurry. For example, if we find a `private` in the code, do we apply the visibility to the IR during compilation or does it become an instruction that only gets processed by the VM? For this particular example, I think the answer is the latter, but I'm curious about what other boundaries will change.
```diff
 /// The receiver of a singleton method definition.
-#[derive(Debug)]
+#[derive(Debug, Clone)]
```

```rust
#[must_use]
pub fn namespace_definition_id(uri_id: UriId, offset: &Offset, name_id: NameId) -> DefinitionId {
    DefinitionId::from(&format!("{}{}{}", *uri_id, offset.start(), *name_id))
}

#[must_use]
pub fn method_definition_id(
```
Can we define these as associated functions of DefinitionId instead? Basically, so that we can do this instead:
```rust
DefinitionId::for_namespace()
DefinitionId::for_method()
```
The mod.rs thing is the old Rust naming style. Can we use this instead?
```
rubydex/src/operation.rs  # defines the module
/operation/whatever.rs
```
Also, my vote goes to naming it IR instead of operation.
```rust
#[derive(Debug)]
pub struct EnterClass {
```
Let's default all of these fields to private and expose with getters.
```rust
    SetMethodVisibility, Target,
};

struct OperationPrinter<'a> {
```
Not necessary for this PR, but it might be useful to consider the implementation as an M x N matrix of IR compilers and VMs.
For the compilers, we already have a trait with default behaviours to guide the implementation (Prism's visit trait). We should create a virtual machine trait that defaults to doing nothing for all instructions. It's a small thing, but it makes it easier to discover which instructions you haven't handled yet and you get better LSP support.
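A minimal sketch of such a VM trait, under the assumption of a simplified operation set (all names here are hypothetical): every hook defaults to a no-op, and a driver method dispatches each instruction, so an implementor only overrides what it handles.

```rust
enum Op {
    EnterClass(String),
    DefineConstant(String),
    ExitScope,
}

// VM trait with no-op defaults: unhandled instructions fall through,
// and the compiler/LSP makes it easy to see what a VM overrides.
trait Vm {
    fn enter_class(&mut self, _name: &str) {}
    fn define_constant(&mut self, _name: &str) {}
    fn exit_scope(&mut self) {}

    // Driver: dispatch each instruction to its hook.
    fn run(&mut self, ops: &[Op]) {
        for op in ops {
            match op {
                Op::EnterClass(n) => self.enter_class(n),
                Op::DefineConstant(n) => self.define_constant(n),
                Op::ExitScope => self.exit_scope(),
            }
        }
    }
}

// A VM that only counts constants; everything else is a no-op.
struct ConstantCounter { count: usize }

impl Vm for ConstantCounter {
    fn define_constant(&mut self, _name: &str) { self.count += 1; }
}
```

With a trait like this, the M x N matrix falls out naturally: each compiler produces `Vec<Op>`, and each VM is one `impl Vm`.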
```rust
//! Visit the Ruby AST and produce an ordered list of operations.
//!
//! Walks the parsed AST and produces `Vec<Operation>` that can later be applied
```
Idea: is it worth encapsulating a list of compiled instructions into a Chunk?
```rust
/// A lexical scope (class/module keyword) that produces a new constant resolution scope.
LexicalScope { name_id: NameId, is_module: bool },
/// An owner that doesn't produce a lexical scope (Class.new/Module.new).
Owner { name_id: NameId, is_module: bool },
```
On the other approach, we walk the nesting backwards to figure out if we're inside a module or not. Is storing the boolean here mostly a performance improvement? Or is there some other aspect?
```rust
/// without looking up definitions. Distinct from `operation::Target` which represents
/// the source-level receiver without resolved names.
#[derive(Clone, Copy)]
enum NestingReceiver {
```
I think it would be nicer to extract this into definitions.rs since the case of having something where the receiver is either self or a constant is so common. We can probably reuse the same enum for method and method alias definitions.
```rust
end
",
"
EnterClass(Foo)
```
I love this style of tests. Very easy to visually verify what the expected IR is.
## What

Introduces an intermediate representation (IR) between Ruby parsing and graph construction. Instead of the indexer writing directly to a `LocalGraph`, a new pipeline produces an ordered `Vec<Operation>` (an explicit list of what each file contributes to the graph), then applies those operations to build the graph.

Current pipeline:
```mermaid
graph LR
  AST --> RubyIndexer --> LocalGraph --> Graph --> Resolution
```

New pipeline:
```mermaid
graph LR
  AST --> RubyOperationBuilder --> Operations["Vec<Operation>"] --> OperationApplier --> LocalGraph --> Graph --> Resolution
```

(This is an intermediate step to ensure parity with the current indexer / local graph and reduce the scope of this PR.)
Target pipeline:
```mermaid
graph LR
  AST --> Operations["Vec<Operation>"] --> Graph["Graph (resolution applies operations directly)"]
```

Once we commit to this direction, we can remove the `LocalGraph` and feed the operations directly to the resolver to let it create both the definitions and the declarations.

## Why
Four motivations that reinforce each other:
### DSL plugin API

Plugins need to contribute definitions to the graph. Without operations, plugins mutate `LocalGraph` internals: fragile, tightly coupled, and bad for parallelism (Ruby plugins calling into the graph would require synchronization).

With operations, a plugin just produces `Vec<Operation>`, a pure data structure that can be built in parallel and applied later. Operations become the public contract between Rubydex and plugins.

An ActiveRecord plugin sees `has_many :comments` and emits operations. It doesn't touch graph internals; it speaks the same operation protocol as the built-in indexer.

### Unloading
Each file's contribution is its operation list. Negating the operations (walk in reverse, opposite effect) cleanly unloads the file without graph diffing.
Because DSL plugins produce the same IR primitives as the built-in indexer, we know how to undo their contributions for free: no plugin-specific cleanup logic, no tracking which definitions in the graph came from which plugin.
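The negation idea can be sketched as follows. This is hypothetical: the inverse operation (`RemoveConstant`) is invented for illustration, and rubydex's actual operation set and inverses may look different.

```rust
#[derive(Debug, Clone, PartialEq)]
enum Op {
    DefineConstant(String),
    RemoveConstant(String), // invented inverse, for illustration only
}

// Map each operation to its opposite.
fn invert(op: &Op) -> Op {
    match op {
        Op::DefineConstant(n) => Op::RemoveConstant(n.clone()),
        Op::RemoveConstant(n) => Op::DefineConstant(n.clone()),
    }
}

// Unloading a file = the inverses of its operations, in reverse order,
// so scopes are unwound innermost-first.
fn unload_ops(file_ops: &[Op]) -> Vec<Op> {
    file_ops.iter().rev().map(invert).collect()
}
```

The reverse walk matters: later operations can depend on earlier ones, so undoing must proceed from the end of the list backwards.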
### Incremental re-indexing
Because operations are a flat list, we can diff the old and new operation lists when a file changes. Only the delta gets applied: remove operations that disappeared, add new ones. Most edits change very few operations, so the diff is small.
No full rebuild, no negating the entire file, just the minimal set of changes.
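A count-based sketch of the diff step, assuming operations are `Eq + Hash + Clone`. This is a simplification: it ignores source order, which a real implementation would have to respect for the order-dependent diagnostics described below.

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Compare old and new operation lists and return (removed, added):
// operations that disappeared and operations that are new.
fn diff_ops<T: Eq + Hash + Clone>(old: &[T], new: &[T]) -> (Vec<T>, Vec<T>) {
    let mut counts: HashMap<&T, i64> = HashMap::new();
    for op in old {
        *counts.entry(op).or_insert(0) -= 1;
    }
    for op in new {
        *counts.entry(op).or_insert(0) += 1;
    }
    let mut removed = Vec::new();
    let mut added = Vec::new();
    for (op, n) in counts {
        if n < 0 {
            for _ in 0..(-n) { removed.push((*op).clone()); }
        } else {
            for _ in 0..n { added.push((*op).clone()); }
        }
    }
    (removed, added)
}
```

For a typical edit, both returned lists are tiny relative to the file's full operation list, so only the delta needs to be applied to the graph.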
### Order-dependent diagnostics
Operations preserve source order, which the current graph doesn't. Consider:
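The snippet referenced here was lost in extraction; a minimal reconstruction of the kind of code described (the class name `Foo` is assumed, and the rescue is added here only so the file loads):

```ruby
begin
  class Foo
    # Ruby executes class bodies top to bottom, so this call runs
    # before BAR has been assigned and raises NameError at load time.
    private_constant :BAR
    BAR = 42
  end
rescue NameError => e
  puts e.message
end
```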
Ruby raises `NameError` because `private_constant :BAR` runs before `BAR` exists. Today Rubydex can't catch this because both the constant and the visibility change become unordered definitions in the graph, and resolution wires them together regardless of order. With operations, the ordering is explicit (`SetConstantVisibility` before `DefineConstant`), and a diagnostic pass can detect the forward reference.

## How
- `Operation` enum (`operation/mod.rs`): ~20 variants covering namespace scopes (`EnterClass`, `EnterModule`, `EnterSingletonClass`, `ExitScope`), method operations (`EnterMethod`, `DefineAttribute`, `AliasMethod`, `SetMethodVisibility`, `SetDefaultVisibility`), constants (`DefineConstant`, `AliasConstant`, `SetConstantVisibility`), variables (`DefineInstanceVariable`, `DefineClassVariable`, `DefineGlobalVariable`), mixins (`Mixin`), and references (`ReferenceConstant`, `ReferenceMethod`). Each variant is its own struct so consumers can take fields by value or by reference.
- `OperationPrinter` (`operation/printer.rs`): Debug formatter that pretty-prints `Vec<Operation>` with indentation reflecting scope nesting. Used for debugging and testing the builder output.
- `RubyOperationBuilder` (`operation/ruby_builder.rs`): Walks the Ruby AST and produces `Vec<Operation>`. Same visitor pattern as `RubyIndexer`, but emits operations instead of mutating a graph. ~3100 lines, 1:1 correspondence with every `RubyIndexer` behavior.
- `OperationApplier` (`operation/applier.rs`): Converts `Vec<Operation>` → `LocalGraph`. Consumes operations by value (zero clones). This is scaffolding that exists only because the current merge pipeline expects a `LocalGraph`. In the target architecture, resolution consumes operations directly and both `OperationApplier` and `LocalGraph` go away.
- `IndexerBackend` enum (`indexing.rs`, `main.rs`): `--indexer ruby_indexer|operation_builder` CLI flag to switch backends. Both backends run the same shared test suite (166 test functions × 2 backends).

## Parity verification
Both the indexer test suite (166 tests) and the resolution test suite (208 tests) run with both backends using a `#[path]` module trick: each parent module provides a `backend()` function via `super::backend()`, so the same test file exercises both `RubyIndexer` and `OperationBuilder`.
#[path]module trick — each parent module provides abackend()function viasuper::backend(), so the same test file exercises bothRubyIndexerandOperationBuilder.Both backends produce identical output on Shopify core:
## Performance
Essentially identical performance. The intermediate `Vec<Operation>` per file adds a very small overhead.