Introduce oak_db and File #1213

Open
lionel- wants to merge 7 commits into main from oak/salsa-2-file

Introduce oak_db and File#1213
lionel- wants to merge 7 commits into
mainfrom
oak/salsa-2-file

Conversation

@lionel- (Contributor) commented May 13, 2026

Branched from #1211

Closes #1141
Progress towards #1212

This PR introduces

  • The oak_db crate which hosts a Salsa database containing all LSP inputs.
  • The File data structure that represents an R file (similar to Document in the legacy LSP). It holds the parse tree and semantic index for a file.
  • Dependency on salsa (latest version).
  • The parse tree is now a Salsa query on file contents, and the semantic index is a query on the parse tree.

The cache holds at most 128 parse trees at any time (following rust-analyzer) to spare memory.
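The eviction policy behind lru = 128 is a plain least-recently-used cache. As a rough illustration of what that policy means (a toy model in self-contained Rust, not salsa's actual implementation; capacity 2 for brevity):

```rust
use std::collections::VecDeque;

// Toy LRU: keeps at most `cap` (key, value) pairs. A lookup moves the
// entry to the back (most recent); inserting at capacity evicts from
// the front (least recent). Not salsa's implementation, just the
// policy it applies with `lru = 128`.
struct Lru<K: PartialEq + Clone, V> {
    cap: usize,
    entries: VecDeque<(K, V)>,
}

impl<K: PartialEq + Clone, V> Lru<K, V> {
    fn new(cap: usize) -> Self {
        Lru { cap, entries: VecDeque::new() }
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        if let Some(pos) = self.entries.iter().position(|(k, _)| k == key) {
            let entry = self.entries.remove(pos).unwrap();
            self.entries.push_back(entry); // mark as most recently used
            self.entries.back().map(|(_, v)| v)
        } else {
            None
        }
    }

    fn insert(&mut self, key: K, value: V) {
        if self.entries.len() == self.cap {
            self.entries.pop_front(); // evict least recently used
        }
        self.entries.push_back((key, value));
    }
}
```

Evicting a parse tree is safe here precisely because derived data holds AstPtrs rather than tree nodes, so an evicted tree can simply be reparsed on demand.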

Higher level consumers should depend on narrow Salsa queries, to reduce the risk of being invalidated by unrelated edits. For this reason, although File contains the parse tree and semantic index, they are kept pub(crate) so they are not reachable from outside oak_db (e.g. from oak_ide).
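The payoff of narrow queries is equality-based early cutoff: a downstream computation keyed on a small value re-runs only when that value changes. A hand-rolled sketch of the idea in plain Rust (no salsa; Exports and the "export " prefix are made up for the example):

```rust
// Illustrative sketch only: a consumer that depends on a narrow value
// (Exports) rather than the whole index re-runs only when that value
// structurally changes, so unrelated edits don't invalidate it.
#[derive(Clone, PartialEq, Eq, Debug)]
struct Exports(Vec<String>);

struct File {
    contents: String,
}

// Narrow "query": just the exported names, not the whole index.
fn exports(file: &File) -> Exports {
    Exports(
        file.contents
            .lines()
            .filter_map(|l| l.strip_prefix("export "))
            .map(str::to_owned)
            .collect(),
    )
}

// Downstream consumer with equality-based early cutoff: the expensive
// part re-runs only when `Exports` differs from last time.
struct Completions {
    last: Option<Exports>,
    recomputes: u32,
}

impl Completions {
    fn new() -> Self {
        Completions { last: None, recomputes: 0 }
    }

    fn query(&mut self, file: &File) -> usize {
        let e = exports(file);
        if self.last.as_ref() != Some(&e) {
            self.recomputes += 1; // stand-in for expensive work
            self.last = Some(e);
        }
        self.last.as_ref().unwrap().0.len()
    }
}
```

Salsa automates this comparison per query; keeping the coarse index pub(crate) forces consumers onto the narrow values where the comparison actually pays off.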

For context we're working our way up towards:

  • A file tree starting from roots (workspace roots and lib path folders) and indexing all R files wrapped up in File
  • A Source Graph (Moving towards a source graph #1183) relating scripts, workspace packages, and installed packages. The nodes point to File handles from the file tree.

@DavisVaughan (Contributor) left a comment

A few particularly important comments about our salsa::tracked methods. It feels important to get these right, so if you disagree with my comments it would be nice to get clarification before merging.

Comment thread crates/oak_db/src/db.rs
/// Queries take a `dyn Db` rather than the concrete database owned by
/// the LSP layer.
#[salsa::db]
pub trait Db: salsa::Database {}
Contributor:

So this is the "subtrait" idea and is where we get to add some of our own methods onto a database?

Contributor:

I guess eventually we'll probably bring over some of these ty ideas, in particular files()?

/// Most basic database that gives access to files, the host system, source code, and parsed AST.
#[salsa::db]
pub trait Db: salsa::Database {
    fn vendored(&self) -> &VendoredFileSystem;
    fn system(&self) -> &dyn System;
    fn files(&self) -> &Files;
    fn python_version(&self) -> PythonVersion;
}

Contributor Author:

yep Files is coming up!

@@ -0,0 +1 @@
mod file;
Contributor:

what's up with this file?

Contributor Author:

Sorry I don't understand

Contributor:

Like, this file literally just has mod file; in it and doesn't appear to be used anywhere. What is it for?

Contributor Author:

It loads the test module tests/file.rs

Contributor:

Oh you have a unit test tests/ folder, I totally just assumed that was integration tests

Interesting model!

Contributor Author:

yep this way we keep implementation files clean while also avoiding exposing stuff for integration tests

Comment thread crates/oak_db/src/tests/file.rs
Comment thread crates/oak_db/src/tests/file.rs Outdated
Comment thread crates/oak_db/src/file.rs
/// (`exports`, `external_scope`) to avoid depending on the full index,
/// which would invalidate on every internal edit.
#[salsa::tracked(returns(ref))]
pub(crate) fn semantic_index(self, db: &dyn Db) -> SemanticIndex {
Contributor:

Claude has helped me understand that this currently unused but obviously useful helper is the reason for the Eq/PartialEq on SemanticIndex

(that and this one test crates/oak_db/src/tests/file.rs:114 which could be tweaked)

It still felt like a footgun to me that we have a big SemanticIndex as Eq/PartialEq, so I asked Claude to tell me what in the world ty is doing differently.

And look at this!

  // ruff/.../semantic_index.rs:56
  #[salsa::tracked(returns(ref), no_eq, heap_size=...)]
  pub(crate) fn semantic_index(db: &dyn Db, file: File) -> SemanticIndex<'_> { ... }

no_eq tells salsa: don't bother comparing new vs old, just always invalidate downstream. So
SemanticIndex needs Update (for safe revision-to-revision storage) and get_size2::GetSize (for the
heap-size accounting), but PartialEq / Eq would be dead weight — they'd never get called.

And then it reiterated everything else we've been discussing regarding narrowly scoped queries

  // 2. ty exposes narrow per-scope tracked queries on top.

  // ruff/.../semantic_index.rs:70, 93
  #[salsa::tracked(returns(deref), heap_size=...)]
  pub(crate) fn place_table<'db>(db: &'db dyn Db, scope: ScopeId<'db>) -> Arc<PlaceTable> {
      let index = semantic_index(db, file);
      Arc::clone(&index.place_tables[scope.file_scope_id(db)])
  }

Each of those does proper caching, but if Salsa determines that one of these functions needs to be rerun, its call to semantic_index() regenerates the semantic index unconditionally, which actually does feel right to me.

And here's the rest of Claude's comments if you're interested

  Why this is a better design

  Backdating only pays off where you frequently produce structurally-equal output. At the aggregate
  level, that almost never happens — any meaningful edit touches something in the index, so a deep == on
  the whole SemanticIndex walks the entire file's worth of scopes/symbols/defs/uses and almost always
  returns false. Pure cost, no win.

  At the per-scope level, equality holds often: editing a function body leaves every sibling scope's
  PlaceTable and UseDefMap structurally identical. Backdating on those small pieces lets type inference
  for unaffected scopes skip entirely. That's the actual cache-granularity win.

  ty also gets a small parse-level safety net for free: a true no-op edit (type then delete) gets caught
  by OakParse-style structural equality at the parse query, so semantic_index never even re-runs.
  PartialEq on SemanticIndex itself isn't what catches that case.

  Where this leaves ark

  The PartialEq, Eq on oak_semantic::SemanticIndex is a placeholder, not a load-bearing design choice.
  The header comment at semantic_index.rs:47-51 and 46-65 actually anticipates exactly ty's strategy:

  ▎ "ty wraps per-scope tables in Arc so they can be returned from individual salsa tracked queries (e.g.
  ▎  place_table(db, scope) -> Arc<PlaceTable>). Salsa compares the returned Arc by Eq to skip re-running
  ▎  downstream queries when a scope's table hasn't changed. We defer that until salsa is introduced."

  Salsa is now introduced at the File/parse/semantic_index level, but the per-scope sub-queries aren't
  built yet. Until they are, the PartialEq derive on the aggregate buys you essentially nothing (and
  costs an expensive deep walk on every reparse where the index changed at all — i.e. most of them).

  Concrete next steps if you want to match ty:

  1. Change File::semantic_index to #[salsa::tracked(returns(ref), no_eq)] in
  crates/oak_db/src/file.rs:46.
  2. Drop PartialEq, Eq from SemanticIndex. You'll likely need salsa::Update (auto-derivable) instead;
  the inner field types may need it too.
  3. Wrap the per-scope fields in Arc: IndexVec<ScopeId, Arc<SymbolTable>>, IndexVec<ScopeId, 
  Arc<UseDefMap>>, etc.
  4. Add narrow tracked queries (symbol_table(db, scope), use_def_map(db, scope), file_exports(db, file),
   file_attached_packages(db, file)) that Arc::clone the relevant field. These keep their default eq
  behavior so they can backdate.
  5. Rewrite the test at crates/oak_db/src/tests/file.rs:114 (semantic_index_matches_oak_semantic) to
  compare via accessors (file_exports, scopes, etc.) instead of one big assert_eq!.

  The PartialEq derive isn't wrong today — it's just a forward-compatible placeholder that will get
  replaced when the per-scope query layer lands. If you want, you can do the no_eq flip now as a cheap,
  isolated win (skip the wasted deep comparison) and refactor the per-scope wrapping later.

Contributor Author:

yep I think I was confused on the need for coarse dep on the index while we transition towards narrow queries.
After all we don't care if the index is transitionally inefficient...

One subtlety I now understand: even once we have only narrow tracked queries, we still need to track the index with no_eq. Otherwise each narrow query would rebuild it whenever it was invalidated.
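That subtlety can be modeled in plain Rust (no salsa; all names hypothetical): no_eq only skips the old-vs-new comparison on the index, it does not disable caching within a revision, so all narrow queries share a single build of the aggregate.

```rust
use std::cell::{Cell, RefCell};

// Illustrative model only: the aggregate index is built at most once
// per revision and shared by every narrow query. `no_eq` would mean
// "never compare old index vs new", not "don't memoize it".
#[derive(Clone)]
struct Index {
    symbols: Vec<String>,
    exports: Vec<String>,
}

struct Db {
    revision: u64,
    contents: String,
    memo: RefCell<Option<(u64, Index)>>,
    builds: Cell<u32>, // counts expensive index builds
}

impl Db {
    fn new(contents: &str) -> Self {
        Db {
            revision: 0,
            contents: contents.to_string(),
            memo: RefCell::new(None),
            builds: Cell::new(0),
        }
    }

    // Aggregate query: built at most once per revision.
    fn semantic_index(&self) -> Index {
        if let Some((rev, idx)) = self.memo.borrow().as_ref() {
            if *rev == self.revision {
                return idx.clone();
            }
        }
        self.builds.set(self.builds.get() + 1); // expensive build
        let idx = Index {
            symbols: self.contents.split_whitespace().map(str::to_owned).collect(),
            exports: Vec::new(),
        };
        *self.memo.borrow_mut() = Some((self.revision, idx.clone()));
        idx
    }

    // Narrow queries all go through the shared memoized index.
    fn symbols(&self) -> Vec<String> {
        self.semantic_index().symbols
    }

    fn exports(&self) -> Vec<String> {
        self.semantic_index().exports
    }
}
```

Without the memo, each narrow query invalidated by an edit would trigger its own rebuild of the whole index.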

Contributor Author:

Now no_eq, but we still need Eq/PartialEq.

I'm seeing a caveat about cycles in the doc, so we might have to remove the no_eq later on, hopefully not (but probably no big deal if we have to).

Comment thread crates/oak_db/src/file.rs
let parsed = self.parse(db);
oak_semantic::semantic_index(&parsed.tree(), self.url(db))
}
}
Contributor:

Random comment, I noticed this in ty

/// Returns the place table for a specific `scope`.
///
/// Using [`place_table`] over [`semantic_index`] has the advantage that
/// Salsa can avoid invalidating dependent queries if this scope's place table
/// is unchanged.
#[salsa::tracked(returns(deref), heap_size=ruff_memory_usage::heap_size)]
pub(crate) fn place_table<'db>(db: &'db dyn Db, scope: ScopeId<'db>) -> Arc<PlaceTable> {
    let file = scope.file(db);
    let _span = tracing::trace_span!("place_table", scope=?scope.as_id(), ?file).entered();
    let index = semantic_index(db, file);
    Arc::clone(&index.place_tables[scope.file_scope_id(db)])
}

That's cool: they have FileScopeId, which I believe is most similar to our ScopeId right now, i.e. it starts at 0 every time for any given file

And then their ScopeId is abstracted on top of that to generate a unique id of File+FileScopeId via

/// A cross-module identifier of a scope that can be used as a salsa query parameter.
#[salsa::tracked(debug, heap_size=ruff_memory_usage::heap_size)]
pub struct ScopeId<'db> {
    pub file: File,

    pub file_scope_id: FileScopeId,
}

So then their granular queries can query via a single ScopeId, which seems quite nice
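A minimal sketch of what an analogous oak_db key could look like (plain Rust, all names assumed; ty's version is additionally a salsa tracked struct): pairing the per-file scope id with its file handle yields a globally unique, hashable query key.

```rust
// Illustrative only: per-file scope ids restart at 0 in every file,
// so a cross-file key pairs the scope id with its file handle.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct FileId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct FileScopeId(u32);

/// A cross-file scope identifier usable as a query key.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ScopeId {
    file: FileId,
    file_scope_id: FileScopeId,
}
```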

Contributor Author:

Nice, it would make sense to have a similar tracked oak_db type

Comment thread crates/oak_db/src/file.rs
/// memory cleanly. Derived queries (e.g. `semantic_index`) store
/// `AstPtr`s rather than tree nodes, so they don't pin an evicted tree.
#[salsa::tracked(returns(ref), lru = 128)]
pub(crate) fn parse(self, db: &dyn Db) -> OakParse {
Contributor:

Ok ok, so this is going to sound weird, but bear with me

I think all salsa tracked functions should be parameterized as

pub(crate) fn parse(db: &dyn Db, <one salsa tracked item>) -> Value {

i.e. you'll notice that in ty they have

#[salsa::tracked(returns(ref), no_eq, heap_size=ruff_memory_usage::heap_size, lru=200)]
pub fn parsed_module(db: &dyn Db, file: File) -> ParsedModule {

and in rust analyzer they have

impl EditionedFileId {
    pub fn parse(self, db: &dyn SourceDatabase) -> syntax::Parse<ast::SourceFile> {
        #[salsa::tracked(lru = 128)]
        pub fn parse(
            db: &dyn SourceDatabase,
            file_id: EditionedFileId,
        ) -> syntax::Parse<ast::SourceFile> {
            let _p = tracing::info_span!("parse", ?file_id).entered();
            let (file_id, edition) = file_id.unpack(db);
            let text = db.file_text(file_id).text(db);
            ast::SourceFile::parse(text, edition)
        }
        parse(db, self)
    }

where the inner tracked parse() is of this form

I think this is a rather important footgun, but I don't quite know why

Carefully read this part of the book
https://salsa-rs.netlify.app/tutorial/parser#parameters-to-a-tracked-function

The first parameter to a tracked function is always the database, db: &dyn crate::Db.

The second parameter to a tracked function is always some kind of Salsa struct.

Tracked functions may take other arguments as well, though our examples here do not. Functions that take additional arguments are less efficient and flexible. It's generally better to structure tracked functions as functions of a single Salsa struct if possible.

So I have a strong gut feeling we should rewrite these as free functions of this format!

Contributor Author:

hmm I think methods and free functions expand to the same thing but I'll double check

Contributor Author:

yep they expand and behave exactly the same way.

We can keep in mind the advice regarding additional arguments. (Although I imagine the performance cost is small)

Contributor:

yeah, I also double checked today: methods of (self, db) and free functions of (db, tracked_struct) expand the same way, as long as you mark the impl block with #[salsa::tracked] as well, which you do

Book does not reflect this 😛
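The equivalence is easy to see in plain Rust (salsa attributes and caching omitted, names hypothetical): the method form is just sugar for a free function taking the item as its second parameter, which is exactly how the rust-analyzer wrapper quoted above delegates.

```rust
// Plain-Rust illustration (no salsa): a `(self, db)` method and a
// `(db, item)` free function compute the same thing; the method can
// simply delegate, as rust-analyzer's EditionedFileId::parse does.
struct Db {
    file_text: String,
}

#[derive(Clone, Copy)]
struct FileId;

// Free-function form: (db, <one salsa item>) -> Value
fn parse(db: &Db, _file: FileId) -> usize {
    db.file_text.len() // stand-in for building a parse tree
}

impl FileId {
    // Method form: just delegates to the free function.
    fn parse(self, db: &Db) -> usize {
        parse(db, self)
    }
}
```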

Contributor Author:

oh I think salsa is under quite active development, so the book might not always be up to date

@lionel- lionel- force-pushed the oak/salsa-1-ast-ptr branch from d08a92d to 8fd6930 Compare May 14, 2026 08:36
Base automatically changed from oak/salsa-1-ast-ptr to main May 14, 2026 08:37
@lionel- lionel- force-pushed the oak/salsa-2-file branch from ad8bcd2 to 800d17c Compare May 14, 2026 08:48
@lionel- lionel- force-pushed the oak/salsa-2-file branch from 800d17c to 95ae7c8 Compare May 14, 2026 12:02
@lionel- (Contributor Author) commented May 14, 2026

I made File::semantic_index() private to ensure no coarse dependency can be made by other modules.

(It'll temporarily go back to pub(crate) in another PR so I can better stage things)



Development

Successfully merging this pull request may close these issues.

Implement the semantic index

2 participants