Introduce oak_db and File #1213

Open
lionel- wants to merge 7 commits into main from oak/salsa-2-file

Introduce oak_db and File#1213
lionel- wants to merge 7 commits into
mainfrom
oak/salsa-2-file

Conversation

@lionel- (Contributor) commented May 13, 2026

Branched from #1211

Closes #1141
Progress towards #1212

This PR introduces

  • The oak_db crate which hosts a Salsa database containing all LSP inputs.
  • The File data structure that represents an R file (similar to Document in the legacy LSP). It holds the parse tree and semantic index for a file.
  • Dependency on salsa (latest version).
  • The parse tree is now a Salsa query on file contents, and the semantic index is a query on the parse tree.

The cache holds at most 128 parse trees at any time (following rust-analyzer) to spare memory.
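The eviction policy behind lru = 128 is a plain least-recently-used cache. As a rough illustration of what that policy means (a toy model in self-contained Rust, not salsa's actual implementation; capacity 2 for brevity):

```rust
use std::collections::VecDeque;

// Toy LRU: keeps at most `cap` (key, value) pairs. A lookup moves the
// entry to the back (most recent); inserting at capacity evicts from
// the front (least recent). Not salsa's implementation, just the
// policy it applies with `lru = 128`.
struct Lru<K: PartialEq + Clone, V> {
    cap: usize,
    entries: VecDeque<(K, V)>,
}

impl<K: PartialEq + Clone, V> Lru<K, V> {
    fn new(cap: usize) -> Self {
        Lru { cap, entries: VecDeque::new() }
    }

    fn get(&mut self, key: &K) -> Option<&V> {
        if let Some(pos) = self.entries.iter().position(|(k, _)| k == key) {
            let entry = self.entries.remove(pos).unwrap();
            self.entries.push_back(entry); // mark as most recently used
            self.entries.back().map(|(_, v)| v)
        } else {
            None
        }
    }

    fn insert(&mut self, key: K, value: V) {
        if self.entries.len() == self.cap {
            self.entries.pop_front(); // evict least recently used
        }
        self.entries.push_back((key, value));
    }
}
```

Evicting a parse tree is safe here precisely because derived data holds AstPtrs rather than tree nodes, so an evicted tree can simply be reparsed on demand.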

Higher level consumers should depend on narrow Salsa queries, to reduce the risk of being invalidated by unrelated edits. For this reason, although File contains the parse tree and semantic index, they are kept pub(crate) so they are not reachable from outside oak_db (e.g. from oak_ide).
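The payoff of narrow queries is equality-based early cutoff: a downstream computation keyed on a small value re-runs only when that value changes. A hand-rolled sketch of the idea in plain Rust (no salsa; Exports and the "export " prefix are made up for the example):

```rust
// Illustrative sketch only: a consumer that depends on a narrow value
// (Exports) rather than the whole index re-runs only when that value
// structurally changes, so unrelated edits don't invalidate it.
#[derive(Clone, PartialEq, Eq, Debug)]
struct Exports(Vec<String>);

struct File {
    contents: String,
}

// Narrow "query": just the exported names, not the whole index.
fn exports(file: &File) -> Exports {
    Exports(
        file.contents
            .lines()
            .filter_map(|l| l.strip_prefix("export "))
            .map(str::to_owned)
            .collect(),
    )
}

// Downstream consumer with equality-based early cutoff: the expensive
// part re-runs only when `Exports` differs from last time.
struct Completions {
    last: Option<Exports>,
    recomputes: u32,
}

impl Completions {
    fn new() -> Self {
        Completions { last: None, recomputes: 0 }
    }

    fn query(&mut self, file: &File) -> usize {
        let e = exports(file);
        if self.last.as_ref() != Some(&e) {
            self.recomputes += 1; // stand-in for expensive work
            self.last = Some(e);
        }
        self.last.as_ref().unwrap().0.len()
    }
}
```

Salsa automates this comparison per query; keeping the coarse index pub(crate) forces consumers onto the narrow values where the comparison actually pays off.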

For context we're working our way up towards:

  • A file tree starting from roots (workspace roots and lib path folders) and indexing all R files wrapped up in File
  • A Source Graph (Moving towards a source graph #1183) relating scripts, workspace packages, and installed packages. The nodes point to File handles from the file tree.

@DavisVaughan (Contributor) left a comment

A few particularly important comments about our salsa::tracked methods. It feels important to get these right, so if you disagree with my comments it would be nice to get clarification before merging.

Comment thread crates/oak_db/src/db.rs
/// Queries take a `dyn Db` rather than the concrete database owned by
/// the LSP layer.
#[salsa::db]
pub trait Db: salsa::Database {}
Contributor:

So this is the "subtrait" idea and is where we get to add some of our own methods onto a database?

Contributor:

I guess eventually we'll probably bring over some of these ty ideas, in particular files()?

/// Most basic database that gives access to files, the host system, source code, and parsed AST.
#[salsa::db]
pub trait Db: salsa::Database {
    fn vendored(&self) -> &VendoredFileSystem;
    fn system(&self) -> &dyn System;
    fn files(&self) -> &Files;
    fn python_version(&self) -> PythonVersion;
}

Contributor Author:

yep Files is coming up!

@@ -0,0 +1 @@
mod file;
Contributor:

what's up with this file?

Contributor Author:

Sorry I don't understand

Contributor:

Like, this file literally just has mod file; in it and doesn't appear to be used anywhere. What is it for?

Contributor Author:

It loads the test module tests/file.rs

Contributor:

Oh you have a unit test tests/ folder, I totally just assumed that was integration tests

Interesting model!

Contributor Author:

yep this way we keep implementation files clean while also avoiding exposing stuff for integration tests

Comment thread crates/oak_db/src/tests/file.rs
Comment thread crates/oak_db/src/tests/file.rs Outdated
Comment thread crates/oak_db/src/file.rs
/// (`exports`, `external_scope`) to avoid depending on the full index,
/// which would invalidate on every internal edit.
#[salsa::tracked(returns(ref))]
pub(crate) fn semantic_index(self, db: &dyn Db) -> SemanticIndex {
Contributor:

Claude has helped me understand that this currently unused but obviously useful helper is the reason for the Eq/PartialEq on SemanticIndex

(that and this one test crates/oak_db/src/tests/file.rs:114 which could be tweaked)

It still felt like a footgun to me that we have a big SemanticIndex as Eq/PartialEq, so I asked Claude to tell me what in the world ty is doing differently.

And look at this!

  // ruff/.../semantic_index.rs:56
  #[salsa::tracked(returns(ref), no_eq, heap_size=...)]
  pub(crate) fn semantic_index(db: &dyn Db, file: File) -> SemanticIndex<'_> { ... }

no_eq tells salsa: don't bother comparing new vs old, just always invalidate downstream. So
SemanticIndex needs Update (for safe revision-to-revision storage) and get_size2::GetSize (for the
heap-size accounting), but PartialEq / Eq would be dead weight — they'd never get called.

And then it reiterated everything else we've been discussing regarding narrowly scoped queries

  // 2. ty exposes narrow per-scope tracked queries on top.

  // ruff/.../semantic_index.rs:70, 93
  #[salsa::tracked(returns(deref), heap_size=...)]
  pub(crate) fn place_table<'db>(db: &'db dyn Db, scope: ScopeId<'db>) -> Arc<PlaceTable> {
      let index = semantic_index(db, file);
      Arc::clone(&index.place_tables[scope.file_scope_id(db)])
  }

Each of those does proper caching, but if Salsa determines that one of these functions needs to be rerun, its call to semantic_index() regenerates the semantic index unconditionally, which actually does feel right to me.

And here's the rest of Claude's comments if you're interested

  Why this is a better design

  Backdating only pays off where you frequently produce structurally-equal output. At the aggregate
  level, that almost never happens — any meaningful edit touches something in the index, so a deep == on
  the whole SemanticIndex walks the entire file's worth of scopes/symbols/defs/uses and almost always
  returns false. Pure cost, no win.

  At the per-scope level, equality holds often: editing a function body leaves every sibling scope's
  PlaceTable and UseDefMap structurally identical. Backdating on those small pieces lets type inference
  for unaffected scopes skip entirely. That's the actual cache-granularity win.

  ty also gets a small parse-level safety net for free: a true no-op edit (type then delete) gets caught
  by OakParse-style structural equality at the parse query, so semantic_index never even re-runs.
  PartialEq on SemanticIndex itself isn't what catches that case.

  Where this leaves ark

  The PartialEq, Eq on oak_semantic::SemanticIndex is a placeholder, not a load-bearing design choice.
  The header comment at semantic_index.rs:47-51 and 46-65 actually anticipates exactly ty's strategy:

  ▎ "ty wraps per-scope tables in Arc so they can be returned from individual salsa tracked queries (e.g.
  ▎  place_table(db, scope) -> Arc<PlaceTable>). Salsa compares the returned Arc by Eq to skip re-running
  ▎  downstream queries when a scope's table hasn't changed. We defer that until salsa is introduced."

  Salsa is now introduced at the File/parse/semantic_index level, but the per-scope sub-queries aren't
  built yet. Until they are, the PartialEq derive on the aggregate buys you essentially nothing (and
  costs an expensive deep walk on every reparse where the index changed at all — i.e. most of them).

  Concrete next steps if you want to match ty:

  1. Change File::semantic_index to #[salsa::tracked(returns(ref), no_eq)] in
  crates/oak_db/src/file.rs:46.
  2. Drop PartialEq, Eq from SemanticIndex. You'll likely need salsa::Update (auto-derivable) instead;
  the inner field types may need it too.
  3. Wrap the per-scope fields in Arc: IndexVec<ScopeId, Arc<SymbolTable>>, IndexVec<ScopeId, 
  Arc<UseDefMap>>, etc.
  4. Add narrow tracked queries (symbol_table(db, scope), use_def_map(db, scope), file_exports(db, file),
   file_attached_packages(db, file)) that Arc::clone the relevant field. These keep their default eq
  behavior so they can backdate.
  5. Rewrite the test at crates/oak_db/src/tests/file.rs:114 (semantic_index_matches_oak_semantic) to
  compare via accessors (file_exports, scopes, etc.) instead of one big assert_eq!.

  The PartialEq derive isn't wrong today — it's just a forward-compatible placeholder that will get
  replaced when the per-scope query layer lands. If you want, you can do the no_eq flip now as a cheap,
  isolated win (skip the wasted deep comparison) and refactor the per-scope wrapping later.

Contributor Author:

yep I think I was confused on the need for coarse dep on the index while we transition towards narrow queries.
After all we don't care if the index is transitionally inefficient...

One subtlety I now understand: even once we have only narrow tracked queries, we still need to track the index with no_eq. Otherwise each narrow query would rebuild it whenever it was invalidated.
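That subtlety can be modeled in plain Rust (no salsa; all names hypothetical): no_eq only skips the old-vs-new comparison on the index, it does not disable caching within a revision, so all narrow queries share a single build of the aggregate.

```rust
use std::cell::{Cell, RefCell};

// Illustrative model only: the aggregate index is built at most once
// per revision and shared by every narrow query. `no_eq` would mean
// "never compare old index vs new", not "don't memoize it".
#[derive(Clone)]
struct Index {
    symbols: Vec<String>,
    exports: Vec<String>,
}

struct Db {
    revision: u64,
    contents: String,
    memo: RefCell<Option<(u64, Index)>>,
    builds: Cell<u32>, // counts expensive index builds
}

impl Db {
    fn new(contents: &str) -> Self {
        Db {
            revision: 0,
            contents: contents.to_string(),
            memo: RefCell::new(None),
            builds: Cell::new(0),
        }
    }

    // Aggregate query: built at most once per revision.
    fn semantic_index(&self) -> Index {
        if let Some((rev, idx)) = self.memo.borrow().as_ref() {
            if *rev == self.revision {
                return idx.clone();
            }
        }
        self.builds.set(self.builds.get() + 1); // expensive build
        let idx = Index {
            symbols: self.contents.split_whitespace().map(str::to_owned).collect(),
            exports: Vec::new(),
        };
        *self.memo.borrow_mut() = Some((self.revision, idx.clone()));
        idx
    }

    // Narrow queries all go through the shared memoized index.
    fn symbols(&self) -> Vec<String> {
        self.semantic_index().symbols
    }

    fn exports(&self) -> Vec<String> {
        self.semantic_index().exports
    }
}
```

Without the memo, each narrow query invalidated by an edit would trigger its own rebuild of the whole index.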

Contributor Author:

Now no_eq, but we still need Eq/PartialEq.

I'm seeing a caveat about cycles in the doc, so we might have to remove the no_eq later on, hopefully not (but probably no big deal if we have to).

Comment thread crates/oak_db/src/file.rs
let parsed = self.parse(db);
oak_semantic::semantic_index(&parsed.tree(), self.url(db))
}
}
Contributor:

Random comment, I noticed this in ty

/// Returns the place table for a specific `scope`.
///
/// Using [`place_table`] over [`semantic_index`] has the advantage that
/// Salsa can avoid invalidating dependent queries if this scope's place table
/// is unchanged.
#[salsa::tracked(returns(deref), heap_size=ruff_memory_usage::heap_size)]
pub(crate) fn place_table<'db>(db: &'db dyn Db, scope: ScopeId<'db>) -> Arc<PlaceTable> {
    let file = scope.file(db);
    let _span = tracing::trace_span!("place_table", scope=?scope.as_id(), ?file).entered();
    let index = semantic_index(db, file);
    Arc::clone(&index.place_tables[scope.file_scope_id(db)])
}

That's cool: they have FileScopeId, which I believe is most similar to our ScopeId right now, i.e. it starts at 0 every time for any given file

And then their ScopeId is abstracted on top of that to generate a unique id of File+FileScopeId via

/// A cross-module identifier of a scope that can be used as a salsa query parameter.
#[salsa::tracked(debug, heap_size=ruff_memory_usage::heap_size)]
pub struct ScopeId<'db> {
    pub file: File,

    pub file_scope_id: FileScopeId,
}

So then their granular queries can query via a single ScopeId, which seems quite nice
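A minimal sketch of what an analogous oak_db key could look like (plain Rust, all names assumed; ty's version is additionally a salsa tracked struct): pairing the per-file scope id with its file handle yields a globally unique, hashable query key.

```rust
// Illustrative only: per-file scope ids restart at 0 in every file,
// so a cross-file key pairs the scope id with its file handle.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct FileId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct FileScopeId(u32);

/// A cross-file scope identifier usable as a query key.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct ScopeId {
    file: FileId,
    file_scope_id: FileScopeId,
}
```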

Contributor Author:

Nice, it would make sense to have a similar tracked oak_db type

Comment thread crates/oak_db/src/file.rs
/// memory cleanly. Derived queries (e.g. `semantic_index`) store
/// `AstPtr`s rather than tree nodes, so they don't pin an evicted tree.
#[salsa::tracked(returns(ref), lru = 128)]
pub(crate) fn parse(self, db: &dyn Db) -> OakParse {
Contributor:

Ok ok, so this is going to sound weird, but bear with me

I think all salsa tracked functions should be parameterized as

pub(crate) fn parse(db: &dyn Db, <one salsa tracked item>) -> Value {

i.e. you'll notice that in ty they have

#[salsa::tracked(returns(ref), no_eq, heap_size=ruff_memory_usage::heap_size, lru=200)]
pub fn parsed_module(db: &dyn Db, file: File) -> ParsedModule {

and in rust analyzer they have

impl EditionedFileId {
    pub fn parse(self, db: &dyn SourceDatabase) -> syntax::Parse<ast::SourceFile> {
        #[salsa::tracked(lru = 128)]
        pub fn parse(
            db: &dyn SourceDatabase,
            file_id: EditionedFileId,
        ) -> syntax::Parse<ast::SourceFile> {
            let _p = tracing::info_span!("parse", ?file_id).entered();
            let (file_id, edition) = file_id.unpack(db);
            let text = db.file_text(file_id).text(db);
            ast::SourceFile::parse(text, edition)
        }
        parse(db, self)
    }

where the inner tracked parse() is of this form

I think this is a rather important footgun, but I don't quite know why

Carefully read this part of the book
https://salsa-rs.netlify.app/tutorial/parser#parameters-to-a-tracked-function

The first parameter to a tracked function is always the database, db: &dyn crate::Db.

The second parameter to a tracked function is always some kind of Salsa struct.

Tracked functions may take other arguments as well, though our examples here do not. Functions that take additional arguments are less efficient and flexible. It's generally better to structure tracked functions as functions of a single Salsa struct if possible.

So I have a strong gut feeling we should rewrite these as free functions of this format!

Contributor Author:

hmm I think methods and free functions expand to the same thing but I'll double check

Contributor Author:

yep they expand and behave exactly the same way.

We can keep in mind the advice regarding additional arguments. (Although I imagine the performance cost is small)

Contributor:

yeah, I also double checked today: methods of (self, db) and free functions of (db, tracked_struct) expand the same way, as long as you mark the impl block with #[salsa::tracked] as well, which you do

Book does not reflect this 😛
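The equivalence is easy to see in plain Rust (salsa attributes and caching omitted, names hypothetical): the method form is just sugar for a free function taking the item as its second parameter, which is exactly how the rust-analyzer wrapper quoted above delegates.

```rust
// Plain-Rust illustration (no salsa): a `(self, db)` method and a
// `(db, item)` free function compute the same thing; the method can
// simply delegate, as rust-analyzer's EditionedFileId::parse does.
struct Db {
    file_text: String,
}

#[derive(Clone, Copy)]
struct FileId;

// Free-function form: (db, <one salsa item>) -> Value
fn parse(db: &Db, _file: FileId) -> usize {
    db.file_text.len() // stand-in for building a parse tree
}

impl FileId {
    // Method form: just delegates to the free function.
    fn parse(self, db: &Db) -> usize {
        parse(db, self)
    }
}
```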

Contributor Author:

oh I think salsa is under quite active development, so the book might not always be up to date

@lionel- lionel- force-pushed the oak/salsa-1-ast-ptr branch from d08a92d to 8fd6930 Compare May 14, 2026 08:36
Base automatically changed from oak/salsa-1-ast-ptr to main May 14, 2026 08:37
@lionel- lionel- force-pushed the oak/salsa-2-file branch from ad8bcd2 to 800d17c Compare May 14, 2026 08:48
@lionel- lionel- force-pushed the oak/salsa-2-file branch from 800d17c to 95ae7c8 Compare May 14, 2026 12:02
@lionel- (Contributor Author) commented May 14, 2026

I made File::semantic_index() private to ensure no coarse dependency can be made by other modules.

(It'll temporarily go back to pub(crate) in another PR so I can better stage things)



Development

Successfully merging this pull request may close these issues.

Implement the semantic index

2 participants