Introduce oak_db and File #1213
Conversation
DavisVaughan left a comment
A few particularly important comments about our `salsa::tracked` methods. It feels important to get these right, so if you disagree with my comments it would be nice to get clarification before merging.
```rust
/// Queries take a `dyn Db` rather than the concrete database owned by
/// the LSP layer.
#[salsa::db]
pub trait Db: salsa::Database {}
```
So this is the "subtrait" idea and is where we get to add some of our own methods onto a database?
I guess eventually we'll probably bring over some of these ty ideas, in particular files()?
```rust
/// Most basic database that gives access to files, the host system, source code, and parsed AST.
#[salsa::db]
pub trait Db: salsa::Database {
    fn vendored(&self) -> &VendoredFileSystem;
    fn system(&self) -> &dyn System;
    fn files(&self) -> &Files;
    fn python_version(&self) -> PythonVersion;
}
```

yep Files is coming up!
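To make the subtrait idea concrete without pulling in salsa, here's a minimal std-only sketch of queries that take a `dyn Db` trait object instead of the concrete LSP database. All names here (`Db`, `LspDatabase`, `line_count`) are hypothetical, not from the PR:

```rust
use std::collections::HashMap;

// The trait that queries depend on; they never see the concrete type.
pub trait Db {
    fn source(&self, path: &str) -> Option<&str>;
}

// Concrete database owned by the LSP layer.
pub struct LspDatabase {
    files: HashMap<String, String>,
}

impl LspDatabase {
    pub fn new() -> Self {
        Self { files: HashMap::new() }
    }
    pub fn set_source(&mut self, path: &str, text: &str) {
        self.files.insert(path.to_string(), text.to_string());
    }
}

impl Db for LspDatabase {
    fn source(&self, path: &str) -> Option<&str> {
        self.files.get(path).map(String::as_str)
    }
}

// A "query" parameterized over `dyn Db`, as in the quoted doc comment.
pub fn line_count(db: &dyn Db, path: &str) -> usize {
    db.source(path).map_or(0, |text| text.lines().count())
}
```

The point is only the dependency direction: `line_count` compiles against the trait, so the LSP layer can swap or mock the database freely.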
```rust
mod file;
```
What's up with this file?
Sorry I don't understand
Like, this file literally just has `mod file;` in it but never looks to be used. What is it for?
It loads the test module `tests/file.rs`.
Oh you have a unit test tests/ folder, I totally just assumed that was integration tests
Interesting model!
yep this way we keep implementation files clean while also avoiding exposing stuff for integration tests
```rust
/// (`exports`, `external_scope`) to avoid depending on the full index,
/// which would invalidate on every internal edit.
#[salsa::tracked(returns(ref))]
pub(crate) fn semantic_index(self, db: &dyn Db) -> SemanticIndex {
```
Claude has helped me understand that this currently unused but obviously useful helper is the reason for the `Eq`/`PartialEq` on `SemanticIndex` (that and this one test, `crates/oak_db/src/tests/file.rs:114`, which could be tweaked).
It still felt like a footgun to me that we have a big SemanticIndex as Eq/PartialEq, so I asked claude to tell me what in the world ty is doing differently.
And look at this!
```rust
// ruff/.../semantic_index.rs:56
#[salsa::tracked(returns(ref), no_eq, heap_size=...)]
pub(crate) fn semantic_index(db: &dyn Db, file: File) -> SemanticIndex<'_> { ... }
```

`no_eq` tells salsa: don't bother comparing new vs old, just always invalidate downstream. So `SemanticIndex` needs `Update` (for safe revision-to-revision storage) and `get_size2::GetSize` (for the heap-size accounting), but `PartialEq`/`Eq` would be dead weight; they'd never get called.
And then it reiterated everything else we've been discussing regarding narrowly scoped queries
```rust
// 2. ty exposes narrow per-scope tracked queries on top.
// ruff/.../semantic_index.rs:70, 93
#[salsa::tracked(returns(deref), heap_size=...)]
pub(crate) fn place_table<'db>(db: &'db dyn Db, scope: ScopeId<'db>) -> Arc<PlaceTable> {
    let index = semantic_index(db, file);
    Arc::clone(&index.place_tables[scope.file_scope_id(db)])
}
```

Each of those does proper caching, but if Salsa determines the function needs to be rerun, the call to `semantic_index()` unconditionally regenerates the semantic index, which actually does feel right to me.
And here's the rest of Claude's comments if you're interested
Why this is a better design
Backdating only pays off where you frequently produce structurally-equal output. At the aggregate
level, that almost never happens — any meaningful edit touches something in the index, so a deep == on
the whole SemanticIndex walks the entire file's worth of scopes/symbols/defs/uses and almost always
returns false. Pure cost, no win.
At the per-scope level, equality holds often: editing a function body leaves every sibling scope's
PlaceTable and UseDefMap structurally identical. Backdating on those small pieces lets type inference
for unaffected scopes skip entirely. That's the actual cache-granularity win.
ty also gets a small parse-level safety net for free: a true no-op edit (type then delete) gets caught
by OakParse-style structural equality at the parse query, so semantic_index never even re-runs.
PartialEq on SemanticIndex itself isn't what catches that case.
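The backdating mechanism described here can be sketched with a hand-rolled memo (std only, no salsa): a query result remembers the revision at which its value last actually changed, and an equal recomputation leaves that revision alone, so dependents that already saw the old revision can skip rerunning. The `Memo` type is illustrative, not salsa's actual implementation:

```rust
// Hand-rolled sketch of salsa-style "backdating" / early cutoff.
#[derive(Clone, PartialEq, Debug)]
pub struct Memo<T> {
    pub value: T,
    pub changed_at: u64, // revision when the value last actually changed
}

impl<T: PartialEq> Memo<T> {
    pub fn new(value: T, revision: u64) -> Self {
        Self { value, changed_at: revision }
    }

    // Recompute at `revision`; only bump `changed_at` when the value differs.
    // Returns true if dependents need to rerun.
    pub fn update(&mut self, new_value: T, revision: u64) -> bool {
        if self.value == new_value {
            false // backdated: dependents see no change
        } else {
            self.value = new_value;
            self.changed_at = revision;
            true
        }
    }
}
```

With `no_eq`, the comparison in `update` is skipped entirely and `changed_at` is always bumped, which is exactly why the `==` pays off only where equal results are common (per-scope tables), not on the whole aggregate.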
Where this leaves ark
The PartialEq, Eq on oak_semantic::SemanticIndex is a placeholder, not a load-bearing design choice.
The header comment at semantic_index.rs:47-51 and 46-65 actually anticipates exactly ty's strategy:
> "ty wraps per-scope tables in Arc so they can be returned from individual salsa tracked queries (e.g. `place_table(db, scope) -> Arc<PlaceTable>`). Salsa compares the returned Arc by Eq to skip re-running downstream queries when a scope's table hasn't changed. We defer that until salsa is introduced."
Salsa is now introduced at the File/parse/semantic_index level, but the per-scope sub-queries aren't
built yet. Until they are, the PartialEq derive on the aggregate buys you essentially nothing (and
costs an expensive deep walk on every reparse where the index changed at all — i.e. most of them).
Concrete next steps if you want to match ty:
1. Change File::semantic_index to #[salsa::tracked(returns(ref), no_eq)] in
crates/oak_db/src/file.rs:46.
2. Drop PartialEq, Eq from SemanticIndex. You'll likely need salsa::Update (auto-derivable) instead;
the inner field types may need it too.
3. Wrap the per-scope fields in Arc: IndexVec<ScopeId, Arc<SymbolTable>>, IndexVec<ScopeId,
Arc<UseDefMap>>, etc.
4. Add narrow tracked queries (symbol_table(db, scope), use_def_map(db, scope), file_exports(db, file),
file_attached_packages(db, file)) that Arc::clone the relevant field. These keep their default eq
behavior so they can backdate.
5. Rewrite the test at crates/oak_db/src/tests/file.rs:114 (semantic_index_matches_oak_semantic) to
compare via accessors (file_exports, scopes, etc.) instead of one big assert_eq!.
The PartialEq derive isn't wrong today — it's just a forward-compatible placeholder that will get
replaced when the per-scope query layer lands. If you want, you can do the no_eq flip now as a cheap,
isolated win (skip the wasted deep comparison) and refactor the per-scope wrapping later.
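As a rough std-only illustration of steps 3 and 4 (Arc-wrapped per-scope tables plus a narrow accessor), with all type names hypothetical:

```rust
use std::sync::Arc;

#[derive(Debug, PartialEq)]
pub struct PlaceTable {
    pub names: Vec<String>,
}

pub struct SemanticIndex {
    // One table per scope, indexed by a file-local scope id.
    pub place_tables: Vec<Arc<PlaceTable>>,
}

// Narrow accessor: clones the Arc, not the table. If the scope's table
// survives a rebuild unchanged, handing out the same Arc keeps
// downstream comparisons cheap.
pub fn place_table(index: &SemanticIndex, scope: usize) -> Arc<PlaceTable> {
    Arc::clone(&index.place_tables[scope])
}
```

In the real version the accessor would be a tracked query keyed on a scope id, with default equality so it can backdate.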
yep I think I was confused on the need for coarse dep on the index while we transition towards narrow queries.
After all we don't care if the index is transitionally inefficient...
One subtlety I now understand: even once we have only narrow tracked queries, we still need to track the index with `no_eq`. Otherwise each narrow query would rebuild it every time it is invalidated.
Now using `no_eq`, but we still need `Eq`/`PartialEq`.
I'm seeing a caveat about cycles in the doc, so we might have to remove the no_eq later on, hopefully not (but probably no big deal if we have to).
```rust
        let parsed = self.parse(db);
        oak_semantic::semantic_index(&parsed.tree(), self.url(db))
    }
}
```
Random comment, I noticed this in ty
```rust
/// Returns the place table for a specific `scope`.
///
/// Using [`place_table`] over [`semantic_index`] has the advantage that
/// Salsa can avoid invalidating dependent queries if this scope's place table
/// is unchanged.
#[salsa::tracked(returns(deref), heap_size=ruff_memory_usage::heap_size)]
pub(crate) fn place_table<'db>(db: &'db dyn Db, scope: ScopeId<'db>) -> Arc<PlaceTable> {
    let file = scope.file(db);
    let _span = tracing::trace_span!("place_table", scope=?scope.as_id(), ?file).entered();
    let index = semantic_index(db, file);
    Arc::clone(&index.place_tables[scope.file_scope_id(db)])
}
```

That's cool, they have `FileScopeId` which I believe is most similar to our `ScopeId` right now, i.e. it starts at 0 every time for any given file.
And then their ScopeId is abstracted on top of that to generate a unique id of File+FileScopeId via
```rust
/// A cross-module identifier of a scope that can be used as a salsa query parameter.
#[salsa::tracked(debug, heap_size=ruff_memory_usage::heap_size)]
pub struct ScopeId<'db> {
    pub file: File,
    pub file_scope_id: FileScopeId,
}
```

So then their granular queries can query via a single `ScopeId`, which seems quite nice.
Nice, it would make sense to have a similar tracked oak_db type
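A std-only sketch of what such a type could look like: pairing a file id with the file-local scope id yields a key that is unique across the workspace. All names are illustrative, and the real version would be a salsa tracked struct rather than a plain pair:

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct FileId(pub u32);

// File-local scope id: restarts at 0 for every file.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct FileScopeId(pub u32);

// Cross-file scope id: (file, file-local id) is globally unique,
// so narrow queries can key on this single value.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct ScopeId {
    pub file: FileId,
    pub file_scope_id: FileScopeId,
}

pub fn root_scope(file: FileId) -> ScopeId {
    ScopeId { file, file_scope_id: FileScopeId(0) }
}

// Root scopes of n distinct files never collide as map keys.
pub fn distinct_roots(n: u32) -> usize {
    let mut cache: HashMap<ScopeId, u32> = HashMap::new();
    for i in 0..n {
        cache.insert(root_scope(FileId(i)), i);
    }
    cache.len()
}
```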
```rust
/// memory cleanly. Derived queries (e.g. `semantic_index`) store
/// `AstPtr`s rather than tree nodes, so they don't pin an evicted tree.
#[salsa::tracked(returns(ref), lru = 128)]
pub(crate) fn parse(self, db: &dyn Db) -> OakParse {
```
Ok ok, so this is going to sound weird, but bear with me
I think all salsa tracked functions should be parameterized as
```rust
pub(crate) fn parse(db: &dyn Db, <one salsa tracked item>) -> Value {
```

i.e. you'll notice that in ty they have

```rust
#[salsa::tracked(returns(ref), no_eq, heap_size=ruff_memory_usage::heap_size, lru=200)]
pub fn parsed_module(db: &dyn Db, file: File) -> ParsedModule {
```

and in rust-analyzer they have

```rust
impl EditionedFileId {
    pub fn parse(self, db: &dyn SourceDatabase) -> syntax::Parse<ast::SourceFile> {
        #[salsa::tracked(lru = 128)]
        pub fn parse(
            db: &dyn SourceDatabase,
            file_id: EditionedFileId,
        ) -> syntax::Parse<ast::SourceFile> {
            let _p = tracing::info_span!("parse", ?file_id).entered();
            let (file_id, edition) = file_id.unpack(db);
            let text = db.file_text(file_id).text(db);
            ast::SourceFile::parse(text, edition)
        }
        parse(db, self)
    }
}
```

where the inner tracked `parse()` is of this form.
I think this is a rather important footgun, but I don't quite know why
Carefully read this part of the book
https://salsa-rs.netlify.app/tutorial/parser#parameters-to-a-tracked-function
> The first parameter to a tracked function is always the database, `db: &dyn crate::Db`.
> The second parameter to a tracked function is always some kind of Salsa struct.
> Tracked functions may take other arguments as well, though our examples here do not. Functions that take additional arguments are less efficient and flexible. It's generally better to structure tracked functions as functions of a single Salsa struct if possible.
So I have a strong gut feeling we should rewrite these as free functions of this format!
hmm I think methods and free functions expand to the same thing but I'll double check
yep they expand and behave exactly the same way.
We can keep in mind the advice regarding additional arguments. (Although I imagine the performance cost is small)
yea I also double checked today: methods of `(self, db)` and free functions of `(db, tracked_struct)` expand the same way as long as you mark the impl block with `#[salsa::tracked]` as well, which you do.
Book does not reflect this 😛
oh I think salsa is under quite active development, so the book might not always be up to date
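For what it's worth, the rust-analyzer shape quoted above (a public method that just forwards to a free function of `(db, tracked_item)`) can be mimicked without salsa. This std-only sketch uses a `RefCell` map as a stand-in for salsa's memoization, and every name in it is hypothetical:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct FileId(pub u32);

pub struct Database {
    texts: HashMap<FileId, String>,
    parse_cache: RefCell<HashMap<FileId, usize>>, // stand-in for salsa memos
}

impl Database {
    pub fn new() -> Self {
        Self { texts: HashMap::new(), parse_cache: RefCell::new(HashMap::new()) }
    }
    pub fn set_text(&mut self, file: FileId, text: &str) {
        self.texts.insert(file, text.to_string());
        self.parse_cache.borrow_mut().remove(&file); // invalidate on edit
    }
}

// Free function of (db, one tracked item), the shape the salsa book recommends.
// Here "parse" just counts whitespace-separated tokens.
fn parse(db: &Database, file: FileId) -> usize {
    if let Some(&cached) = db.parse_cache.borrow().get(&file) {
        return cached;
    }
    let value = db.texts.get(&file).map_or(0, |t| t.split_whitespace().count());
    db.parse_cache.borrow_mut().insert(file, value);
    value
}

impl FileId {
    // Ergonomic method form that only forwards to the free function.
    pub fn parse(self, db: &Database) -> usize {
        parse(db, self)
    }
}
```

Either entry point hits the same cache, which matches the finding that the method and free-function forms expand identically.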
Allows Salsa comparisons without depending on it (as Update would)
Branched from #1211
Closes #1141
Progress towards #1212
This PR introduces:

- The `oak_db` crate, which hosts a Salsa database containing all LSP inputs.
- A `File` data structure that represents an R file (similar to `Document` in the legacy LSP). It holds the parse tree and semantic index for a file.
- `salsa` (latest version).

The cache holds at most 128 parse trees at any time (following rust-analyzer) to spare memory.
Higher level consumers should depend on narrow Salsa queries, to reduce the risk of being invalidated by unrelated edits. For this reason, although `File` contains the parse tree and semantic index, they are kept `pub(crate)` so they are not reachable from outside `oak_db` (e.g. from `oak_ide`).

For context, we're working our way up towards:

- `File`
- `File` handles from the file tree.