Create a production-ready SQLite virtual table extension that presents a collection of directories containing JSONL (JSON Lines) files as queryable SQL tables, enabling flexible analysis and correlation of chat session data with other data sources. The extension must provide a zero-setup user experience and support multi-threaded environments (web servers).
Claude Code generates chat session logs stored as:
- Projects: Directories representing different coding projects
- Sessions: JSONL files (one per chat session) identified by UUID
- Messages: Individual JSON records within JSONL files
Current challenges:
- Data is scattered across multiple directories and files
- No unified query interface for analysis
- Difficult to correlate chat data with external sources
- File structure makes ad-hoc queries cumbersome
- New files are constantly being created (real-time requirement)
- Need production-ready solution for multi-user web server environments
- Zero-setup experience: Load extension and immediately query without manual table creation
- Read project directories and JSONL files dynamically (no data copying)
- Present data as three SQL tables:
projects,sessions,messages - Support standard SQL queries (SELECT, WHERE, JOIN, GROUP BY, etc.)
- Extract stable JSON fields to table columns
- Preserve full JSON for flexible field access
- Handle six message types:
user,assistant,system,file-history-snapshot,summary,queue-operation - Provide SQL helper functions for common JSON extractions
- Enforce read-only access (no INSERT/UPDATE/DELETE)
- Support flexible configuration via key=value parameters
- Real-time visibility: Queries see current state of filesystem
- No data duplication: Virtual tables, not copied data
- Handle large files: Some JSONL files exceed 100MB
- Performance: Query optimization via intelligent xBestIndex implementation
- Thread safety: Safe for multi-threaded web server environments
- Stability: Don't crash on malformed JSON or missing files
- Portability: Works with homebrew SQLite on macOS
- Concurrency: Compatible with WAL mode for multi-user access
Chosen: SQLite virtual table extension
Rationale:
- SQL provides flexible, ad-hoc query interface (use cases not fully known)
- Easy to JOIN with other data sources
- Standard tool (works with any SQLite client)
- Virtual table API designed exactly for this use case
- Better than custom Python API for exploratory queries
- Production-ready with proper implementation
Alternatives Considered:
- Python/TypeScript API: More flexible implementation but non-standard interface, harder for third-party tools
- Load into actual tables: Better performance but requires sync logic, duplicates data, loses real-time visibility
- File-based tools (grep/jq): No relational queries or JOINs
Chosen: Three separate virtual tables (projects, sessions, messages)
Rationale:
- Natural hierarchy matches data structure
- Better query ergonomics (filter by project name, not table name)
- Foreign key relationships express data model clearly
- Standard SQL patterns (JOIN across tables)
- Easier to understand and explain
- Enables efficient query pruning in partitioned scenarios
Alternatives Considered:
- Single unified table: Requires repeating project/session data on every message row, denormalized, harder to query
- Table-per-project: Project names would need SQL escaping, meta-programming required, can't easily query across projects
Chosen: Automatic table creation on extension load
User Experience:
sqlite3 :memory: -init .claude_code_init
# Tables immediately available:
sqlite> SELECT p.project_id, COUNT(*)
...> FROM projects p
...> JOIN sessions s ON p.project_id = s.project_id
...> GROUP BY p.project_id;Implementation:
- Extension performs one-time census on load
- Auto-creates three virtual tables:
projects,sessions,messages - User can immediately query without manual
CREATE VIRTUAL TABLEstatements
Rationale:
- Dramatically improves user experience
- Reduces setup friction for experimentation
- Consistent interface across all installations
- Enables configuration via init file instead of SQL commands
Chosen: Intelligent cost estimation based on WHERE constraints
Critical Insight: Performance lives and dies by xBestIndex implementation. The extension must communicate accurate I/O costs to SQLite's query planner to avoid expensive full scans.
Implementation Strategy:
For messages table:
- Plan 1: Full Scan - No constraints → Must scan all files → Cost: ~10,000,000
- Plan 2: Session Filter -
session_id = ?constraint → Read one file → Cost: ~5,100 - Plan 3: Project Filter -
project_id = ?constraint → Scan project files → Cost: ~500,000
Example xBestIndex Implementation:
static int messages_best_index(sqlite3_vtab *tab, sqlite3_index_info *pIdxInfo) {
int session_idx = -1, project_idx = -1;
// Find usable constraints from WHERE clause
for (int i = 0; i < pIdxInfo->nConstraint; i++) {
if (!pIdxInfo->aConstraint[i].usable) continue;
if (pIdxInfo->aConstraint[i].op == SQLITE_INDEX_CONSTRAINT_EQ) {
if (pIdxInfo->aConstraint[i].iColumn == COL_SESSION_ID) {
session_idx = i;
} else if (pIdxInfo->aConstraint[i].iColumn == COL_PROJECT_ID) {
project_idx = i;
}
}
}
// Report cost based on best available constraint
if (session_idx != -1) {
// Best plan: session_id constraint
pIdxInfo->idxNum = PLAN_SESSION_FILTER;
pIdxInfo->aConstraintUsage[session_idx].argvIndex = 1;
pIdxInfo->aConstraintUsage[session_idx].omit = 1;
pIdxInfo->estimatedCost = 5100.0;
pIdxInfo->estimatedRows = 1000;
} else if (project_idx != -1) {
// Good plan: project_id constraint
pIdxInfo->idxNum = PLAN_PROJECT_FILTER;
pIdxInfo->aConstraintUsage[project_idx].argvIndex = 1;
pIdxInfo->aConstraintUsage[project_idx].omit = 1;
pIdxInfo->estimatedCost = 500000.0;
pIdxInfo->estimatedRows = 10000;
} else {
// Worst plan: full scan
pIdxInfo->idxNum = PLAN_FULL_SCAN;
pIdxInfo->estimatedCost = 10000000.0;
pIdxInfo->estimatedRows = 1000000;
}
return SQLITE_OK;
}Rationale:
- Accurate cost estimation enables optimal query plans
- Avoids scanning millions of records when filtering by session/project
- Critical for interactive query performance
- Foundation for future partitioning optimizations
Chosen: Mutex-protected one-time initialization
Target Environment: Multi-threaded web servers where multiple threads may load extension simultaneously.
Implementation:
// Global state
static sqlite3_mutex *g_init_mutex = NULL;
static int g_initialized = 0;
int sqlite3_claude_code_init(sqlite3 *db, char **pzErrMsg,
const sqlite3_api_routines *pApi) {
SQLITE_EXTENSION_INIT2(pApi);
// Lazy-initialize mutex
if (!g_init_mutex) {
g_init_mutex = sqlite3_mutex_alloc(SQLITE_MUTEX_STATIC_MASTER);
}
sqlite3_mutex_enter(g_init_mutex);
if (g_initialized) {
sqlite3_mutex_leave(g_init_mutex);
return SQLITE_OK; // Already initialized
}
// Perform census, create tables, register functions
// ... initialization logic ...
g_initialized = 1;
sqlite3_mutex_leave(g_init_mutex);
return SQLITE_OK;
}Rationale:
- Prevents race conditions during initialization
- Essential for web server deployments
- Uses SQLite's built-in mutex primitives
- Low overhead after initial setup
Database Concurrency: Users should enable WAL mode for optimal multi-user concurrency:
PRAGMA journal_mode=WAL;In WAL mode, readers do not block writers, enabling concurrent access essential for web applications.
Chosen: Hybrid approach - extract stable fields, keep full JSON
Extracted fields:
type: Message type (always present)timestamp: Message timestamp (present in most types)uuid/message_id: Unique identifierparent_id: Thread structure (parentUuid)user_type: Message originator (userType)session_id: Links to session UUIDproject_id: Links to project directory name
Stored as JSON:
json_data(TEXT column): Full JSON object for flexible access
Rationale:
- Different message types have vastly different schemas
messageobject varies significantly (user vs. assistant)- Nested objects (thinking blocks, usage stats, etc.)
- JSON schema may evolve over time
- Users can extract any field via
json_extract(json_data, '$.path') - Common queries benefit from indexed columns (type, timestamp, session_id)
- xBestIndex can optimize queries on extracted columns
Alternatives Considered:
- Fully denormalized: Extract all fields → many nullable columns, schema changes break extension
- Pure JSON: Everything in json_data → slow for common queries, no indexes
Chosen: Virtual tables are strictly read-only
Rationale:
- Source of truth is JSONL files, not database
- Prevents accidental data corruption
- Simpler implementation (no xUpdate logic)
- Matches primary use case (analysis, not modification)
Implementation: Do not implement xUpdate function in virtual table module, or implement to return SQLITE_READONLY.
Chosen: TEXT column type for JSON data
Rationale:
- SQLite has no dedicated JSON or JSONB type
- JSON is stored as TEXT in SQLite
- SQLite's json1 extension provides functions (
json_extract(), etc.) - No performance difference vs. hypothetical JSON type
Chosen: Provide custom SQL functions alongside virtual tables
Functions to implement:
get_message_content(json): Extract message content (handle arrays/text)get_message_role(json): Extract message.roleget_message_model(json): Extract model for assistant messagesget_message_text(json): Get plain text (skip thinking blocks)message_token_count(json): Extract token usage statsis_thinking_message(json): Check if contains thinking block
Rationale:
- Common extractions are tedious with
json_extract() - Improves query readability
- Can evolve independently of schema
- Users can still use
json_extract()for custom fields
Documentation approach: Create _jsonl_functions metadata table listing available functions with examples.
Chosen: Helper functions parse JSON on every call (no caching)
Rationale:
- Real-time requirement: Caching would serve stale data if files change
- Correctness over performance: Always reflect current file state
- Simpler implementation: No cache invalidation logic
- Memory efficiency: No cache memory overhead
- Acceptable performance: Modern JSON parsing is fast enough for interactive queries
If performance becomes an issue: Users can materialize results into actual tables for repeated access.
Alternatives Considered:
- cJSON object caching: Violates real-time visibility, complex invalidation, memory overhead
Chosen: cJSON (https://github.com/DaveGamble/cJSON)
Rationale:
- Simple, single-file library (easy to vendor)
- MIT licensed
- Fast and lightweight
- Widely used
- Sufficient for our parsing needs
Alternatives Considered:
- json-c: More mature but heavier dependency
- SQLite's built-in JSON: Complex to use from C in virtual table context
- Custom parser: Reinventing the wheel, risk of bugs
Chosen: Generic key=value argument parser in xConnect
Implementation:
static int jsonl_connect(sqlite3 *db, void *pAux, int argc,
const char *const *argv, ...) {
// argv[0] = module name, argv[1] = db name, argv[2] = table name
// argv[3+] are key=value arguments
for (int i = 3; i < argc; i++) {
parse_kv_argument(argv[i], &vtab->config);
}
// ... rest of connect logic
}Supported Parameters (current and future):
base_directory=/path- Root directory for scanningexclude_pattern=agent-*- File exclusion patterninclude_project=proj_name- Limit to specific project (for partitioning)include_projects=p1,p2,p3- Limit to project list (for partitioning)
Rationale:
- Low-cost upfront design enables future features
- Enables partitioning without API changes
- Standard SQLite virtual table pattern
- Easy to extend with new parameters
Approach: Lazy loading with no caching
Implementation:
- Don't load all files into memory on
xFilter() - Stream records as SQLite calls
xNext() - Open/close JSONL files as needed
- Parse JSON on-demand per row
- Use xBestIndex to avoid unnecessary file reads
Tradeoffs:
- First query may be slower (directory scan)
- Memory-efficient (no bulk loading)
- Always sees current filesystem state
COUNT(*)queries still require full scan (acceptable for current scale)
Optimization Strategy:
- Phase 1-4: Basic lazy loading
- Phase 5: Intelligent xBestIndex with constraint detection
- Phase 7+: Partitioning for large-scale data
CREATE TABLE projects (
project_id TEXT PRIMARY KEY, -- Directory name (e.g., "-Users-rmichael-Documents-...")
directory TEXT NOT NULL, -- Full path to directory
created_at DATETIME, -- Directory creation time (ctime)
updated_at DATETIME -- Directory modification time (mtime)
);Source: Subdirectories under base_directory parameter
CREATE TABLE sessions (
session_id TEXT PRIMARY KEY, -- UUID from JSONL filename
project_id TEXT NOT NULL, -- Foreign key to projects
file_path TEXT NOT NULL, -- Full path to .jsonl file
record_count INT, -- Number of messages in session
created_at DATETIME, -- File creation time (ctime)
updated_at DATETIME, -- File modification time (mtime)
FOREIGN KEY (project_id) REFERENCES projects(project_id)
);Source: *.jsonl files in project directories (excluding agent-*.jsonl by default)
CREATE TABLE messages (
message_id TEXT PRIMARY KEY, -- uuid field from JSON record
session_id TEXT NOT NULL, -- Foreign key to sessions
type TEXT NOT NULL, -- Message type (user, assistant, system, etc.)
timestamp DATETIME, -- From record.timestamp
parent_id TEXT, -- parentUuid (NULL for root messages)
user_type TEXT, -- userType field
content_type TEXT, -- subtype for system messages
json_data TEXT NOT NULL, -- Full JSON object as text
FOREIGN KEY (session_id) REFERENCES sessions(session_id)
);Source: Individual lines from JSONL files
The extension automatically creates tables on load:
# Create init file
echo ".load ./build/claude_code.dylib" > .claude_code_init
# Start SQLite - tables are immediately available
sqlite3 :memory: -init .claude_code_init
sqlite> SELECT project_id, COUNT(*) AS msg_count
...> FROM projects p
...> JOIN sessions s ON p.project_id = s.project_id
...> JOIN messages m ON s.session_id = m.session_id
...> GROUP BY p.project_id
...> ORDER BY msg_count DESC;The extension scans a default directory (configurable at compile time or via environment variable):
// Default: ~/.claude/projects
// Or set via: export CLAUDE_PROJECTS_DIR=/custom/pathFor custom configurations:
.load ./claude_code.dylib
CREATE VIRTUAL TABLE custom_projects USING jsonl_projects(
base_directory='/custom/path',
exclude_pattern='agent-*'
);
-- Creates: custom_projects_projects, custom_projects_sessions, custom_projects_messages-- Count messages per project
SELECT p.project_id, COUNT(*) as msg_count
FROM projects p
JOIN sessions s ON p.project_id = s.project_id
JOIN messages m ON s.session_id = m.session_id
GROUP BY p.project_id
ORDER BY msg_count DESC;
-- Find all assistant messages in a specific project
SELECT m.timestamp, get_message_text(m.json_data) as content
FROM messages m
JOIN sessions s ON m.session_id = s.session_id
WHERE s.project_id = 'claude-code-logger'
AND m.type = 'assistant'
ORDER BY m.timestamp;
-- Analyze token usage across sessions
SELECT s.session_id, SUM(message_token_count(m.json_data)) as total_tokens
FROM sessions s
JOIN messages m ON s.session_id = m.session_id
WHERE m.type = 'assistant'
GROUP BY s.session_id;
-- Efficient query using session_id constraint (low cost via xBestIndex)
SELECT type, COUNT(*)
FROM messages
WHERE session_id = '136a28d8-a077-4e92-a098-db544000f4cd'
GROUP BY type;- Homebrew SQLite:
/opt/homebrew/Cellar/sqlite/3.51.0 - macOS: Darwin 23.6.0 (Apple Silicon assumed)
- Build tools: clang, gmake
- File sizes: Some JSONL files exceed 100MB
- Real-time data: Files being created/modified during queries
- Multi-threaded: Must support concurrent access from web servers
For multi-user web applications, enable WAL mode:
PRAGMA journal_mode=WAL;This allows readers and writers to work concurrently without blocking each other.
- Use WHERE clauses with
session_idorproject_idwhen possible (leverages xBestIndex optimization) - For repeated queries on static data, materialize results into actual tables
- Monitor query performance with
.timer on
Sessions currently identified only by UUID. Future enhancement:
- Add
session_namecolumn tosessionstable - Extract from JSONL metadata or external mapping file
- Support human-readable session descriptions
As data grows, automatic partitioning will become necessary. The flexible configuration architecture supports this.
Strategy: Census → Partition → Unify
-
Census Phase (in init function):
- One-time scan of base_directory
- Build in-memory list of all projects and their sizes
- Classify projects as "large" (>250MB) or "small"
-
Partition Phase:
- Create dedicated virtual tables for large projects:
CREATE VIRTUAL TABLE _partition_projA_messages USING jsonl_messages( include_project='project_A' );
- Create grouped virtual tables for small projects:
CREATE VIRTUAL TABLE _partition_small_messages USING jsonl_messages( include_projects='proj_B,proj_C,proj_D' );
- Create dedicated virtual tables for large projects:
-
Unify Phase:
- Create user-facing VIEW combining all partitions:
CREATE VIEW messages AS SELECT * FROM _partition_projA_messages UNION ALL SELECT * FROM _partition_projB_messages UNION ALL SELECT * FROM _partition_small_messages;
- Create user-facing VIEW combining all partitions:
Query Pruning Mechanism:
When users query WHERE project_id = 'project_A', SQLite pushes this constraint down to all UNION ALL branches. Partition-aware xBestIndex implementations detect when a query is impossible for a partition and report zero cost/rows:
static int messages_best_index(sqlite3_vtab *tab, sqlite3_index_info *pIdxInfo) {
jsonl_vtab *pVtab = (jsonl_vtab*)tab;
const char *requested_project = find_project_constraint(pIdxInfo);
if (requested_project && !is_project_in_partition(pVtab, requested_project)) {
// This partition cannot satisfy this query - signal pruning
pIdxInfo->estimatedCost = 0.0;
pIdxInfo->estimatedRows = 0;
return SQLITE_OK;
}
// ... normal cost calculation for valid queries
}The 0/0 signal tells SQLite to completely skip that partition, making queries highly efficient even with many partitions.
- Currently assumes single-user filesystem
- Could add user_id dimension in future
- Would require changes to directory structure and scanning logic
- Zero-setup: Extension loads and tables are immediately queryable
- Performance: xBestIndex optimization makes filtered queries fast
- Thread safety: Safe for multi-threaded web server use
- Accuracy: Queries return correct data from test and real JSONL files
- Large files: Files >100MB handled without crashes
- Real-time: New files appear in queries after creation
- Read-only: Modification attempts fail gracefully
- Helper functions: Work correctly without caching issues
- Documentation: Enables team members to use and extend the project
- Production-ready: Passes all tests, no memory leaks, clean code
- SQLite Virtual Table API: https://sqlite.org/vtab.html
- SQLite xBestIndex: https://sqlite.org/vtab.html#xbestindex
- CSV Extension (reference): https://sqlite.org/csv.html
- SQLite C API: https://sqlite.org/c3ref/intro.html
- SQLite WAL Mode: https://sqlite.org/wal.html
- cJSON Library: https://github.com/DaveGamble/cJSON
- SQLite JSON Functions: https://sqlite.org/json1.html
- SQLite Thread Safety: https://sqlite.org/threadsafe.html
Document Status: Living document - updated with team feedback Last Updated: 2025-11-19 Authors: Team + Claude Code