Skip to content

tabular data ingestion#1720

Merged
jperez999 merged 119 commits intoNVIDIA:mainfrom
ftatiana-nv:staging
Apr 7, 2026
Merged

tabular data ingestion#1720
jperez999 merged 119 commits intoNVIDIA:mainfrom
ftatiana-nv:staging

Conversation

@tomer-levin-nv
Copy link
Copy Markdown
Collaborator

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables have you ensured those are mimicked in the Helm values.yaml file.

liavnave added 2 commits April 1, 2026 15:40
…ularExtractParams

- Replace connection_string field with connector: Optional[SQLDatabase] in TabularExtractParams
- Add arbitrary_types_allowed=True to TabularExtractParams model_config for Pydantic v2
- Update extract_tabular_db_data to read connector from params; return {} when absent
- Guard store_relational_db_in_neo4j against empty data dict
- Update connectors/__init__.py to remove DuckDB export after move to tabular-dev-tools
- Update debug_ingest.py to instantiate DuckDB connector and pass via TabularExtractParams
- Fix test_tabular_pipeline.py: _DummyDuckDB now subclasses SQLDatabase; tests updated
- Update setup_spider2.py and duckdb.py docstring import examples
- Add langchain-nvidia-ai-endpoints import and pyproject.toml platform env for macOS ARM

Made-with: Cursor
… folder

- Move connectors/sql_database.py → tabular_data/sql_database.py
- Delete connectors/__init__.py and the now-empty connectors/ package
- Update all imports to nemo_retriever.tabular_data.sql_database

Made-with: Cursor
@jperez999
Copy link
Copy Markdown
Collaborator

seems like this has conflicts now and needs to be rebased.

…uckdb package

The file was named duckdb.py which caused Python to import itself instead of
the installed duckdb package, resulting in AttributeError on duckdb.connect().
Also changed default connection mode to read_only=True to prevent exclusive
file lock conflicts when multiple processes access the same .duckdb file.

Made-with: Cursor
…cture

- Accept deletion of ingest_modes/batch.py (removed in main as part of the
  architecture migration away from the legacy ingestor pattern).
- Update TabularIngestOperator:
  * _BatchEmbedActor import moved from ingest_modes.batch → text_embed.operators
  * _LanceDBWriteActor (gone in main) replaced with handle_lancedb() from
    vector_store.lancedb_store, materialising the Ray Dataset before writing.

Made-with: Cursor
_build_lancedb_rows_from_df expects a list[dict]; iterating a raw DataFrame
yields column names (strings), causing AttributeError on row.get().

Made-with: Cursor
Remove TabularDescriptionParams, TabularFetchParams, TabularSemanticLayerParams,
and TabularUsageWeightsParams; inline hardcoded keys in fetch_tabular_embedding_dataframe.

Made-with: Cursor
Split into _TabularIngestBase, TabularIngestCPUActor, and a graph-facing
TabularIngestOperator(ArchetypeOperator) that always resolves to the CPU
variant, since GPU embedding is handled internally via Ray actor pools.

Made-with: Cursor
…erators

Remove the monolithic TabularIngestOperator and replace it with two
focused operators each in their own file:
- TabularExtractOp: extracts DB schema and writes to Neo4j, returns DataFrame
- FetchTabularEmbeddingDfOp: fetches entity descriptions from Neo4j

Users now compose these with the existing _BatchEmbedActor to build
their own graph; LanceDB write is handled outside the graph (matching
the unstructured pipeline pattern).

Made-with: Cursor
TabularExtractOp -> TabularSchemaExtractOp (tabular_schema_extract_operator.py)
FetchTabularEmbeddingDfOp -> TabularFetchEmbeddingsOp (tabular_fetch_embeddings_operator.py)

Made-with: Cursor
Remove the graph/ wrapper folder and promote its sub-packages
(model/, dal/, parsers/, services/, utils.py, indexes.py) directly
under ingestion/. Update all import paths accordingly.

Made-with: Cursor
The class was already renamed to Neo4jNode; align the filename to
match. Update all four import sites accordingly.

Made-with: Cursor
Resolved conflicts in ingestor.py and params/__init__.py by keeping
both StoreParams (from main) and TabularExtractParams/TextChunkParams
(from staging).

Made-with: Cursor
- Add SPDX license headers to all new tabular_data Python files
- Fix tuple literals in JSON schema description fields (generate_sql.py)
- Replace naive datetime.now() with datetime.now(timezone.utc) in populate_fks()
- Make num_workers and dialect configurable params in store_relational_db_in_neo4j()
- Switch read-only MATCH query from query_write() to query_read() in embeddings.py
- Replace print() with logger.error() and re-raise in neo4j_connection.py
- Make Neo4j connection lazy (call get_neo4j_conn() per function, not at import time)
- Fix unterminated string literals caused by literal newlines in single-quoted strings
- Fix empty DataFrame column handling in normalize_tables/normalize_columns
- Add early-exit guard in populate_tabular_data() for empty tables_df
- Change DuckDB connector default read_only from True to False

Made-with: Cursor
- Use datetime.now(timezone.utc) for db_node pulled property in schemas_parser.py
- Re-raise exception and use logger.error in add_schema() failure handler

Made-with: Cursor
Remove NEO4J_HEAP_INITIAL, NEO4J_HEAP_MAX, and NEO4J_PAGECACHE which
have no counterpart in helm/values.yaml, causing Helm/Docker parity gap.
Retain NEO4J_PLUGINS for APOC which is required by the ingestion pipeline.

Made-with: Cursor
Cypher does not permit trailing commas in function arguments; remove
them from both apoc.merge.node.eager() calls in add_edges() to prevent
a syntax error on every invocation.

Made-with: Cursor
@NVIDIA NVIDIA deleted a comment from greptile-apps bot Apr 7, 2026
@jperez999 jperez999 merged commit 8906dd8 into NVIDIA:main Apr 7, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants