
SA-663 Load embedding from flat file#66

Open
ivyONS wants to merge 10 commits into main from sa-663-flat-file-load

ivyONS (Contributor) commented May 13, 2026

✨ Summary

Refactors EmbeddingHandler to load and build vector stores from arbitrary flat CSV files instead of only from the SIC hierarchy XLS sources. SIC-specific loading is extracted into a standalone utility (sic_specific_embed.py). The return types of search_index and search_index_multi are replaced with typed Pydantic response models, which are imported into ***-vector-store-api. GCS support is extended to single-file downloads (to handle CSV index source files).
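For reviewers unfamiliar with the single-file GCS case, here is a minimal sketch of the URI handling such a download needs. It is not the code in gcs_file_access. The helper names `split_gcs_uri` and `local_target_for` are hypothetical, and the sketch only parses the URI and picks a local path; it does not call any Google Cloud client.

```python
import tempfile
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/path/file.csv' into (bucket, blob_name).

    Hypothetical helper for illustration; not part of this PR.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"Not a GCS URI: {uri!r}")
    blob_name = parsed.path.lstrip("/")
    # A single-file download needs a concrete object name, not a prefix.
    if not blob_name or blob_name.endswith("/"):
        raise ValueError(f"URI does not point at a single file: {uri!r}")
    return parsed.netloc, blob_name


def local_target_for(uri: str, tmp_dir: str) -> Path:
    """Pick the local path inside tmp_dir where the file would be saved."""
    _, blob_name = split_gcs_uri(uri)
    return Path(tmp_dir) / PurePosixPath(blob_name).name


if __name__ == "__main__":
    # The bucket and object names below are made up for the example.
    with tempfile.TemporaryDirectory() as tmp:
        print(split_gcs_uri("gs://example-bucket/indexes/toy_index.csv"))
        print(local_target_for("gs://example-bucket/indexes/toy_index.csv", tmp))
```

Rejecting prefix-style URIs up front keeps a single-file call from being silently treated as a folder download.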

📜 Changes Introduced

  • feat: EmbeddingHandler.__init__ accepts index_source_file (CSV path or GCS URI); when provided, builds a new vector store; otherwise loads an existing one — replaces the previous _load_or_build_vector_store logic
  • feat: _load_existing_vector_store now raises FileNotFoundError instead of returning None, making the error explicit
  • feat: _build_vector_store — loads data from flat file (native classifai format, no metadata)
  • feat: New sic_specific_embed.load_embedding_handler_from_sic_index_files preserves the old SIC-hierarchy-based build path as an explicit utility function
  • feat: gcs_file_access.download_one_file_from_gcs — downloads a single GCS file to a temp directory for use during vector store build
  • feat: SearchIndexItem & SearchIndexResponse Pydantic models replace list[dict] return types on search_index and search_index_multi
  • feat: EmbeddingHandler.get_embed_config returns a typed EmbeddingConfig snapshot
  • chore: EmbeddingConfig migrated from TypedDict to Pydantic BaseModel; adds index_source_file and index_size fields
  • chore: Example data files converted from .txt to .csv (toy_index, sic_2d_condensed, sic_4d_condensed)
  • chore: CustomVertexAIEmbeddings removed
  • test: test_embedding.py fully updated - adds unit tests
  • test: test_gcs_file_access.py adds tests for download_one_file_from_gcs (success + missing file)
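The build-or-load behaviour described in the first two bullets can be sketched as below. This is an illustration under stated assumptions, not the actual EmbeddingHandler: `MiniHandler`, the `db_dir` argument, the `store.json` file, and the `id`/`text` CSV columns are all hypothetical, and no embeddings are computed (a plain dict stands in for the vector store).

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Optional


class MiniHandler:
    """Hypothetical stand-in for EmbeddingHandler; not the real class."""

    def __init__(self, db_dir: str, index_source_file: Optional[str] = None):
        self.store_path = Path(db_dir) / "store.json"  # assumed layout
        if index_source_file is not None:
            # A source file was given: always build a fresh store.
            self.store = self._build_vector_store(index_source_file)
        else:
            # No source file: an existing store must already be present.
            self.store = self._load_existing_vector_store()

    def _build_vector_store(self, source: str) -> dict:
        # Assumed CSV columns: "id" and "text".
        with open(source, newline="", encoding="utf-8") as handle:
            store = {row["id"]: row["text"] for row in csv.DictReader(handle)}
        self.store_path.parent.mkdir(parents=True, exist_ok=True)
        self.store_path.write_text(json.dumps(store), encoding="utf-8")
        return store

    def _load_existing_vector_store(self) -> dict:
        # Fail loudly instead of returning None when nothing is there.
        if not self.store_path.exists():
            raise FileNotFoundError(f"No vector store at {self.store_path}")
        return json.loads(self.store_path.read_text(encoding="utf-8"))


def demo() -> tuple:
    """Build from a toy CSV, reload it, then try a missing store."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "toy_index.csv"
        src.write_text("id,text\n01,Growing of crops\n02,Forestry\n", encoding="utf-8")
        built = MiniHandler(str(Path(tmp) / "db"), index_source_file=str(src))
        loaded = MiniHandler(str(Path(tmp) / "db"))
        try:
            MiniHandler(str(Path(tmp) / "missing"))
            missing_raised = False
        except FileNotFoundError:
            missing_raised = True
        return built.store, loaded.store, missing_raised
```

Raising FileNotFoundError in the load path means a misconfigured store location surfaces at construction time rather than as a None-related failure during a later search.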

✅ Checklist

Please confirm you've completed these checks before requesting a review.

  • Code is formatted using Black
  • Imports are sorted using isort
  • Code passes linting with Ruff, Pylint, and Mypy
  • Security checks pass using Bandit
  • API and Unit tests are written and pass using pytest
  • Terraform files (if applicable) follow best practices and have been validated (terraform fmt & terraform validate)
  • DocStrings follow Google-style and are added as per Pylint recommendations
  • Documentation has been updated if needed

🔍 How to Test

  • Check that all tests pass: make all-tests
  • Check the demo notebook (demos/embed/sic_embedding_example.py)
  • Make sure it works in sic-classification-vector-store

@ivyONS ivyONS requested a review from gibbardsteve May 14, 2026 07:23