Add canonical document and ingestion schema models with RawAsset support #26

Draft
AymanL wants to merge 1 commit into code_relocation from canonical_doc_schema
AymanL (Collaborator) commented Mar 18, 2026

PR Summary (canonical document schema)

This PR lays the foundation for the canonical document schema and aligns the existing ingestion objects with an explicit “canonical document” vocabulary, in which the canonical document is the single source of truth.

What changed

  • Added RawAsset abstraction (eu_fact_force/ingestion/models.py)
    • Introduced abstract RawAsset(TimeStampedModel) with shared raw-asset fields/behavior.
    • Updated SourceFile to inherit from RawAsset while preserving existing fields/relations and keeping delete_source_document_from_s3() as a backward-compatible wrapper.
  • New canonical models (eu_fact_force/ingestion/models.py)
    • Document: stores provider-agnostic metadata (doi, title, authors_raw, published_date, source, language, document_type, external_ids) plus optional links to SourceFile (raw_asset) and the primary ParsedArtifact.
    • IngestionRun: records ingestion attempts with status/error/provider and links to Document and/or SourceFile.
    • ParsedArtifact: represents parsed outputs tied to a SourceFile, optionally linked to Document, including format, storage_key, page_count, parser info, and stats.
  • Extended DocumentChunk (eu_fact_force/ingestion/models.py)
    • Added canonical chunk metadata fields: document, parsed_artifact, chunk_type, page_number, start_offset, end_offset.
    • Kept existing chunk/search fields (content, order, embedding) unchanged for compatibility.
  • Updated Django admin (eu_fact_force/ingestion/admin.py)
    • Registered the new models (Document, ParsedArtifact, IngestionRun) and expanded DocumentChunk admin display to include the new fields.
  • Database migration (eu_fact_force/ingestion/migrations/0003_canonical_document_schema.py)
    • Creates tables for Document, IngestionRun, ParsedArtifact.
    • Adds the new fields and relations, along with their indexes and constraints, to Document and DocumentChunk.
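
To make the relationships listed above easier to picture, here is a minimal, framework-free sketch using plain dataclasses. It is not the Django code from this PR: the `delete_from_storage` method, the `deleted` flag, the `storage_key` field on `RawAsset`, and the `primary_parsed_artifact` attribute name are hypothetical stand-ins, and only a subset of the fields named in the summary is shown.

```python
from abc import ABC
from dataclasses import dataclass
from typing import Optional


@dataclass
class RawAsset(ABC):
    """Stand-in for the abstract RawAsset base (field names are hypothetical)."""
    storage_key: str
    deleted: bool = False

    def delete_from_storage(self) -> None:  # hypothetical method name
        # A real model would call the object store; this sketch only sets a flag.
        self.deleted = True


@dataclass
class SourceFile(RawAsset):
    def delete_source_document_from_s3(self) -> None:
        """Backward-compatible wrapper: the old name delegates to the shared behavior."""
        self.delete_from_storage()


@dataclass
class Document:
    title: str
    doi: str = ""
    raw_asset: Optional[SourceFile] = None
    primary_parsed_artifact: Optional["ParsedArtifact"] = None  # hypothetical name


@dataclass
class ParsedArtifact:
    source_file: SourceFile
    format: str
    storage_key: str
    document: Optional[Document] = None
    page_count: Optional[int] = None


@dataclass
class DocumentChunk:
    # Existing chunk/search fields (embedding omitted for brevity).
    content: str
    order: int
    # New canonical metadata fields, all optional in this sketch.
    document: Optional[Document] = None
    parsed_artifact: Optional[ParsedArtifact] = None
    chunk_type: str = ""
    page_number: Optional[int] = None
    start_offset: Optional[int] = None
    end_offset: Optional[int] = None


# Wire one raw file through to a chunk (all values are made-up examples).
source = SourceFile(storage_key="raw/example.pdf")
doc = Document(title="Example", doi="10.1000/example", raw_asset=source)
artifact = ParsedArtifact(
    source_file=source, format="markdown",
    storage_key="parsed/example.md", document=doc,
)
doc.primary_parsed_artifact = artifact
chunk = DocumentChunk(
    content="abc", order=0, document=doc, parsed_artifact=artifact,
    chunk_type="paragraph", page_number=1, start_offset=0, end_offset=3,
)
source.delete_source_document_from_s3()  # old call sites keep working
```

The point of the wrapper is that callers using the old `delete_source_document_from_s3()` name are unaffected while the shared behavior lives on the base class.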

Tests

  • Added tests/ingestion/test_canonical_document_schema.py
    • Covers Document DOI uniqueness (uniqueness is enforced for non-empty DOIs).
    • Validates linkage wiring for IngestionRun ↔ Document/SourceFile, and ParsedArtifact ↔ Document/SourceFile.
    • Confirms persistence of new DocumentChunk metadata fields.
  • Verified by running: uv run pytest tests/ingestion/ -v (20 tests passed).
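
For reviewers unfamiliar with the "unique only when non-empty" pattern, the sketch below shows one common way to get that behavior at the database level: a partial unique index. It uses the standard-library `sqlite3` module rather than Django, and the table, column, and index names are assumptions; whether the migration in this PR implements the rule this way is not stated in the summary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE document ("
    "id INTEGER PRIMARY KEY, "
    "doi TEXT NOT NULL DEFAULT '', "
    "title TEXT NOT NULL)"
)
# Partial unique index: uniqueness applies only to rows with a non-empty DOI.
conn.execute(
    "CREATE UNIQUE INDEX document_doi_unique_non_empty "
    "ON document (doi) WHERE doi <> ''"
)


def insert_document(doi: str, title: str) -> None:
    # The connection context manager commits on success and rolls back on error.
    with conn:
        conn.execute(
            "INSERT INTO document (doi, title) VALUES (?, ?)", (doi, title)
        )


insert_document("", "No DOI, first")
insert_document("", "No DOI, second")  # allowed: empty DOIs are not indexed
insert_document("10.1000/example", "With DOI")

try:
    insert_document("10.1000/example", "Duplicate DOI")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM document").fetchone()[0]
```

In Django, a comparable rule is usually expressed as a `UniqueConstraint` with a `condition`, which the database backend turns into this kind of partial index.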
