Skip to content

feat(#71): migrate service layer to list[Transaction] — remove dict round-trips#79

Merged
longieirl merged 6 commits intomainfrom
feat/71-service-layer-list-transaction
Mar 26, 2026
Merged

feat(#71): migrate service layer to list[Transaction] — remove dict round-trips#79
longieirl merged 6 commits intomainfrom
feat/71-service-layer-list-transaction

Conversation

@longieirl
Copy link
Copy Markdown
Owner

@longieirl longieirl commented Mar 26, 2026

Summary

Closes #71. Migrates the service layer to use list[Transaction] throughout, eliminating the dict round-trips that previously occurred in every service call. Also fixes a regression introduced during migration where CSV output had empty Debit/Credit/Balance columns.

Changes

  • Transaction gains document_type and transaction_type fields; from_dict()/to_dict() updated
  • IDuplicateDetector, ITransactionSorting, IIBANGrouping, ITransactionFilter protocols updated to list[Transaction]
  • DuplicateDetectionStrategy.detect_duplicates() and all create_key() methods accept Transaction directly
  • ServiceRegistry.process_transaction_group() / group_by_iban() operate on list[Transaction]; enrichment helpers mutate tx fields in place
  • processor.py passes list[Transaction] through pipeline; dicts only at output boundary via transactions_to_dicts()
  • Bug fix: transactions_to_dicts() now passes currency_symbol="" so dict keys ("Debit") match neutral DEFAULT_COLUMNS names — previously caused empty CSV columns
  • Unused transactions_to_dicts import removed from pdf_processing_orchestrator
  • All 10 affected test files updated: fixtures use Transaction.from_dict(), assertions use attribute access
  • Converter tests updated to assert neutral column names; new regression test builds a real DataFrame to prove key alignment

Type

  • Bug fix
  • New feature
  • Breaking change
  • Refactoring
  • Documentation
  • Performance
  • Security

Testing

  • Tests pass (coverage ≥ 91%)
  • Manually tested
  • make docker-integration passed locally (required when touching Dockerfile, entrypoint.sh, docker-compose.yml, or packages/parser-core/)

Checklist

  • Code follows project style
  • Self-reviewed
  • Documentation updated (if needed)
  • No new warnings

…ound-trips

- Transaction gains document_type and transaction_type fields
- IDuplicateDetector, ITransactionSorting, IIBANGrouping, ITransactionFilter
  protocols updated to list[Transaction]
- DuplicateDetectionStrategy.detect_duplicates() and all create_key() methods
  accept Transaction objects directly
- ServiceRegistry.process_transaction_group() / group_by_iban() operate on
  list[Transaction]; enrichment helpers mutate tx fields in place
- processor.py passes list[Transaction] through pipeline; dicts only at output
  boundary via transactions_to_dicts()
- Unused transactions_to_dicts import removed from pdf_processing_orchestrator
- All 10 affected test files updated: fixtures use Transaction.from_dict() and
  assertions use attribute access (tx.filename etc.)
…utral column names

to_dict() defaulted to currency_symbol='€' producing 'Debit €' keys, but
column_names uses neutral 'Debit'/'Credit'/'Balance' keys — causing pandas
DataFrame to produce empty columns in CSV output.
… regression test

- test_transaction_to_dict and test_transactions_to_dicts now assert 'Debit',
  'Credit', 'Balance' keys (not 'Debit €' etc.)
- New test_transactions_to_dicts_keys_match_default_column_names builds a
  DataFrame with DEFAULT_COLUMNS names to prove key alignment holds
@longieirl longieirl self-assigned this Mar 26, 2026
@longieirl
Copy link
Copy Markdown
Owner Author

Code review

Found 2 issues:

  1. _is_header_transaction calls tx.to_dict() with default currency_symbol="€", inflating non_empty_fields and diluting the 50% header-detection threshold

to_dict() emits 5 extra metadata keys beyond the column fields (source_page, confidence_score, extraction_warnings, document_type, transaction_type). These are always non-empty after enrichment, so they inflate non_empty_fields without contributing to matches. A header row that matched 3/5 fields (60%) pre-enrichment would match 3/8 fields (37.5%) post-enrichment and slip through the filter. The call should be tx.to_dict(currency_symbol="") at minimum, but ideally _is_header_row should explicitly skip the known metadata keys (as it already skips "Filename").

def _is_header_transaction(self, tx: Transaction) -> bool:
"""Check if a Transaction is a header row by inspecting its field values."""
row = tx.to_dict()
return self._is_header_row(row)
def _is_header_row(self, row: dict) -> bool:
"""Check if a row is a header row.

  1. DateAmountDuplicateStrategy.create_key now produces different keys than the old dict-based code — source_page and confidence_score were previously summed incidentally as floats

The old code iterated all dict keys and summed any value parseable as a float. source_page (e.g. "1"1.0) and confidence_score (e.g. "0.95") were therefore included in the total. The new code explicitly iterates only ("debit", "credit", "balance") plus additional_fields, so those two values are dropped. A transaction with debit=50.00, balance=100.00, source_page=1, confidence_score=0.95 previously produced a key with total 151.95; now it produces 150.00. Transactions from overlapping statement periods that were previously detected as duplicates will no longer match.

def create_key(self, transaction: "Transaction") -> str:
"""Create key from date and sum of all amounts."""
total = 0.0
for field_name in ("debit", "credit", "balance"):
value = getattr(transaction, field_name, None)
if value:
amount = to_float(str(value))
if amount is not None:
total += abs(amount)
for value in transaction.additional_fields.values():
if value:
amount = to_float(str(value))
if amount is not None:
total += abs(amount)
return f"{transaction.date}:{total:.2f}"

🤖 Generated with Claude Code

If this code review was useful, please react with 👍. Otherwise, react with 👎.

…mount key behaviour

Bug 1 (transaction_filter.py): _is_header_transaction was calling tx.to_dict()
which includes metadata fields (source_page, confidence_score,
extraction_warnings, document_type, transaction_type). These inflated
non_empty_fields without contributing to matches, diluting the 50%
header-detection threshold. Fixed by inspecting only the five data columns
(Date, Details, Debit, Credit, Balance) directly from the Transaction object.

Bug 2 (strategies.py): DateAmountDuplicateStrategy.create_key now documents
that only monetary fields are summed. The old dict-based code accidentally
included source_page and confidence_score in the total — the new code is
correct and the change in key values is intentional.
@longieirl longieirl merged commit c7e4c00 into main Mar 26, 2026
11 checks passed
@longieirl longieirl deleted the feat/71-service-layer-list-transaction branch March 26, 2026 22:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tech-debt: service layer uses list[dict] instead of list[Transaction] (CURR-01)

2 participants