Add configurable list formatting for CSV/TSV serialization#3134
Add configurable list formatting for CSV/TSV serialization#3134
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3134 +/- ##
==========================================
+ Coverage 79.92% 83.77% +3.85%
==========================================
Files 144 144
Lines 16579 16597 +18
Branches 3421 3428 +7
==========================================
+ Hits 13250 13904 +654
+ Misses 2606 1918 -688
- Partials 723 775 +52
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
e788b02 to
2264cc3
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds configurable list formatting for CSV/TSV serialization to address issue #3041, enabling users to control how multivalued fields are serialized (with or without brackets, custom delimiters, and whitespace handling).
Changes:
- Adds schema-level annotations (
list_syntax,list_delimiter,list_strip_whitespace) to control multivalued field formatting in CSV/TSV output - Implements CLI options (
--list-syntax,--list-delimiter,--list-strip-whitespace) to override schema annotations - Extends CSV/TSV loaders and dumpers to handle plaintext-style lists (e.g.,
a|b|c) in addition to python-style lists (e.g.,[a|b|c])
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/data/csvs.md | Comprehensive documentation of the new configuration options with examples and usage instructions |
| packages/linkml/src/linkml/converter/cli.py | Adds three new CLI options for list formatting that apply to both input and output CSV/TSV operations |
| packages/linkml_runtime/src/linkml_runtime/dumpers/delimited_file_dumper.py | Implements list formatting configuration for CSV/TSV output, reading from schema annotations or CLI overrides |
| packages/linkml_runtime/src/linkml_runtime/loaders/delimited_file_loader.py | Implements list formatting configuration for CSV/TSV input, including helper functions for annotation reading and whitespace stripping |
| tests/linkml_runtime/test_loaders_dumpers/test_csv_tsv_loader_dumper.py | Comprehensive integration tests covering plaintext mode, custom delimiters, whitespace handling, and edge cases |
| tests/linkml_runtime/test_utils/test_csv_utils.py | Unit tests for annotation reading (contains a test schema that uses slot-level annotations inconsistently with implementation) |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tests/linkml_runtime/test_loaders_dumpers/test_csv_tsv_loader_dumper.py
Outdated
Show resolved
Hide resolved
tests/linkml_runtime/test_loaders_dumpers/test_csv_tsv_loader_dumper.py
Outdated
Show resolved
Hide resolved
|
Re: patch coverage
Update: Added CLI integration tests in commit a1db0e5. The CLI options ( Test coverage includes:
|
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
packages/linkml_runtime/src/linkml_runtime/loaders/delimited_file_loader.py
Show resolved
Hide resolved
|
|
||
|
|
||
| # ============================================================================= | ||
| # pytest-style unit tests for annotation-based CSV configuration (issue #3041) |
There was a problem hiding this comment.
This test file is now a hybrid of 3 styles:
- UnitTest
- pure pytest
- non-idiomatic pytest (using classes)
The guidelines aren't clear what to do when contributing a new test to an existing UnitTest file:
https://linkml.io/linkml/maintainers/contributing.html#unittest-to-pytest-conversion
I favor consistency here, and I'd just convert this all to pure pytest
| # ----------------------------------------------------------------------------- | ||
| # Inline test schemas for annotation testing | ||
| # | ||
| # We use inline schemas here because: |
There was a problem hiding this comment.
This is a bit too much verbiage, sounds like the results of a conversation with an agent
There was a problem hiding this comment.
Fixed. Removed the verbose comment blocks.
| # ----------------------------------------------------------------------------- | ||
| # Note on KeyConfig generation for multivalued primitive slots | ||
| # | ||
| # The _get_key_config() function in csvutils.py is NOT modified per Chris's |
There was a problem hiding this comment.
Adding comments to tests is good if they help future maintenance or explain he purpose or function of the text, but this looks like a conversation that has lost its context
There was a problem hiding this comment.
Fixed. Removed the KeyConfig note — it was stale context.
| return SchemaView(SCHEMA_WHITESPACE_PRESERVE) | ||
|
|
||
|
|
||
| class TestWhitespaceStripping: |
There was a problem hiding this comment.
this style of pytest seems non-idiomatic for this repo
There was a problem hiding this comment.
Fixed. Converted all test classes to plain test_* functions, including the pre-existing CsvAndTsvGenTestCase (mechanical conversion, no behavior change).
cmungall
left a comment
There was a problem hiding this comment.
make the tests more consistent with other tests
| from linkml_runtime.dumpers import csv_dumper, json_dumper, tsv_dumper, yaml_dumper | ||
| from linkml_runtime.loaders import csv_loader, tsv_loader, yaml_loader | ||
| from linkml_runtime.loaders.delimited_file_loader import ( | ||
| _get_list_config_from_annotations, |
There was a problem hiding this comment.
consider making more clearly intended as public
(doctests might work better if it's intended as private)
There was a problem hiding this comment.
Made them public by dropping the underscore prefix: get_list_config_from_annotations, enhance_configmap_for_multivalued_primitives, strip_whitespace_from_lists.
Considered doctests, but the repo has no doctest infrastructure — there's no --doctest-modules in pyproject.toml and no doctest step in CI. The existing >>> examples in ~9 source files are documentation-only and never executed as tests. Making the functions public with dedicated pytest coverage seemed more practical.
Consolidated unaddressed feedback — list formatting (PR #3134)Gathering all outstanding feedback from multiple sources so nothing falls through the cracks. 1. Chris's CHANGES_REQUESTED review (Feb 5) — addressed in code, awaiting re-reviewAll 5 inline comments have been addressed:
Status: Code changes pushed, Chris has not re-reviewed yet. 2. Chris's design spec from Dec 15 rolling meeting notesThe agreed-upon design from our Dec 15 meeting (Chris & Mark rolling notes): # Schema annotation spec
attributes:
name_list:
multivalued: true
annotations:
list_syntax: python ## allowed: python | plaintext
list_delimiter: "; " ## must include space explicitly. No effect if list_syntax == 'python'Mapping to json-flattener:
Cascading: Use SchemaView — default to schema-level annotations, slot-level annotations override. Additional guidance from Chris:
Need to verify: Does the current implementation match this spec exactly? Specifically:
3. Chris's helper function visibility comment (PR inline)Chris said "consider making more clearly intended as public (doctests might work better if it's intended as private)". I made them public and noted the repo has no doctest infrastructure — filed #3146 to track adding it. Chris hasn't responded to this. 4. Copilot review — invalid
|
9b1e739 to
d0aa8d5
Compare
|
@cmungall Heads up on one deviation from our Dec 15 spec. We discussed slot-level annotations overriding schema-level defaults, but the current implementation only supports schema-level annotations. Why: json-flattener's I think schema-level-only is the right call here. The main use case driving this (MIxS-style "semicolon-delimited, no brackets") is uniform across all multivalued fields in a given schema anyway. If a per-slot use case comes up later, we can add it via json-flattener at that point. Also rebased onto current main and resolved the conflict with #3118's new converter tests. All 54 tests pass (25 converter + 29 CSV/TSV runtime). Ready for re-review when you get a chance. |
Scope and known limitationsWhat this PR does and doesn't touchThis PR modifies the That means after this merges, Schema-level only annotationsOur Dec 15 spec discussed slot-level annotation overrides via SchemaView. The implementation is schema-level only — see my earlier comment for why (json-flattener's Known edge cases
|
SSSOM alignment — context from today's linkml-dev meetingNico suggested looking at how SSSOM handles multivalued field packing in TSV. Turns out the SSSOM ecosystem is actively debating the same problems we're solving here, literally today. How SSSOM does itFrom the SSSOM/TSV spec:
Active spec work happening right now
How PR #3134 compares
With EscapingNeither SSSOM (1.0) nor this PR handle delimiter-in-value escaping. SSSOM is actively working on backslash escaping for 1.1 (backslash-pipe for literal pipe, double-backslash for literal backslash). If/when that lands and we want to support it in LinkML, it would be a follow-up — the annotation-based configuration in this PR provides the right foundation to build on. Related
|
Add unit and integration tests for configurable multivalued field delimiters in CSV/TSV serialization. Tests follow Chris Mungall's design guidance: logic should be in loader/dumper files, not csvutils.py. Tests include: - Annotation reading (list_syntax, list_delimiter) via SchemaView - Integration tests using personinfo.yaml with dynamic alias injection - Parametrized tests for different delimiter configurations - Edge case tests (empty lists, single values, delimiter in values) All new tests are skipped pending implementation, following TDD approach. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add support for customizing multivalued field formatting in CSV/TSV serialization via slot annotations: - list_syntax: "python" (default, with brackets) or "plaintext" (no brackets) - list_delimiter: custom delimiter between list items (default "|") Implementation: - Add _get_list_config_from_annotations() to read annotations from schema - Add _enhance_configmap_for_multivalued_primitives() for plaintext mode - Update loader and dumper to use annotation-derived configuration - Logic is in loader/dumper files per Chris Mungall's guidance Tests: - Enable plaintext roundtrip tests (now passing) - Enable custom delimiter tests for |, ;, and , (now passing) - 16 tests passing, 14 skipped (edge cases for future work) Documentation: - Add "Customizing multivalued field formatting" section to docs/data/csvs.md - Document list_syntax and list_delimiter annotations with examples Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
) json-flattener's GlobalConfig applies the same csv_list_markers and csv_inner_delimiter to all columns, so slot-level overrides don't make sense. Simplified implementation to only read schema-level annotations. Updated docs and tests to reflect this design constraint. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Ruff UP006/UP035: Use lowercase tuple instead of typing.Tuple Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Use ordinal for temp filename to avoid Windows reserved chars like | Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove TestCsvConfigFromAnnotations and test_list_syntax_to_markers (tested helper functions we never implemented - logic is in loader/dumper) - Remove TestPersoninfoAliasesIntegration (used schema without annotations) - Remove unused fixtures and inline schemas - Update comments to reflect schema-level only design Tests: 17 passed, 3 skipped (pre-existing issues outside PR scope) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Test edge cases for _get_list_config_from_annotations and _enhance_configmap_for_multivalued_primitives: - None schemaview returns defaults - Schema without annotations returns defaults - plaintext_mode=False returns original configmap Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add YAML→TSV conversion example using existing test files - Add TSV→YAML conversion example showing plaintext parsing - Use markdown table to show sample TSV data - Update terminology to "python style (bracketed)" Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add list_strip_whitespace annotation (default true) to control whether whitespace around delimiters is stripped when loading - Add CLI options to linkml-convert to override schema annotations: --list-syntax, --list-delimiter, --list-strip-whitespace - Update documentation with new annotation and CLI options - Add tests for whitespace stripping functionality Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Whitespace stripping now works for both loading and dumping - On input: "a | b" → ['a', 'b'] (stripped) or ['a ', ' b'] (preserved) - On output: ['dog ', 'cat'] → "dog|cat" (stripped) or "dog |cat" (preserved) - Add tests for output whitespace stripping - Update documentation to clarify bidirectional behavior Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move inline schemas to module-level constants (SCHEMA_WITHOUT_ANNOTATIONS, SCHEMA_WHITESPACE_STRIP, SCHEMA_WHITESPACE_PRESERVE) - Add make_delimiter_schema() factory for parametrized delimiter tests - Move fixtures to module level for reusability - Convert loop-based annotation value tests to @pytest.mark.parametrize Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Wrap pipe characters in double backticks in --list-syntax and --list-delimiter help strings. This prevents Sphinx's sphinx-click extension from interpreting them as RST substitution references, which was causing the docs build to fail with warnings. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Pipe characters inside markdown table cells were being interpreted as column delimiters, causing truncated content in the rendered HTML. Escaped with backslash (\|) to render as literal pipes. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update CLI help and docs to say "when loading and dumping" (not just loading) - Simplify CLI help text to avoid formatting issues with special characters Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Move annotations from slot-level to schema-level in test schema to match actual implementation behavior - Remove unused variables from skipped test Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Tests for --list-syntax, --list-delimiter, and --list-strip-whitespace options in linkml-convert CLI. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Warn users if they provide an invalid list_syntax annotation value (e.g., typo like "plainetxt" instead of "plaintext"). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Convert test_csv_utils.py from hybrid unittest/pytest to pure pytest - Convert class-based tests to plain functions in test_csv_tsv_loader_dumper.py - Remove verbose comment blocks and chatty docstrings - Rename helper functions to public (drop underscore prefix): get_list_config_from_annotations, enhance_configmap_for_multivalued_primitives, strip_whitespace_from_lists Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mechanical conversion: drop class wrapper, remove self, remove import unittest. No behavior change. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only accept case-insensitive "true" or "false" for the list_strip_whitespace annotation, with a warning for invalid values. Aligns with the direction in #3144 to avoid YAML 1.1 boolean conventions (yes/no, on/off, 0/1) in CSV-related configuration. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ec4030c to
99696b2
Compare
|
Not strictly related to the issue of list delimiters, but since this is noted here:
This is not a SSSOM-specific limitation. SSSOM does not guarantee the order of the values in a multi-valued slot, because LinkML itself does not guarantee that. The behaviour of SSSOM here was directly taken from the behaviour of the LinkML runtime. The rules for RDF translations do not cover the case of multi-valued slots, so I don’t know what was the intention here, but in effect the LinkML runtime translates multi-valued slots as simple unstructured list of triples (even if the slot is defined with |
When enabled (via schema annotation or CLI flag), raises ValueError before serializing if any multivalued field value contains the list delimiter character. This catches round-trip corruption at write time rather than silently producing corrupt output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New Summary from 2026-02-11
Adds configurable multivalued field formatting for CSV/TSV serialization, via schema-level annotations and CLI options.
Before: Multivalued fields always serialize with brackets:
[value1|value2|value3]After: With
list_syntax: plaintext, fields serialize without brackets:value1|value2|value3Closes #3041. Addresses the core of #2581 (filed by @matentzn as a blocker for supporting common delimited formats like pipe-separated, semicolon-separated, etc.).
Origin and design
This follows the design @cmungall and I agreed on in our Dec 15 rolling meeting notes:
With mapping to json-flattener:
list_syntax: plaintext→csv_list_markers=("", ""),list_delimiter→csv_inner_delimiter.Deviation from spec: schema-level only
The Dec 15 spec discussed slot-level annotations overriding schema-level defaults via SchemaView. The implementation is schema-level only. json-flattener's
GlobalConfigdefinescsv_list_markersandcsv_inner_delimiterat the top level with no per-column configuration path, so slot-level overrides would require extending json-flattener itself. The primary use case (MIxS-style "semicolon-delimited, no brackets") is uniform across all multivalued fields in a schema, so this felt like the right scope for now.No changes to csvutils.py
Per Chris's guidance ("prefer no changes in csvutils.py"), configuration is handled in the loader and dumper rather than in the shared utility layer.
SSSOM alignment
@matentzn suggested checking how SSSOM handles multivalued field packing. With
list_syntax: plaintextandlist_delimiter: "|", our output matches SSSOM's TSV spec exactly (plaina|b|c, no brackets, strip whitespace). LinkML generalizes what SSSOM hardcodes — appropriate for a general-purpose modeling language where different schemas need different conventions.The SSSOM ecosystem is actively working on delimiter-in-value escaping (sssom#507, sssom-java#17). This PR doesn't implement escaping either, but the annotation-based configuration provides the right foundation to add it later.
What's not in scope
linkml-validateloader: This PR modifies thelinkml_runtimeloader/dumper (used bylinkml-convert), not the separatelinkml.validator.loaders.delimited_file_loader. Filed linkml-validate CSV/TSV loader lacks schema-aware parsing (boolean coercion, list splitting) #3147 to track unifying them.refuse_delimiter_in_dataannotation/CLI flag raises aValueErrorbefore serialization if any value contains the delimiter — preventing silent data corruption. Full escaping (e.g., SSSOM 1.1's backslash approach) can be added later.list_elements_ordered: true. Worth noting but orthogonal to this PR.Configuration reference
Schema annotations
list_syntaxpython,plaintextpythonpythonwraps lists in brackets[a|b|c],plaintexthas no bracketsa|b|clist_delimiter|(pipe)list_strip_whitespacetrue,falsetruerefuse_delimiter_in_datatrue,falsefalseValueErrorif any multivalued field value contains the delimiter, preventing silent data corruptionCLI options (override schema annotations)
linkml-convert -s schema.yaml -C Container -S items -t tsv \ --list-syntax plaintext \ --list-delimiter "|" \ --list-strip-whitespace \ --refuse-delimiter-in-data \ input.yaml--list-syntaxpythonorplaintext--list-delimiter--list-strip-whitespace/--no-list-strip-whitespace--refuse-delimiter-in-data/--no-refuse-delimiter-in-dataReview feedback addressed
From @cmungall's review (Feb 5):
From Copilot:
list_syntaxvaluesCoverage:
Files changed
docs/data/csvs.md— documentationpackages/linkml/src/linkml/converter/cli.py— CLI optionspackages/linkml_runtime/src/linkml_runtime/dumpers/delimited_file_dumper.py— output formattingpackages/linkml_runtime/src/linkml_runtime/loaders/delimited_file_loader.py— input parsingtests/linkml/test_utils/test_converter.py— CLI teststests/linkml_runtime/test_loaders_dumpers/test_csv_tsv_loader_dumper.py— runtime teststests/linkml_runtime/test_utils/test_csv_utils.py— utility tests