[Python] Support tree-model TsFiles in TsFileDataFrame #816
Detect model kind at reader open, synthesize a virtual TableEntry for tree files (root segment as table name, _col_1.._col_N for positional path-depth headers, union of all device measurements as fields), and preserve per-device measurement ownership via series_stats_by_ref so the dataset layer never registers phantom (device, field) pairs. The dataset surface (__len__, list_timeseries, __getitem__, .loc) is unchanged for both models. New public API: read-only df.model and df.list_timeseries_metadata(). Mixing table and tree files in one load set is rejected with a clear error. Tree-model reads route through query_table_on_tree with client-side device filtering, working around two cwrapper bugs (multi-segment path rejection and duplicate col_* leak across queries). Tree-mode rendering omits the leading "table" column, uses _col_i headers, prints None tag cells as "None"; the repr header now carries the model marker. Tests: 4 new tree-model tests cover metadata + repr layout, single-series and aligned reads, list_timeseries_metadata column shape, and mixed-model load rejection. All 40 tests pass.
Phase 1a-5 of the Python-side memory optimization for TsFileDataFrame indexes at 30k-device scale: 81.20 MB -> 48.90 MB (-32.30 MB, -39.8%).

- metadata: introduce SeriesStats NamedTuple + empty_series_stats() singleton; replace per-series 6-key dict (~360 B) with NamedTuple (~120 B). [Phase 2: -10.53 MB]
- dataframe: drop the _DerivedCache class entirely; inline its three members (catalog/order/path lookups) into _LogicalIndex methods computed lazily from existing data. [Phase 4: -10.85 MB]
- dataframe: replace _LogicalIndex.device_refs (List[List[DeviceRef]]) with device_bounds (List[Tuple[Optional[int], Optional[int]]]) aggregated at register time so _query_aligned reads bounds in O(1) without holding per-reader DeviceRef tuples. [Phase 5: -3.42 MB]
- dataframe: drop redundant series_ref_set; use series_ref_map keys for membership checks. [Phase 1a: -2.00 MB]
- reader: get_series_info_by_ref reads SeriesStats attributes and exposes them as the existing dict shape (no API change for callers).

40/40 dataset tests still pass. See notes/TsFileDataFrame实施总结.md section 6 for per-phase measurement breakdown.
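The dict-to-NamedTuple swap behind Phase 2 can be sanity-checked in isolation; a minimal sketch (the field names here are assumed, not taken from metadata.py):

```python
import sys
from typing import NamedTuple


class SeriesStats(NamedTuple):
    # Hypothetical fields; the real metadata.py layout may differ.
    start_time: int
    end_time: int
    row_count: int


as_dict = {"start_time": 0, "end_time": 100, "row_count": 42}
as_tuple = SeriesStats(start_time=0, end_time=100, row_count=42)

# A NamedTuple instance is a plain tuple: no per-instance __dict__ or
# hash table, which is where the per-series byte savings come from.
assert sys.getsizeof(as_tuple) < sys.getsizeof(as_dict)
assert as_tuple.row_count == as_dict["row_count"]
```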
Unify table/tree-model semantics so the dataset surface only carries real series. Previously the table-model branch padded series_stats_by_ref with empty_series_stats() placeholders for every (device, field) cell declared in the schema but never written, producing a Cartesian-product index that grew linearly with schema width even when most cells were empty. After this change, both models populate series_stats_by_ref only for cells with statistic.row_count > 0. The dataset view becomes 'real devices x real fields', matching the principle that 'as many devices (and series) exist as were physically written'.

Changes:
- metadata: SeriesStats fields tighten from Optional[int] to int; delete _EMPTY_SERIES_STATS constant and empty_series_stats() helper; drop unused Optional import.
- reader: _metadata_field_stats now filters on statistic.has_statistic + row_count > 0 (was timeline_statistic); placeholder branch in _cache_metadata_table_model removed; iter_owned_series_refs docstring rewritten as a single unified description for both models.
- tests: new test_dataset_omits_table_model_phantom_series_for_skipped_cells proves Tablet-skipped cells stay out of list_timeseries / __len__ / series_ref_map and tsdf[skipped_path] raises KeyError.

Bench impact:
- 30k devices x 1 field full-density (existing bench): 0 change (no phantoms existed in this scenario).
- 5k devices x 5 fields x 60% density (new sparse bench): series_ref_map 15.93 -> 9.55 MB, series_stats_by_ref 7.53 -> 4.51 MB, tracked total ~26.5 -> 16.30 MB (~38% reduction).

41/41 dataset tests pass.
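The new visibility rule is small enough to state as a predicate; a stand-alone sketch with a stand-in statistic object (only the has_statistic / row_count attribute names come from the commit, the class itself is hypothetical):

```python
from dataclasses import dataclass


@dataclass
class FakeStatistic:
    # Stand-in for the native statistic block; only these two attribute
    # names are taken from the commit message.
    has_statistic: bool
    row_count: int


def is_real_series(statistic: FakeStatistic) -> bool:
    # A (device, field) cell enters series_stats_by_ref only if its
    # statistic block is populated and at least one row was written.
    return statistic.has_statistic and statistic.row_count > 0


assert is_real_series(FakeStatistic(True, 10))
assert not is_real_series(FakeStatistic(True, 0))   # declared, never written
assert not is_real_series(FakeStatistic(False, 0))  # no statistic block at all
```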
After Phase 6 unified table/tree-model semantics, iter_owned_series_refs became a no-op wrapper around catalog.series_stats_by_ref.keys(). Its docstring still existed only to explain the (now-impossible) phantom- series concern. Remove the method and inline direct iteration at the 3 call sites: - reader.iter_series_paths / iter_series_refs: iterate the catalog dict directly (matches the pattern already used by series_count). - dataframe._register_reader: drop the getattr fallback (which existed only to handle a hypothetical reader without iter_owned_series_refs); iterate catalog.series_stats_by_ref directly. Net: -22 / +9 lines, no behavior change. 41/41 dataset tests pass.
Rename the cross-file unified index to _DataFrameCatalog and shorten its field names for clarity: - device_order -> devices - device_index_by_key -> device_index - device_bounds -> device_time_bounds - series_refs_ordered -> series - series_ref_map -> series_shards Also slim the SeriesStats docstring in metadata.py and update tests to the new names. No behavior change.
Slim docstrings on _cache_metadata_tree_model, _read_series_by_row_tree and _read_arrow_tree to the one-line essentials.
tsdf.loc[..., [name, idx]] where both refs resolve to the same logical series previously appended the read fragment once per spec entry, producing duplicated timestamps and NaN-padded rows downstream. Append once per series in _query_aligned and add a regression test.
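The fix amounts to deduplicating resolved refs before fragments are appended; a minimal stand-alone sketch of that idea (function and variable names here are hypothetical, not the PR's code):

```python
from typing import Callable, Iterable, List, Set


def resolve_unique(specs: Iterable[object], resolve: Callable[[object], str]) -> List[str]:
    # Resolve each .loc spec entry (name or positional index) to a logical
    # series ref, but keep only the first occurrence so the aligned query
    # appends each read fragment exactly once.
    seen: Set[str] = set()
    ordered: List[str] = []
    for spec in specs:
        ref = resolve(spec)
        if ref not in seen:
            seen.add(ref)
            ordered.append(ref)
    return ordered


series = ["root.sg.d1.temp", "root.sg.d1.hum"]
resolve = lambda spec: series[spec] if isinstance(spec, int) else spec

# A name and an index pointing at the same logical series collapse to one ref.
assert resolve_unique(["root.sg.d1.temp", 0], resolve) == ["root.sg.d1.temp"]
```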
JackieTien97 left a comment
Code Review: [Python] Support tree-model TsFiles in TsFileDataFrame
Overall this is a well-structured PR with solid test coverage. The tree-model integration, phantom-series elimination, and index slim-down are all clean and well-motivated. I have a few concerns — one potential correctness issue, one performance concern, and several nits. See inline comments.
```python
_OVERLAP_ROW_CHUNK_SIZE = 256


MODEL_TABLE = "table"
MODEL_TREE = "tree"
```
MODEL_TABLE and MODEL_TREE are defined identically in both reader.py:37-38 and dataframe.py:54-55. The dataframe module already imports from reader (from .reader import ...). Consider defining these constants once in reader.py (or metadata.py) and importing them in dataframe.py to avoid drift.
The test file imports MODEL_TABLE from reader, so both definitions are actively used — easy to get them out of sync in the future.
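One possible shape for the single-source fix (module layout assumed from the review comment, not verified against the PR):

```python
# reader.py would keep the single authoritative definition:
MODEL_TABLE = "table"
MODEL_TREE = "tree"

# dataframe.py (and the test file) would then import rather than redefine:
#     from .reader import MODEL_TABLE, MODEL_TREE
# so the two modules cannot silently drift apart.

assert MODEL_TABLE == "table"
assert MODEL_TREE == "tree"
```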
```diff
-            not timeline_statistic.has_statistic
-            or timeline_statistic.row_count <= 0
-        ):
+        if not statistic.has_statistic or statistic.row_count <= 0:
```
The filter condition changed from checking timeline_statistic.has_statistic to checking statistic.has_statistic. This is a semantic shift: the old code filtered based on the timeline (timestamp axis), while the new code filters on the value statistic.
Are there cases where statistic.has_statistic is False but timeline_statistic.has_statistic is True, or vice versa? If the two can diverge, this changes which series are visible. The docstring says "iff its native statistic block is populated" but the old behavior was gating on timeline_statistic.
Also, line 388 now reads timeline_statistic unconditionally after the statistic check passes. If a measurement has statistic.has_statistic=True but timeline_statistic.has_statistic=False (or timeline_statistic is None), lines 393-395 would raise. Worth a defensive check or a note explaining why this can't happen.
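A defensive shape for that unconditional read might look like the following; the attribute names follow the diff, but the helper, its signature, and the stand-in class are all hypothetical:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Stat:
    # Minimal stand-in for a native statistic block; attribute names
    # follow the diff context, the class itself is not the real one.
    has_statistic: bool
    row_count: int = 0
    start_time: int = 0
    end_time: int = 0


def extract_time_bounds(statistic: Stat, timeline: Optional[Stat]) -> Optional[Tuple[int, int]]:
    if not statistic.has_statistic or statistic.row_count <= 0:
        return None  # series filtered out; timeline is never touched
    if timeline is None or not timeline.has_statistic:
        # Fail loudly here instead of raising an opaque AttributeError
        # a few lines later when the timeline fields are read.
        raise ValueError("statistic populated but timeline_statistic missing")
    return (timeline.start_time, timeline.end_time)


assert extract_time_bounds(Stat(False), None) is None
assert extract_time_bounds(Stat(True, 5), Stat(True, 5, 10, 99)) == (10, 99)
```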
```python
timestamps = []
values = []
skipped = 0
with self._reader.query_table_on_tree([field_name]) as result_set:
```
Both _read_series_by_row_tree and _read_arrow_tree do a full query_table_on_tree scan over the entire file and then client-side filter by target_path_segments. For a file with many devices, this means reading every row to extract data for a single device.
This is acknowledged in the PR description as a cwrapper limitation workaround, which is fine for now — but worth flagging for future readers: this is O(total_rows) per device, so an aligned read across N devices in the same file becomes O(N × total_rows). The comments at the call-site document the "why" well; a brief # O(total_rows) — see PR #816 for cwrapper limitations here would also help future profilers find the hot path.
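As a possible future mitigation that needs no cwrapper changes, the client-side filter could at least amortize to one scan for all requested devices; a hypothetical sketch (the row shape and all names are assumptions):

```python
from typing import Dict, Iterable, List, Tuple

# (device path, timestamp, value); the real result-set row shape is assumed.
Row = Tuple[str, int, float]


def scan_once_for_devices(
    rows: Iterable[Row], targets: Iterable[str]
) -> Dict[str, List[Tuple[int, float]]]:
    # One pass over the result set serves every target device, turning the
    # O(N devices x total_rows) pattern into O(total_rows + N).
    buckets: Dict[str, List[Tuple[int, float]]] = {t: [] for t in targets}
    for device, ts, value in rows:
        hit = buckets.get(device)
        if hit is not None:
            hit.append((ts, value))
    return buckets


rows = [("root.a.d1", 1, 1.0), ("root.a.d2", 1, 2.0), ("root.a.d1", 2, 3.0)]
out = scan_once_for_devices(rows, ["root.a.d1"])
assert out == {"root.a.d1": [(1, 1.0), (2, 3.0)]}
```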
```python
    return list(self.iter_series_paths())


@property
def series_count(self) -> int:
```
MetadataCatalog.series_count (metadata.py:148) still computes the cross-product of all (device, field) combos, but the reader's series_count property (line 111) now uses len(series_stats_by_ref) which counts only physically-present series.
For sparse schemas these two diverge. Tests at test_tsfile_dataset.py:688 and :789 still call reader.catalog.series_count and happen to pass because those fixtures are fully dense, but this is a latent inconsistency.
Consider either: (a) updating MetadataCatalog.series_count to also return len(self.series_stats_by_ref), or (b) deprecating it in favor of the reader property.
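Option (a) would make the catalog agree with the reader property; a sketch with a minimal stand-in class (only the series_stats_by_ref name comes from the PR, everything else is assumed):

```python
from typing import Dict, Tuple


class MetadataCatalogSketch:
    # Hypothetical reduction of MetadataCatalog; the real class holds more state.
    def __init__(self, series_stats_by_ref: Dict[Tuple[str, int], object]):
        self.series_stats_by_ref = series_stats_by_ref

    @property
    def series_count(self) -> int:
        # Count only physically-written series, matching the reader property,
        # instead of the (devices x fields) cross-product.
        return len(self.series_stats_by_ref)


# Two written cells in a schema that declares many more: count stays at 2.
sparse = MetadataCatalogSketch({("d1", 0): object(), ("d2", 1): object()})
assert sparse.series_count == 2
```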
```python
parent_shards = parent._index.series_shards
subset_shards = {ref: parent_shards[ref] for ref in subset_refs}
obj._index = _DataFrameCatalog(
    model=parent._index.model,
```
_from_subset creates a new _DataFrameCatalog but does not propagate tables_with_sparse_tag_values or sparse_device_indices_by_compressed_path. If a view (created by slicing or boolean indexing) later calls _resolve_series_name on a series whose device has sparse tags, the lookup will fail because the compressed-path index is empty.
This was also missing in the old _LogicalIndex code, so it's a pre-existing gap — but since this PR refactors _from_subset, it's worth noting. A one-line fix:
```python
tables_with_sparse_tag_values=parent._index.tables_with_sparse_tag_values,
sparse_device_indices_by_compressed_path=parent._index.sparse_device_indices_by_compressed_path,
```

```diff
  device_key, table_entry, _ = self._get_series_components(series_ref)
- field_stats = self._cache.field_stats[series_ref]
+ # Aggregate per-shard timeline stats lazily on demand for this series.
+ field_stats = _build_field_stats(self._index.series_shards[series_ref])
```
_build_field_stats is now called lazily on every _build_series_info invocation instead of being precomputed once in _DerivedCache. This means:

- `list_timeseries_metadata()` calls `_build_series_info` for every series, each of which calls `_build_field_stats`. For N series across K shards, this is O(N×K) reader calls.
- `__repr__` and `__getitem__(column_name)` also go through `_build_series_info`.

The old approach precomputed these stats once at load time. The PR description frames this as removing _DerivedCache, but the trade-off is that repeated access patterns (e.g., interactive exploration calling repr() then list_timeseries_metadata() then column access) now recompute stats from scratch each time.
For typical usage with moderate series counts this is fine. For wide schemas (the PR's own benchmark cites 5k devices × 5 fields), it could add up. Worth noting in a comment or considering a @functools.lru_cache on the hot path.
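One concrete shape for that cache is a per-instance dict rather than functools.lru_cache (which would pin self in the global cache); all names in this sketch are assumed, not the PR's:

```python
from typing import Dict, List, Tuple

Bounds = Tuple[int, int]


def _build_field_stats(shards: List[Bounds]) -> Bounds:
    # Placeholder aggregation: merge (start, end) time bounds across shards.
    return (min(s[0] for s in shards), max(s[1] for s in shards))


class CatalogSketch:
    """Hypothetical stand-in for _DataFrameCatalog; attribute names assumed."""

    def __init__(self, series_shards: Dict[str, List[Bounds]]):
        self.series_shards = series_shards
        self._stats_cache: Dict[str, Bounds] = {}
        self.build_calls = 0  # instrumentation only, to show the cache works

    def field_stats(self, ref: str) -> Bounds:
        cached = self._stats_cache.get(ref)
        if cached is None:
            self.build_calls += 1
            cached = self._stats_cache[ref] = _build_field_stats(self.series_shards[ref])
        return cached


cat = CatalogSketch({"s1": [(0, 10), (5, 20)]})
assert cat.field_stats("s1") == (0, 20)
assert cat.field_stats("s1") == (0, 20)  # second call served from the cache
assert cat.build_calls == 1
```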
```python
            ]
            # Only the trailing expected_path_len col_i cells are genuine; the
            # leading duplicates are stale from prior queries on this reader.
            col_indices = all_col_indices[-expected_path_len:]
```
The "stale col_ columns from prior queries" workaround (all_col_indices[-expected_path_len:]) appears in both _read_series_by_row_tree (line 556) and _read_arrow_tree (line 747). This is a fragile assumption about cwrapper internal state.
If the cwrapper behavior changes (e.g., fixed to not leak duplicate columns, or leaks in a different pattern), this slicing logic will silently break. Consider at minimum an assertion:
```python
assert len(col_indices) == expected_path_len, (
    f"Expected {expected_path_len} col_ columns, got {len(col_indices)} "
    f"(total {len(all_col_indices)})"
)
```

This would fail fast if the assumption is violated rather than silently returning wrong data.
```diff
-        return f"TsFileDataFrame({total} time series, {len(self._paths)} files)\n"
+        return (
+            f"TsFileDataFrame({model_marker} model, {total} time series, "
```
Nit: PEP 8 expects two blank lines between top-level methods in a class. _repr_header ends at line 1059 and __repr__ starts at line 1060 — missing a blank line separator.
```python
writer.write_row_record(
    RowRecord(
        "root.ln.wf01.wt01",
        t,
```
The test suite covers single tree-model files, but there's no test for loading multiple tree-model files into one TsFileDataFrame. The multi-file merge path (via _register_reader + device_time_bounds aggregation) has tricky edge cases for tree-model:

- Two files with the same device but different measurement subsets — do we get the union of fields? Do time bounds merge correctly?
- Two files with different max_depth — does the tag padding still work correctly?

These are the same merge paths that table-model tests cover, but tree-model's synthetic table construction introduces new failure modes around the union-fields and padding logic.
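To make the padding concern concrete, here is a hypothetical reduction of the union-fields / depth-padding merge; the per-file input shape, function name, and return layout are all assumptions, not the PR's code:

```python
from typing import Dict, List, Optional, Set, Tuple

# Assumed per-file shape: device path -> set of measurement names.
FileIndex = Dict[str, Set[str]]


def merge_tree_files(
    files: List[FileIndex],
) -> Tuple[int, Dict[str, List[Optional[str]]], FileIndex]:
    devices: Set[str] = set()
    for f in files:
        devices.update(f)
    # The max depth across *all* files decides how many _col_i headers exist.
    max_depth = max(len(d.split(".")) - 1 for d in devices)
    padded: Dict[str, List[Optional[str]]] = {}
    fields: FileIndex = {}
    for d in devices:
        segs: List[Optional[str]] = list(d.split(".")[1:])  # tags after root
        padded[d] = segs + [None] * (max_depth - len(segs))
        merged: Set[str] = set()
        for f in files:
            merged |= f.get(d, set())
        fields[d] = merged  # union of measurement subsets across files
    return max_depth, padded, fields


a = {"root.ln.wt01": {"temp"}}
b = {"root.ln.wt01": {"hum"}, "root.ln.wt01.sub": {"status"}}
depth, padded, fields = merge_tree_files([a, b])
assert depth == 3
assert padded["root.ln.wt01"] == ["ln", "wt01", None]  # shorter device padded
assert fields["root.ln.wt01"] == {"temp", "hum"}       # union of subsets
```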
```python
continue
current_root = full_segments[0]
if root_name is None:
    root_name = current_root
```
This validation rejects tree files with multiple root segments (root_name != current_root). In practice, IoTDB tree-model files virtually always use root as the single root. However, the synthetic table uses the root segment as the table_name, so if a file ever had devices under multiple roots, the virtual table would be ambiguous.
The error message is clear and this is the right behavior — just noting that I'd expect this to never actually fire in practice. If it does, it signals a seriously malformed file rather than a normal use case.
Summary
This PR teaches the Python `TsFileDataFrame` to read tree-model TsFiles (in addition to the existing table-model support) and, in the same series of commits, slims the underlying dataset index so it scales to wide / sparse schemas without paying for phantom `(device, field)` cells.

The user-facing dataset surface (`__len__`, `list_timeseries`, `__getitem__`, `.loc`, aligned reads) is unchanged for both models — tree-model files just become loadable through the same API.
What changed

Tree-model support

- Model kind is detected at reader open: tree-structured files load as the tree model; otherwise table model.
- Tree files get a synthetic virtual `TableEntry`: the root segment becomes the table name, with positional headers `_col_1 .. _col_N` (one positional column per remaining path segment, padded with `None` for shorter devices).
- Only the `(device_id, field_idx)` pairs that are actually written on disk enter `series_stats_by_ref`, so the dataset never advertises phantom series.
- Tree-model reads route through `query_table_on_tree` with client-side device filtering. This works around two cwrapper limitations on the current native build (`query_tree_by_row` rejects multi-segment device paths, and successive `query_table_on_tree` calls leak duplicate `col_*` columns); both are documented inline at the cwrapper boundary.
- Tree-mode rendering omits the leading "table" column, uses `_col_i` headers, prints `None` tag cells as `"None"`, and surfaces the model marker in the repr header.
- Mixing table and tree files in one load set is rejected with a clear error.
Dataset index slim-down

- `SeriesStats` becomes a `NamedTuple` (~120 B vs. the previous ~360 B per-series dict).
- `_DerivedCache` removed; lookups computed lazily on top of existing state.
- `device_refs: List[List[DeviceRef]]` collapsed into a pre-aggregated `device_time_bounds: List[Tuple[Optional[int], Optional[int]]]`, so `_query_aligned` reads bounds in O(1).
- Redundant `series_ref_set` dropped (use `series_ref_map` keys).
- The table-model branch no longer pads `series_stats_by_ref` with empty placeholders for schema-declared but never-written cells. The dataset view is now strictly "real devices × real fields" in both models.
- Rename `_LogicalIndex` → `_DataFrameCatalog`, shorten the five internal field names (`devices` / `device_index` / `device_time_bounds` / `series` / `series_shards`), and inline the now-trivial `iter_owned_series_refs` wrapper.

New public API

- `TsFileDataFrame.model` — read-only model marker (`"table"` or `"tree"`).
- `TsFileDataFrame.list_timeseries_metadata()` — per-series metadata as a flat tabular view (works identically for both models).
Compatibility

- Existing code that loads table-model TsFiles continues to work without modification.
- `SeriesStats` integer fields tighten from `Optional[int]` to `int`. The surrounding `get_series_info_by_ref` still exposes them as the existing dict shape, so callers do not see an API change.
Memory impact

Two benches were used because the wins land in different shapes of schema.

Bench A — 30 k devices × 1 field, full density

| Optimization | Saving |
| --- | --- |
| `SeriesStats` NamedTuple | −10.53 MB |
| `_DerivedCache` removal | −10.85 MB |
| `device_time_bounds` aggregation | −3.42 MB |
| `series_ref_set` removal | −2.00 MB |
| Phantom `(device, field)` cells | 0 MB |

End-to-end: 81.20 → 54.40 MB (−33 %). Dropping phantom cells brings nothing here because every device writes the single declared field; there are no skipped cells to prune.
Bench B — 5 k devices × 5 fields × 60 % density (sparse / wide)

| Structure | Before | After |
| --- | --- | --- |
| `series_ref_map` | 15.93 MB | 9.55 MB |
| `series_stats_by_ref` | 7.53 MB | 4.51 MB |
| Tracked total | ~26.5 MB | 16.30 MB |

Dropping phantom cells alone brings −38 % on this fixture; the sparser and wider the schema, the larger the win.
Testing

- `python -m pytest python/tests/test_tsfile_dataset.py` → 41 / 41 pass.
- 4 new tree-model tests cover metadata + repr layout, single-series read, multi-field aligned read, `list_timeseries_metadata` column shape, and the mixed-model load rejection.
- A regression test (`test_dataset_omits_table_model_phantom_series_for_skipped_cells`) proves Tablet-skipped cells stay out of `list_timeseries`, `__len__`, `series_ref_map`, and that `tsdf[skipped_path]` raises `KeyError`.