[yugabyte#28010] docdb: Account for snapshots in per-tablet on-disk size metrics
Summary:
The current `OnDiskSizeInfo` struct for a tablet does not account for snapshots, which can hold a large amount of data. We need to account for snapshot size so that the cluster balancer can estimate how long remote bootstraps will take (that work will come in future diffs). This diff enhances the existing `ts_data_size` metric updater, which runs every 60s and updates each tserver's total on-disk size, so that it also sets the on-disk size for each tablet, accounting for snapshots, MANIFEST files, the retryable requests file, and anything else in the data directory. It still accounts for hard links, so creating a snapshot when there is 1 GB of SSTs in the tablet will not change the total tablet size (because the snapshot files are hard links to the SSTs in the active rocksdb).
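For illustration only, a hard-link-aware size calculation can be sketched as below. This is not the code in this diff: `FileEntry`, `SumUniqueFileSizes`, and `CollectFiles` are hypothetical names, and the sketch assumes a POSIX system where `lstat` reports the device and inode of each file. The idea is to count each (device, inode) pair once, so a snapshot file that is a hard link to an active SST adds nothing to the total.

```cpp
#include <sys/stat.h>

#include <cstdint>
#include <filesystem>
#include <set>
#include <system_error>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical record for one regular file under a tablet's data directory.
struct FileEntry {
  uint64_t device;
  uint64_t inode;
  uint64_t size_bytes;
};

// Sums file sizes, counting each (device, inode) pair only once, so that
// multiple hard links to the same underlying file are not double counted.
uint64_t SumUniqueFileSizes(const std::vector<FileEntry>& entries) {
  std::set<std::pair<uint64_t, uint64_t>> seen;
  uint64_t total = 0;
  for (const auto& entry : entries) {
    if (seen.emplace(entry.device, entry.inode).second) {
      total += entry.size_bytes;
    }
  }
  return total;
}

// Walks `root` recursively and records every regular file found.
// Entries that cannot be stat'ed are skipped; iteration stops on error.
std::vector<FileEntry> CollectFiles(const fs::path& root) {
  std::vector<FileEntry> entries;
  std::error_code ec;
  fs::recursive_directory_iterator it(
      root, fs::directory_options::skip_permission_denied, ec);
  const fs::recursive_directory_iterator end;
  for (; !ec && it != end; it.increment(ec)) {
    struct stat st;
    if (lstat(it->path().c_str(), &st) == 0 && S_ISREG(st.st_mode)) {
      entries.push_back({static_cast<uint64_t>(st.st_dev),
                         static_cast<uint64_t>(st.st_ino),
                         static_cast<uint64_t>(st.st_size)});
    }
  }
  return entries;
}
```

Under these assumptions, `SumUniqueFileSizes(CollectFiles(tablet_dir))` gives a total in which a freshly created snapshot contributes only the files that are not shared with the active rocksdb directory.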
The total on-disk size is stored in memory in each tablet peer. It is passed back to the master via `StorageMetadata` (the same way we currently pass back the SST and WAL sizes), but the master does not use it yet.
Also changed the metrics heartbeat updater to use a common function for getting tablet sizes.
**Upgrade/Downgrade Safety**
Added a `total_size` field to `TabletDriveStorageMetadataPB`, which is passed back as part of the metrics heartbeat from tserver to master. On upgrade/downgrade, master code is always newer than tserver code and thus may receive a value of 0 for a tablet. Future master code that uses this field must tolerate this.
Also added `total_on_disk_size` to the TabletStatusPB proto. This field is read on the metrics heartbeat path and paths where we call ListTablets (like the tablets UI).
The tablets UI has been changed to use the size with snapshots under "Total" (the previous field was just a sum of the sizes of consensus metadata, WALs, and SSTs). If the size with snapshots has not been calculated, "N/A" is displayed. Also made some UI readability modifications: the uncompressed SST size has been removed and the sub-components are now a sublist of Total:
Old UI:
{F377256}
New UI:
{F377270}
During upgrades/downgrades, some tservers may have the field and some may not. Both master and tserver code should be tolerant to getting a value of 0 or a stale value from any tserver.
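A minimal sketch of one way a reader of this field could tolerate a missing value, assuming that a reported value of 0 means "not calculated". The helper names `KnownTotalSize` and `TotalSizeForDisplay` are hypothetical and are not part of this diff; the "N/A" string mirrors the tablets UI behavior described above.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Treats 0 as "not reported" (assumption: an older tserver, or a tserver
// that has not yet run the size updater, leaves the field at its default).
std::optional<uint64_t> KnownTotalSize(uint64_t reported_total_size) {
  if (reported_total_size == 0) {
    return std::nullopt;
  }
  return reported_total_size;
}

// Renders the total for display, falling back to "N/A" when unknown.
std::string TotalSizeForDisplay(uint64_t reported_total_size) {
  const auto known = KnownTotalSize(reported_total_size);
  return known ? std::to_string(*known) + " bytes" : std::string("N/A");
}
```

Returning an `std::optional` keeps the "unknown" case explicit, so callers such as a balancer estimate cannot silently treat a missing size as an empty tablet.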
Jira: DB-17628
Test Plan:
`./yb_build.sh --cxx-test master_heartbeat-itest --gtest_filter=DriveInfoTest.DriveInfo`
`./yb_build.sh --cxx-test integration-tests_tserver_metrics_heartbeat_data_provider-itest --gtest_filter TServerDriveInfoMetricsHeartbeatDataProviderITest.DriveInfo`
Reviewers: mhaddad
Reviewed By: mhaddad
Subscribers: ybase
Differential Revision: https://phorge.dev.yugabyte.com/D45458