Commit 914fc29

[yugabyte#28010] docdb: Account for snapshots in per-tablet on-disk size metrics
Summary:
The current `OnDiskSizeInfo` struct for a tablet does not account for snapshots, which can potentially have a lot of data. We need to account for snapshot size so that the cluster balancer can estimate how long remote bootstraps will take (that work will come in future diffs).

This diff enhances the existing `ts_data_size` metric updater, which runs every 60s and updates each tserver's total on-disk size, to also set the on-disk size for each tablet, accounting for snapshots, MANIFEST files, the retryable requests file, and anything else in the data directory. It still accounts for hard links, so creating a snapshot when there is 1 GB of SSTs in the tablet will not change the total tablet size (because the snapshot is a hardlink to the SST in the active rocksdb).

The total on-disk size is stored in-memory in each tablet peer. This is passed back to the master via `StorageMetadata` (the same way we currently pass back the SST and WAL sizes), but is not currently used.

Also changed the metrics heartbeat updater to use a common function for getting tablet sizes.

**Upgrade/Downgrade Safety**
Added `total_size` field to `TabletDriveStorageMetadataPB` which is passed back as part of the metrics heartbeat from tserver to master. On upgrade/downgrade, master code is always newer than tserver code and thus may receive a value of 0 for a tablet. Future master code that uses this field must be tolerant to this.

Also added `total_on_disk_size` to the TabletStatusPB proto. This field is read on the metrics heartbeat path and paths where we call ListTablets (like the tablets UI). The tablets UI has been changed to use the size with snapshots under "Total" (the previous field was just a sum of the sizes of consensus metadata, WALs, and SSTs). If the size with snapshots has not been calculated, "N/A" is displayed.

Also made some UI readability modifications: the uncompressed SST size has been removed and the sub-components are now a sublist of Total:
Old UI: {F377256}
New UI: {F377270}

During upgrades/downgrades, some tservers may have the field and some may not. Both master and tserver code should be tolerant to getting a value of 0 or a stale value from any tserver.

Jira: DB-17628

Test Plan:
`./yb_build.sh --cxx-test master_heartbeat-itest --gtest_filter=DriveInfoTest.DriveInfo`
`./yb_build.sh --cxx-test integration-tests_tserver_metrics_heartbeat_data_provider-itest --gtest_filter TServerDriveInfoMetricsHeartbeatDataProviderITest.DriveInfo`

Reviewers: mhaddad

Reviewed By: mhaddad

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D45458
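The hard-link behavior described in the summary can be illustrated with a minimal sketch. This is not the code from this diff: `DirSizeCountingHardLinksOnce` and `WriteFileOfSize` are hypothetical names, and the sketch assumes a POSIX system where `stat()` exposes device and inode numbers. Each inode is counted once, so a snapshot that hard-links an existing SST adds nothing to the total, while the bytes remain accounted for if the active copy is later deleted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <set>
#include <string>
#include <utility>

#include <sys/stat.h>

namespace fs = std::filesystem;

// Hypothetical helper (not from the diff): sums regular-file sizes under `root`,
// counting each (device, inode) pair once so hard links are not double counted.
uint64_t DirSizeCountingHardLinksOnce(const fs::path& root) {
  assert(fs::is_directory(root));
  std::set<std::pair<dev_t, ino_t>> seen;
  uint64_t total = 0;
  for (const auto& entry : fs::recursive_directory_iterator(root)) {
    if (!entry.is_regular_file()) continue;
    struct stat st;
    // Skip files that vanish between listing and stat (for example, after a compaction).
    if (::stat(entry.path().c_str(), &st) != 0) continue;
    if (seen.emplace(st.st_dev, st.st_ino).second) {
      total += static_cast<uint64_t>(st.st_size);
    }
  }
  return total;
}

// Demo helper (hypothetical): creates a file containing `size` filler bytes.
void WriteFileOfSize(const fs::path& path, size_t size) {
  std::ofstream out(path, std::ios::binary);
  out << std::string(size, 'x');
}
```

With a 1000-byte file under `rocksdb/` and a hard link to it under `snapshots/`, the helper reports 1000 bytes both before and after the original path is removed, which is the property the summary relies on.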
1 parent 5cea2a1 commit 914fc29

11 files changed: 211 additions and 47 deletions

src/yb/integration-tests/master_heartbeat-itest.cc

Lines changed: 37 additions & 0 deletions
@@ -65,6 +65,8 @@ DECLARE_int32(TEST_mini_cluster_registration_wait_time_sec);
 DECLARE_int32(tserver_unresponsive_timeout_ms);
 DECLARE_bool(persist_tserver_registry);
 DECLARE_bool(master_enable_universe_uuid_heartbeat_check);
+DECLARE_int32(data_size_metric_updater_interval_sec);
+DECLARE_int32(tserver_heartbeat_metrics_interval_ms);

 namespace yb::integration_tests {

@@ -461,6 +463,41 @@ TEST_F(MasterHeartbeatITest, TestRegistrationThroughRaftPersisted) {
       << "TS registered through raft config should be unresponsive, not live";
 }

+class DriveInfoTest : public MasterHeartbeatITest {
+ public:
+  void SetUp() override {
+    ANNOTATE_UNPROTECTED_WRITE(FLAGS_data_size_metric_updater_interval_sec) = 1;
+    ANNOTATE_UNPROTECTED_WRITE(FLAGS_tserver_heartbeat_metrics_interval_ms) = 1000;
+    MasterHeartbeatITest::SetUp();
+  }
+};
+
+TEST_F(DriveInfoTest, DriveInfo) {
+  CreateTable();
+  auto table = table_name();
+  auto& catalog_mgr = ASSERT_RESULT(mini_cluster_->GetLeaderMiniMaster())->catalog_manager();
+  auto table_info = catalog_mgr.GetTableInfoFromNamespaceNameAndTableName(
+      table.namespace_type(), table.namespace_name(), table.table_name());
+  auto tablets = ASSERT_RESULT(table_info->GetTablets());
+
+  // Insert 1000 rows and flush to an SST.
+  for (int i = 0; i < 1000; ++i) {
+    PutKeyValue(Format("k$0", i), Format("v$0", i));
+  }
+  ASSERT_OK(mini_cluster_->CompactTablets());
+  ASSERT_OK(WaitFor([&]() {
+    for (const auto& tablet : tablets) {
+      for (auto& replica : *tablet->GetReplicaLocations()) {
+        if (replica.second.drive_info.sst_files_size == 0) return false;
+        if (replica.second.drive_info.wal_files_size == 0) return false;
+        if (replica.second.drive_info.uncompressed_sst_file_size == 0) return false;
+        if (replica.second.drive_info.total_size == 0) return false;
+      }
+    }
+    return true;
+  }, 30s, "Waiting for drive info to be populated for all tablets"));
+}
+
 class PersistTabletServerRegistryUpgradeTest : public MasterHeartbeatITest {
  public:
   void SetUp() override {

src/yb/integration-tests/minicluster-snapshot-test.cc

Lines changed: 82 additions & 15 deletions
@@ -41,6 +41,7 @@

 #include "yb/client/client.h"
 #include "yb/client/schema.h"
+#include "yb/client/snapshot_test_util.h"
 #include "yb/client/table_creator.h"
 #include "yb/client/table_info.h"
 #include "yb/client/transaction_manager.h"
@@ -83,6 +84,7 @@

 DECLARE_int32(cleanup_split_tablets_interval_sec);
 DECLARE_int32(data_size_metric_updater_interval_sec);
+DECLARE_int32(timestamp_history_retention_interval_sec);
 DECLARE_bool(enable_db_clone);
 DECLARE_int32(load_balancer_initial_delay_secs);
 DECLARE_bool(master_auto_run_initdb);
@@ -195,8 +197,8 @@ Status DeleteSnapshotSchedule(MasterBackupProxy* proxy, const SnapshotScheduleId

 Result<TxnSnapshotId> WaitNewSnapshot(MasterBackupProxy* proxy, const SnapshotScheduleId& id) {
   LOG(INFO) << "WaitNewSnapshot, schedule id: " << id;
-  std::string last_snapshot_id;
-  std::string new_snapshot_id;
+  TxnSnapshotId last_snapshot_id;
+  TxnSnapshotId new_snapshot_id;
   RETURN_NOT_OK(WaitFor(
       [&proxy, &id, &last_snapshot_id, &new_snapshot_id]() -> Result<bool> {
         // If there's a master leader failover then we should wait for the next cycle.
@@ -205,9 +207,10 @@ Result<TxnSnapshotId> WaitNewSnapshot(MasterBackupProxy* proxy, const SnapshotSc
         if (snapshots.empty()) {
           return false;
         }
-        auto snapshot_id = snapshots[snapshots.size() - 1].id();
+        auto snapshot_id = VERIFY_RESULT(FullyDecodeTxnSnapshotId(
+            snapshots[snapshots.size() - 1].id()));
         LOG(INFO) << "WaitNewSnapshot, last snapshot id: " << snapshot_id;
-        if (last_snapshot_id.empty()) {
+        if (last_snapshot_id.IsNil()) {
           last_snapshot_id = snapshot_id;
           return false;
         }
@@ -219,7 +222,7 @@ Result<TxnSnapshotId> WaitNewSnapshot(MasterBackupProxy* proxy, const SnapshotSc
         }
       },
       kInterval * 5, "Wait new schedule snapshot"));
-  return FullyDecodeTxnSnapshotId(new_snapshot_id);
+  return new_snapshot_id;
 }

 Status WaitForRestoration(
@@ -450,6 +453,16 @@ class PostgresMiniClusterTest : public pgwrapper::PgMiniTestBase {

   MiniCluster* mini_cluster() { return cluster_.get(); }

+  Result<TableInfoPtr> GetTable(const std::string& table_name, const std::string& db_name) {
+    auto leader_master = VERIFY_RESULT(cluster_->GetLeaderMiniMaster());
+    for (const auto& table : leader_master->catalog_manager_impl().GetTables(GetTablesMode::kAll)) {
+      if (table->name() == table_name && table->namespace_name() == db_name) {
+        return table;
+      }
+    }
+    return STATUS_FORMAT(NotFound, "Table $0 not found", table_name);
+  }
+
   Status CreateDatabase(
       const std::string& namespace_name,
       master::YsqlColocationConfig colocated = master::YsqlColocationConfig::kNotColocated) {
@@ -653,16 +666,6 @@ class PgCloneInitiallyEmptyDBTest : public PostgresMiniClusterTest {
     PostgresMiniClusterTest::DoTearDown();
   }

-  Result<TableInfoPtr> GetTable(const std::string& table_name, const std::string& db_name) {
-    auto leader_master = VERIFY_RESULT(cluster_->GetLeaderMiniMaster());
-    for (const auto& table : leader_master->catalog_manager_impl().GetTables(GetTablesMode::kAll)) {
-      if (table->name() == table_name && table->namespace_name() == db_name) {
-        return table;
-      }
-    }
-    return STATUS_FORMAT(NotFound, "Table $0 not found", table_name);
-  }
-
   Status SplitTablet(const TabletId& tablet_id) {
     SplitTabletRequestPB req;
     SplitTabletResponsePB resp;
@@ -847,6 +850,70 @@ TEST_F(PgCloneTest, CloneVectorIndex) {
   ASSERT_OK(source_conn_->ExecuteFormat("DROP DATABASE $0", kTargetNamespaceName2));
 }

+class TabletDataSizeMetricsTest : public PostgresMiniClusterTest {
+ protected:
+  void SetUp() override {
+    ANNOTATE_UNPROTECTED_WRITE(FLAGS_data_size_metric_updater_interval_sec) = 1;
+    PostgresMiniClusterTest::SetUp();
+    messenger_ = ASSERT_RESULT(rpc::MessengerBuilder("test-msgr").set_num_reactors(1).Build());
+    proxy_cache_ = std::make_unique<rpc::ProxyCache>(messenger_.get());
+    snapshot_util_ = std::make_unique<client::SnapshotTestUtil>();
+    snapshot_util_->SetProxy(&client_->proxy_cache());
+    snapshot_util_->SetCluster(cluster_.get());
+  }
+
+  void DoTearDown() override {
+    messenger_->Shutdown();
+    PostgresMiniClusterTest::DoTearDown();
+  }
+
+  std::unique_ptr<client::SnapshotTestUtil> snapshot_util_;
+  std::unique_ptr<rpc::Messenger> messenger_;
+  std::unique_ptr<rpc::ProxyCache> proxy_cache_;
+};
+
+TEST_F(TabletDataSizeMetricsTest, TotalOnDiskSizeIncludesSnapshots) {
+  ASSERT_OK(CreateDatabase("testdb"));
+  auto conn = ASSERT_RESULT(ConnectToDB("testdb"));
+  ASSERT_OK(conn.Execute(
+      "CREATE TABLE t1 (key INT PRIMARY KEY, value INT) SPLIT INTO 1 TABLETS"));
+  auto table = ASSERT_RESULT(GetTable("t1", "testdb"));
+  auto tablet_id = ASSERT_RESULT(table->GetTablets())[0]->tablet_id();
+  auto tablet_peer = ASSERT_RESULT(
+      cluster_->mini_tablet_server(0)->server()->tablet_peer_lookup()->GetServingTablet(tablet_id));
+  LOG(INFO) << "Tablet id: " << tablet_id;
+
+  // Write some data and create a snapshot (this should create an SST file).
+  ASSERT_OK(conn.Execute(
+      "INSERT INTO t1 VALUES (generate_series(1,1000), generate_series(1,1000))"));
+  ASSERT_RESULT(snapshot_util_->CreateSnapshot(table->id()));
+
+  // Get the on-disk size of the tablet.
+  tablet::TabletOnDiskSizeInfo size_before;
+  ASSERT_OK(WaitFor([&]() {
+    size_before = tablet_peer->GetOnDiskSizeInfo();
+    return size_before.sst_files_disk_size > 0 &&
+           size_before.total_on_disk_size > size_before.sst_files_disk_size;
+  }, 30s, "Wait for on-disk size to include the SST"));
+
+  // Delete the data in the tablet and re-compact. The SST file size should drop to 0.
+  ASSERT_OK(conn.Execute("DELETE FROM t1"));
+  ASSERT_OK(WaitFor([&]() {
+    // Retrigger compactions until this peer has no SST files. This is required because if the
+    // delete has only applied on the other two peers, the compaction would not do anything on this
+    // peer.
+    FlushAndCompactTablets();
+    return tablet_peer->GetOnDiskSizeInfo().sst_files_disk_size == 0;
+  }, 30s, "Wait for on-disk size to drop after delete"));
+
+  // Once the on-disk size metric updater has run, check that the total on-disk size has not
+  // decreased by a significant portion of the size of the SST we deleted.
+  SleepFor(FLAGS_data_size_metric_updater_interval_sec * 2s);
+  auto size_after = tablet_peer->GetOnDiskSizeInfo();
+  auto size_diff = size_before.total_on_disk_size - size_after.total_on_disk_size;
+  ASSERT_LT(size_diff, size_before.sst_files_disk_size * 0.9);
+}
+
 class TsDataSizeMetricsTest : public PgCloneTest {
  public:
   uint64_t GetTsDataSize() {
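The final assertion of `TotalOnDiskSizeIncludesSnapshots` can be read as a simple predicate: if the snapshot's hard link still holds the SST bytes, the total should drop by much less than the SST size after the active SSTs are compacted away. The sketch below restates that check in isolation; `SnapshotStillCounted` is a hypothetical name and the byte counts in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical restatement of the test's last check: the drop in total on-disk
// size must be smaller than 90% of the SST size measured before the delete.
bool SnapshotStillCounted(int64_t total_before, int64_t total_after, int64_t sst_before) {
  const int64_t size_diff = total_before - total_after;
  return static_cast<double>(size_diff) < static_cast<double>(sst_before) * 0.9;
}
```

For example, with an SST of 1000 bytes, a total that moves from 1100 to 1090 passes the check, while a total that falls from 1100 to 100 fails it, which would indicate the snapshot was not included.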

src/yb/master/catalog_entity_info.h

Lines changed: 2 additions & 1 deletion
@@ -145,11 +145,12 @@ struct TabletReplicaDriveInfo {
   uint64 wal_files_size = 0;
   uint64 uncompressed_sst_file_size = 0;
   bool may_have_orphaned_post_split_data = true;
+  uint64 total_size = 0;

   std::string ToString() const {
     return YB_STRUCT_TO_STRING(
         sst_files_size, wal_files_size, uncompressed_sst_file_size,
-        may_have_orphaned_post_split_data);
+        may_have_orphaned_post_split_data, total_size);
   }
 };

src/yb/master/master_heartbeat.proto

Lines changed: 1 addition & 0 deletions
@@ -104,6 +104,7 @@ message TabletDriveStorageMetadataPB {
   optional uint64 wal_file_size = 3;
   optional uint64 uncompressed_sst_file_size = 4;
   optional bool may_have_orphaned_post_split_data = 5 [default = true];
+  optional uint64 total_size = 6;
 }

 message TabletLeaderMetricsPB {

src/yb/master/master_heartbeat_service.cc

Lines changed: 1 addition & 0 deletions
@@ -1438,6 +1438,7 @@ void MasterHeartbeatServiceImpl::ProcessTabletMetadata(
         .wal_files_size = storage_metadata.wal_file_size(),
         .uncompressed_sst_file_size = storage_metadata.uncompressed_sst_file_size(),
         .may_have_orphaned_post_split_data = storage_metadata.may_have_orphaned_post_split_data(),
+        .total_size = storage_metadata.total_size(),
     };
     tablet->UpdateReplicaInfo(ts_uuid, drive_info, leader_lease_info);
   }
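The commit message notes that `total_size` is not used on the master yet and that future consumers must tolerate a value of 0 from tservers that have not reported it. One possible way to do that is sketched below. The fallback policy (using SST plus WAL sizes when the total is missing) is an assumption made for illustration, and `DriveInfoSketch` and `EstimatedReplicaSize` are hypothetical names, not part of this diff.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for the TabletReplicaDriveInfo fields relevant here.
struct DriveInfoSketch {
  uint64_t sst_files_size = 0;
  uint64_t wal_files_size = 0;
  uint64_t total_size = 0;  // 0 means the tserver has not reported a total.
};

// Hypothetical consumer-side policy (an assumption, not the diff's behavior):
// prefer the reported total, otherwise fall back to the sizes that older
// tservers already send.
uint64_t EstimatedReplicaSize(const DriveInfoSketch& info) {
  if (info.total_size != 0) {
    return info.total_size;
  }
  return info.sst_files_size + info.wal_files_size;
}
```

A caller that sizes remote bootstraps would then get a lower-bound estimate from older tservers instead of treating the tablet as empty.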

src/yb/tablet/tablet.proto

Lines changed: 7 additions & 1 deletion
@@ -52,7 +52,10 @@ message TabletStatusPB {
   optional bytes start_key = 5;
   // DEPRECATED.
   optional bytes end_key = 6;
-  optional int64 estimated_on_disk_size = 7;
+  // Estimated size of the tablet on disk. Includes consensus metadata, WALs, and SSTs, but does
+  // not include snapshots, retryable requests, MANIFEST, and other structures.
+  // Unlike total_on_disk_size, this is always up-to-date.
+  optional int64 active_on_disk_size = 7;
   optional int64 consensus_metadata_disk_size = 11;
   optional int64 wal_files_disk_size = 12;
   optional int64 sst_files_disk_size = 13;
@@ -67,6 +70,9 @@ message TabletStatusPB {
   optional string pgschema_name = 20;
   repeated bytes vector_index_finished_backfills = 21;
   optional OpIdPB last_op_id = 22;
+  // Estimated size of the tablet on disk, including snapshots, retryable requests, MANIFEST,
+  // and other files in those directories. This is updated periodically, so may be stale or 0.
+  optional int64 total_on_disk_size = 23;
 }

 // Used to present the maintenance manager's internal state.

src/yb/tablet/tablet_peer.cc

Lines changed: 11 additions & 1 deletion
@@ -798,7 +798,7 @@ void TabletPeer::GetTabletStatusPB(TabletStatusPB* status_pb_out) {
   std::lock_guard lock(lock_);
   DCHECK(status_pb_out != nullptr);
   DCHECK(status_listener_.get() != nullptr);
-  const auto disk_size_info = GetOnDiskSizeInfo();
+  const auto disk_size_info = GetOnDiskSizeInfoUnlocked();
   status_pb_out->set_tablet_id(status_listener_->tablet_id());
   status_pb_out->set_namespace_name(status_listener_->namespace_name());
   status_pb_out->set_table_name(status_listener_->table_name());
@@ -1585,6 +1585,11 @@ void TabletPeer::UnregisterMaintenanceOps() {
 }

 TabletOnDiskSizeInfo TabletPeer::GetOnDiskSizeInfo() const {
+  std::lock_guard lock(lock_);
+  return GetOnDiskSizeInfoUnlocked();
+}
+
+TabletOnDiskSizeInfo TabletPeer::GetOnDiskSizeInfoUnlocked() const {
   TabletOnDiskSizeInfo info;

   if (consensus_) {
@@ -1602,10 +1607,15 @@ TabletOnDiskSizeInfo TabletPeer::GetOnDiskSizeInfo() const {
     info.wal_files_disk_size = log->OnDiskSize();
   }

+  info.total_on_disk_size = total_on_disk_size_;
   info.RecomputeTotalSize();
   return info;
 }

+void TabletPeer::SetTabletOnDiskSize(size_t total_on_disk_size) {
+  total_on_disk_size_ = total_on_disk_size;
+}
+
 size_t TabletPeer::GetNumLogSegments() const {
   auto log = log_atomic_.load(std::memory_order_acquire);
   return log ? log->num_segments() : 0;
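The changes to `tablet_peer.cc` follow a common locked/unlocked accessor split: a public getter that takes the lock, a private `...Unlocked` variant for callers that already hold it, and an atomic field written by the periodic updater without the lock. The sketch below shows that shape in isolation. It assumes a plain non-recursive `std::mutex`; `PeerSketch`, `SizeInfoSketch`, `StatusTotal`, and `SetSizesForDemo` are hypothetical names, and the real `TabletPeer` has many more members.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>

struct SizeInfoSketch {
  int64_t wal_files_disk_size = 0;
  int64_t sst_files_disk_size = 0;
  int64_t active_on_disk_size = 0;  // Recomputed on every read.
  int64_t total_on_disk_size = 0;   // Copied from the periodically updated atomic.
};

class PeerSketch {
 public:
  // Public entry point: takes the lock, then delegates to the unlocked variant.
  SizeInfoSketch GetOnDiskSizeInfo() const {
    std::lock_guard<std::mutex> lock(lock_);
    return GetOnDiskSizeInfoUnlocked();
  }

  // Called by the periodic updater. No lock is needed because the field is atomic.
  void SetTabletOnDiskSize(size_t total) { total_on_disk_size_.store(total); }

  // Stand-in for a method that already holds lock_ (like GetTabletStatusPB in the diff).
  int64_t StatusTotal() const {
    std::lock_guard<std::mutex> lock(lock_);
    // Calling GetOnDiskSizeInfo() here would try to re-acquire lock_, which is
    // undefined behavior for std::mutex and typically deadlocks.
    return GetOnDiskSizeInfoUnlocked().total_on_disk_size;
  }

  // Demo-only setter for the lock-protected component sizes.
  void SetSizesForDemo(int64_t wal, int64_t sst) {
    std::lock_guard<std::mutex> lock(lock_);
    wal_ = wal;
    sst_ = sst;
  }

 private:
  // Caller must hold lock_.
  SizeInfoSketch GetOnDiskSizeInfoUnlocked() const {
    SizeInfoSketch info;
    info.wal_files_disk_size = wal_;
    info.sst_files_disk_size = sst_;
    info.active_on_disk_size = wal_ + sst_;
    info.total_on_disk_size = static_cast<int64_t>(total_on_disk_size_.load());
    return info;
  }

  mutable std::mutex lock_;
  int64_t wal_ = 0;
  int64_t sst_ = 0;
  std::atomic<size_t> total_on_disk_size_{0};
};
```

Until the updater has run, readers see a total of 0 while the active size is already accurate, which matches the "may be stale or 0" wording in the proto comments.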

src/yb/tablet/tablet_peer.h

Lines changed: 25 additions & 6 deletions
@@ -84,7 +84,14 @@ struct TabletOnDiskSizeInfo {
   int64_t wal_files_disk_size = 0;
   int64_t sst_files_disk_size = 0;
   int64_t uncompressed_sst_files_disk_size = 0;
-  int64_t sum_on_disk_size = 0;
+
+  // Sum of consensus metadata, WALs, and SSTs. Excludes snapshots, retryable requests, MANIFEST,
+  // and other files in those directories. This is always up-to-date.
+  int64_t active_on_disk_size = 0;
+
+  // Estimated size of the tablet on disk, including snapshots, retryable requests, MANIFEST,
+  // and other files in those directories. This is updated periodically, so may be stale or 0.
+  int64_t total_on_disk_size = 0;

   template <class PB>
   static TabletOnDiskSizeInfo FromPB(const PB& pb) {
@@ -93,7 +100,8 @@ struct TabletOnDiskSizeInfo {
       .wal_files_disk_size = pb.wal_files_disk_size(),
       .sst_files_disk_size = pb.sst_files_disk_size(),
       .uncompressed_sst_files_disk_size = pb.uncompressed_sst_files_disk_size(),
-      .sum_on_disk_size = pb.estimated_on_disk_size()
+      .active_on_disk_size = pb.active_on_disk_size(),
+      .total_on_disk_size = pb.total_on_disk_size(),
     };
   }

@@ -103,19 +111,21 @@ struct TabletOnDiskSizeInfo {
     pb->set_wal_files_disk_size(wal_files_disk_size);
     pb->set_sst_files_disk_size(sst_files_disk_size);
     pb->set_uncompressed_sst_files_disk_size(uncompressed_sst_files_disk_size);
-    pb->set_estimated_on_disk_size(sum_on_disk_size);
+    pb->set_active_on_disk_size(active_on_disk_size);
+    pb->set_total_on_disk_size(total_on_disk_size);
   }

   void operator+=(const TabletOnDiskSizeInfo& other) {
     consensus_metadata_disk_size += other.consensus_metadata_disk_size;
     wal_files_disk_size += other.wal_files_disk_size;
     sst_files_disk_size += other.sst_files_disk_size;
     uncompressed_sst_files_disk_size += other.uncompressed_sst_files_disk_size;
-    sum_on_disk_size += other.sum_on_disk_size;
+    active_on_disk_size += other.active_on_disk_size;
+    total_on_disk_size += other.total_on_disk_size;
   }

   void RecomputeTotalSize() {
-    sum_on_disk_size =
+    active_on_disk_size =
         consensus_metadata_disk_size +
         sst_files_disk_size +
         wal_files_disk_size;
@@ -495,6 +505,10 @@ class TabletPeer : public std::enable_shared_from_this<TabletPeer>,

   bool HasSufficientDiskSpaceForWrite();

+  TabletOnDiskSizeInfo GetOnDiskSizeInfo() const EXCLUDES(lock_);
+
+  void SetTabletOnDiskSize(size_t total_on_disk_size);
+
  protected:
   friend class RefCountedThreadSafe<TabletPeer>;
   friend class TabletPeerTest;
@@ -598,7 +612,7 @@ class TabletPeer : public std::enable_shared_from_this<TabletPeer>,
   Status CheckOperationAllowed(
       const OpId& op_id, consensus::OperationType op_type) override;
   // Return granular types of on-disk size of this tablet replica, in bytes.
-  TabletOnDiskSizeInfo GetOnDiskSizeInfo() const REQUIRES(lock_);
+  TabletOnDiskSizeInfo GetOnDiskSizeInfoUnlocked() const REQUIRES(lock_);

   bool FlushBootstrapStateEnabled() const;

@@ -625,6 +639,11 @@ class TabletPeer : public std::enable_shared_from_this<TabletPeer>,
   std::atomic<bool> flush_bootstrap_state_enabled_{false};
   std::shared_ptr<TabletBootstrapStateFlusher> bootstrap_state_flusher_;

+  // The size of the tablet on disk in bytes, including snapshots, retryable requests, MANIFEST,
+  // and other files in those directories. This can be stale as it is only updated every
+  // FLAGS_data_size_metric_updater_interval_sec seconds.
+  std::atomic<size_t> total_on_disk_size_{0};
+
   DISALLOW_COPY_AND_ASSIGN(TabletPeer);
 };
