Skip to content

test(e2e): add scaling & KV functionality tests; extract shared helpers#192

Merged
ahrtr merged 1 commit into
etcd-io:mainfrom
ballista01:e2e-tests
Sep 17, 2025
Merged

test(e2e): add scaling & KV functionality tests; extract shared helpers#192
ahrtr merged 1 commit into
etcd-io:mainfrom
ballista01:e2e-tests

Conversation

@ballista01

@ballista01 ballista01 commented Aug 19, 2025

Copy link
Copy Markdown
Member

Summary

This PR extends the E2E suite under test/e2e/ with realistic cluster-lifecycle scenarios, PVC-backed clusters, and shared helpers to reduce duplication and flakiness. It adds scaling, pod recovery with persistent storage, and KV/replication checks using etcdctl from inside cluster pods.
No controller or production code changes are included.

What’s in this PR

  • test/e2e/e2e_test.go

    • Keeps existing TestInvalidClusterSize, TestClusterHealthy.

    • New: TestScaling (table-driven: ScaleInFrom3To1, ScaleOutFrom1To3)

      • Verifies the controller updates StatefulSet replicas.
      • Checks etcd member list count and ensures no learners remain after scaling.
    • New: TestPodRecovery

      • Starts a 3-node PVC-backed cluster, deletes one pod, waits for recreation, and verifies it reuses the PVC.
      • Asserts member ID consistency before/after recovery (no unintended re-adds).
      • Validates replication by put on one member and get from the recovered member.
    • Updated: TestEtcdClusterFunctionality

      • Runs on a PVC-backed 3-node cluster.
      • Validates endpoint health; put/get/del; cross-member read for replication; and ensures no learners in a healthy cluster.
  • test/e2e/helpers_test.go (new)

    • Cluster + storage

      • createEtcdClusterWithPVC — creates an EtcdCluster with StorageSpec (64Mi) and PVC.
      • getAvailableStorageClass — resolves StorageClass via ETCD_E2E_STORAGECLASS env or default SC; skips if none suitable.
    • Lifecycle / waiting

      • waitForSTSReadiness — waits for StatefulSet ReadyReplicas using e2e-framework wait (tuned timeouts/intervals).
      • scaleEtcdCluster — patches spec.size and waits for STS readiness.
      • cleanupEtcdCluster — best-effort deletion.
    • Introspection

      • execInPod — runs etcdctl inside pods and returns stdout/stderr.
      • getEtcdMembers — parses etcdctl member list -w json into typed structs to build name → memberID map (more robust than string parsing).
      • verifyPodUsesPVC — asserts a pod mounts the expected PVC (by prefix).
    • Types: EtcdMember, EtcdMemberList.

Motivation

  • Address prior review feedback to split the large refactor into focused pieces.

  • Add coverage for scenarios that require a real cluster (DNS/networking/storage) and are not reliable at unit-test level:

    • Scaling: keeps spec.size, StatefulSet replicas, and etcd member list in sync; ensures learners are promoted.
    • Persistence & recovery: validates PVC reuse and member ID stability across pod recreation.
    • Data-path checks: etcdctl health + KV + cross-member reads to prove replication.

How it validates behavior

  • Executes etcdctl inside pods (execInPod) to observe internal cluster state (health, member list, KV).
  • Uses JSON parsing for membership (getEtcdMembers) to avoid brittle string checks.
  • waitForSTSReadiness and targeted wait.For loops reduce flakes by waiting for ReadyReplicas and pod recreation.
  • Storage-aware helpers ensure tests run with PVC-backed clusters and skip gracefully when no StorageClass is available.

Prior work

Huge thanks to #51 by @abdurrehman107! It inspired the feature-based layout with sigs.k8s.io/e2e-framework, the Setup - Assess - Teardown flow and the use of wait.For + conditions.ResourceScaled for readiness. Appreciate the groundwork!

Notes for reviewers

  • If your cluster has no default StorageClass, it's best to ETCD_E2E_STORAGECLASS=<your-sc> before running E2E.
  • Timeouts are conservative (5m for STS readiness; 3m for pod recreation) to minimize flakes; happy to tune based on CI signal.

@k8s-ci-robot

Copy link
Copy Markdown

Hi @ballista01. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ballista01 ballista01 changed the title E2e tests test(e2e): add scaling & learner-promotion scenarios and shared helpers Aug 19, 2025
@ballista01 ballista01 marked this pull request as ready for review August 19, 2025 17:42
@ahrtr

ahrtr commented Aug 19, 2025

Copy link
Copy Markdown
Member

/ok-to-test

Comment thread test/e2e/e2e_test.go Outdated
Comment thread test/e2e/e2e_test.go Outdated
Comment on lines +191 to +195
command := []string{
"etcdctl", "member", "add", "etcd-promote-learner-2",
"--peer-urls=http://etcd-promote-learner-2.etcd-promote-learner.etcd-operator-system.svc.cluster.local:2380",
"--learner",
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If TLS is enabled, the command will fail. More importantly, why adding member directly? It should be the responsibility of the controller.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, maybe we should keep the test case in unit test rather than make it E2E.

func TestPromoteLearner(t *testing.T) {

Removed.

@ahrtr

ahrtr commented Aug 20, 2025

Copy link
Copy Markdown
Member

also cc @frederiko

@ballista01 ballista01 force-pushed the e2e-tests branch 2 times, most recently from dc43cbe to 136dcac Compare August 20, 2025 19:15
@ballista01 ballista01 changed the title test(e2e): add scaling & learner-promotion scenarios and shared helpers test(e2e): add scaling & KV functionality tests; extract shared helpers Aug 20, 2025

@ivanvc ivanvc left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pull request, @ballista01! I left a couple of comments :)

Comment thread test/e2e/e2e_test.go Outdated
Comment thread test/e2e/e2e_test.go Outdated
}
}

// TODO: TestPodRecovery is commented out - would like maintainer feedback on expected behavior

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is worth discussing in a different issue.

In what scenarios would the pod get deleted? The scheduler shouldn't delete the pod, unless I'm missing/misunderstanding something.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! I've created an issue at #193 for discussion.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick update: This only repros when the cluster is created without persistent storage, or when a Pod and its PVC are removed together (effective permanent loss). With PVC-backed clusters (typical in production), deleting a Pod behaves as expected: the StatefulSet recreates the same ordinal with its PVC and the member rejoins; no downscale.

If you prefer, I’ll close #193 as not reproducible for the transient case and send a small docs PR clarifying the above.

Comment thread test/e2e/helpers_test.go Outdated
Comment thread test/e2e/helpers_test.go
Comment thread test/e2e/e2e_test.go Outdated

// Verify controller promoted all learners to voting members
for _, line := range memberLines {
if strings.Contains(line, "isLearner=true") {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you manually verified the output of command etcdctl member list?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to parsing etcdctl JSON with etcd’s official types instead of string matching.

Comment thread test/e2e/helpers_test.go Outdated
Comment thread test/e2e/e2e_test.go Outdated
Comment thread test/e2e/helpers_test.go Outdated
Comment thread test/e2e/e2e_test.go Outdated
Comment thread test/e2e/helpers_test.go Outdated
Comment thread test/e2e/e2e_test.go Outdated
Comment thread test/e2e/helpers_test.go Outdated
Comment thread test/e2e/helpers_test.go Outdated
Comment thread test/e2e/helpers_test.go Outdated
HashKV etcdserverpb.HashKVResponse `json:"HashKV"`
}
if err := json.Unmarshal([]byte(stdout), &entries); err != nil {
t.Fatalf("Failed to parse endpoint hashkv JSON: %v. Raw: %s", err, truncate(stdout, 512))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

print the whole response? It won't too long.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean removing the truncate

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ahrtr

ahrtr commented Sep 15, 2025

Copy link
Copy Markdown
Member

This PR high level looks good to me.

@frederiko @ivanvc do you have comment on the implementation? PTAL, thx

@ahrtr

ahrtr commented Sep 16, 2025

Copy link
Copy Markdown
Member

Please rebase & squash the commit, thx

@ballista01

Copy link
Copy Markdown
Member Author

Please rebase & squash the commit, thx

on it

@ivanvc ivanvc left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks, @ballista01.

Comment thread test/e2e/helpers_test.go Outdated
- add scenarios for invalid sizes, controller health, scaling, pod recovery, and cluster-wide data consistency
- add helpers to create PVC-backed clusters, wait for STS readiness, inspect member lists, and verify PVC usage

Signed-off-by: Wenxue Zhao <ballista01@outlook.com>

@ahrtr ahrtr left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thank you!

@k8s-ci-robot

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, ballista01, ivanvc

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ahrtr ahrtr merged commit d18a538 into etcd-io:main Sep 17, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants