Throttling errors not caught by cert-manager retry mechanism + DescribeDomains inefficiency #40

@vitaaaaa1

Description

In environments with ~20 ACK clusters and thousands of domains using DNS‑01 challenges for certificate issuance, we frequently encounter Throttling.User errors from the Alibaba Cloud DNS API. There are two root causes:

  1. getHostedZone uses DescribeDomains without pagination – The method fetches all hosted zones (potentially hundreds or thousands) to verify a single domain, generating a huge number of API calls.
  2. Throttling errors are not handled effectively by cert-manager's retry queue – When a throttling error occurs, the challenge is requeued and retried almost immediately. The exponential backoff is too short to ride out sustained throttling, leading to an ever‑growing backlog of failed challenges.

Alibaba Cloud enforces a 10 QPS limit for the DescribeDomains API per user. With many clusters and domains running concurrent challenges, this limit is hit frequently. Once throttling starts, all challenge attempts fail, and the number of pending challenges increases, making the situation worse.

Sample error log

The challenge is re-invoked almost immediately after each failure (within a second in the excerpt below):

I0509 10:09:52.850029 dns.go:90] "presenting DNS01 challenge for domain" dnsName="xxx"
E0509 10:09:53.016992 controller.go:157] "re-queuing item due to error processing" err=<
alicloud: error getting hosted zones: alicloud: error describing domains: SDK.ServerError
ErrorCode: Throttling.User
Message: Request was denied due to user flow control.

I0509 10:09:53.017050 dns.go:90] "presenting DNS01 challenge for domain" dnsName="xxx"
E0509 10:09:53.617057 controller.go:157] "re-queuing item due to error processing" err=<
alicloud: error getting hosted zones: alicloud: error describing domains: SDK.ServerError
ErrorCode: Throttling.User
Message: Request was denied due to user flow control.

Proposed Fix

We have implemented a solution that:

  • Replaces DescribeDomains with DescribeDomainInfo – This API checks the existence of a specific domain directly, without listing all zones.
  • Adds retry logic with jitter – The method now makes up to 5 attempts, sleeping for a random 30–60 seconds after each throttling error to give the rate limiter time to recover.
  • Avoids pagination entirely – Since DescribeDomainInfo targets a single domain, pagination is not needed.

Modified getHostedZone method

func (c *aliDNSProviderSolver) getHostedZone(resolvedZone string) (string, error) {
    // ResolvedZone from cert-manager is the authoritative zone (e.g., "example.cn.").
    // We verify its existence in AliDNS using a single API call instead of paginating all domains.
    domain := util.UnFqdn(resolvedZone)

    request := alidns.CreateDescribeDomainInfoRequest()
    request.DomainName = domain

    const maxRetries = 5
    for attempt := 0; attempt < maxRetries; attempt++ {
        _, err := c.aliDNSClient.DescribeDomainInfo(request)
        if err == nil {
            return domain, nil
        }

        if isThrottlingError(err) && attempt < maxRetries-1 {
            wait := randomDuration(30, 60)
            fmt.Printf("alicloud: throttling error on DescribeDomainInfo attempt %d, retrying after %v\n", attempt+1, wait)
            time.Sleep(wait)
            continue
        }

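        // Non-throttling error, or throttling that persisted through the final attempt.
        // %w keeps the SDK error inspectable by callers via errors.As / errors.Is.
        return "", fmt.Errorf("alicloud: error verifying domain %q: %w", domain, err)
    }

    // Not reached in practice: every iteration above either returns or continues.
    // Kept because the compiler requires a terminating return.
    return "", fmt.Errorf("alicloud: exhausted retries verifying domain %q", domain)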
}
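
The method above calls isThrottlingError and randomDuration, which are not shown in this issue. A minimal sketch of what they might look like follows. It assumes the SDK's server error exposes an ErrorCode() string method; the errorCoder interface and the codedError type are hypothetical stand-ins so the example compiles without the SDK, and the real helpers may differ.

```go
package main

import (
	"errors"
	"math/rand"
	"strings"
	"time"
)

// errorCoder is a hypothetical local interface. Assumption: the SDK's
// server error type provides an ErrorCode() string method.
type errorCoder interface {
	ErrorCode() string
}

// codedError is a stand-in for an SDK error, used only for illustration.
type codedError struct {
	code string
	msg  string
}

func (e codedError) Error() string     { return "ErrorCode: " + e.code + " Message: " + e.msg }
func (e codedError) ErrorCode() string { return e.code }

// isThrottlingError reports whether err looks like a flow-control
// rejection such as "Throttling.User".
func isThrottlingError(err error) bool {
	if err == nil {
		return false
	}
	var coded errorCoder
	if errors.As(err, &coded) {
		return strings.HasPrefix(coded.ErrorCode(), "Throttling")
	}
	// Fallback for errors that were flattened to text before reaching us.
	return strings.Contains(err.Error(), "Throttling.")
}

// randomDuration returns a random duration between minSec and maxSec
// seconds, inclusive. If maxSec <= minSec it returns exactly minSec seconds.
func randomDuration(minSec, maxSec int) time.Duration {
	base := time.Duration(minSec) * time.Second
	if maxSec <= minSec {
		return base
	}
	span := time.Duration(maxSec-minSec) * time.Second
	return base + time.Duration(rand.Int63n(int64(span)+1))
}
```

Matching on the "Throttling" prefix rather than the exact "Throttling.User" code is a design choice in this sketch; it would also treat other flow-control codes as retryable, which may or may not be desirable.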

Why This Works

  • DescribeDomainInfo is a lightweight single-domain lookup with a higher QPS limit, drastically reducing the number of API calls per challenge.
  • Internal retries with randomized 30–60 second delays allow the solver to survive transient throttling without relying solely on cert-manager's global requeue mechanism.
  • No pagination overhead – each challenge results in exactly one API call (plus retries) instead of multiple DescribeDomains calls, which may themselves be paginated.

In our tests, this change reduced API calls by over 90% and eliminated throttling‑induced cascading failures.
