Skip to content

Add grace period for newly created CDC streams (suppress primary/alternate SLA, redirect latency to newlyOnboardedLatencyMs)#1004

Merged
khandelwal-ayush merged 10 commits into
linkedin:masterfrom
mittalprince:datapipes-33203-cdc-bootstrap-sla-grace
May 4, 2026
Merged

Add grace period for newly created CDC streams (suppress primary/alternate SLA, redirect latency to newlyOnboardedLatencyMs)#1004
khandelwal-ayush merged 10 commits into
linkedin:masterfrom
mittalprince:datapipes-33203-cdc-bootstrap-sla-grace

Conversation

@mittalprince
Copy link
Copy Markdown
Collaborator

@mittalprince mittalprince commented Apr 28, 2026

Summary

  • Add a new newStreamGracePeriodMs config in EventProducer (default 2h, operator-overridable via brooklin.server.eventProducer.newStreamGracePeriodMs in server.properties). While newly created CDC streams drain their initial CDC backlog from EARLIEST, source-to-destination latencies are inherently large;
    without a grace window every event would be reported as primary-SLA-violating and lag alerts wired to eventsLatencyMs would fire.
  • During the grace window:
    • Primary SLA (eventsProducedWithinSla / eventsProducedOutsideSla) is suppressed.
    • Alternate SLA (eventsProducedWithinAlternateSla / eventsProducedOutsideAlternateSla) is suppressed.
    • eventsLatencyMs is redirected to eventsLatencyMsSlaIneligible so lag alerts wired to eventsLatencyMs do not fire on initial backlog drain. The catch-up curve stays observable on the dedicated histogram.
  • Warning logs and other rate / volume / send-latency metrics are NOT gated.
  • Restrict the grace gate to CDC sources via an explicit isCdcSource() check (single-slash-URI heuristic, the same one used by the throughput-attribution metrics). MirrorMaker (kafka://) and other non-CDC sources are unaffected — neither SLA suppression nor the latency-histogram redirect applies.
  • When multiple datastreams share a DatastreamTask via dedup, use the oldest creation time across the task so a freshly deduped stream cannot suppress metrics on long-running siblings.
  • All metric-emission gating goes through a single shouldEmitMetric() wrapper. Future suppression conditions (e.g. per-datastream opt-out flags) should be ORed into that one method so call sites stay simple.

Notes / scope

  • This PR was originally also going to change the no-offset-found fallback in AbstractKafkaBasedConnectorTask from LATEST to EARLIEST, but that change has been reverted (390132ee). LinkedIn-internal CDC connectors no longer use this base class (migrated to AbstractXinfraBasedConnectorTask); CDC's EARLIEST
    behaviour is handled in the internal connector subclass.
  • MySQL CDC will continue to default to LATEST since it does not depend on existing snapshots.
  • Distinct from Add per-datastream SLA metric opt-out via system.disableSlaMetric  #1005: that PR introduces slaExcludedLatencyMs for a per-datastream opt-out flag (permanent, manual). This PR introduces newlyOnboardedLatencyMs for an automatic time-bounded grace window for new CDC streams. Different histograms, different triggers, no merge conflict.

Test plan

  • 11 tests in TestEventProducer covering:
    • New CDC stream in grace → primary and alternate SLA both suppressed
    • Old CDC stream past grace → primary and alternate SLA emit normally
    • MirrorMaker source not affected even with fresh CREATION_MS
    • Fail-open when CREATION_MS is missing or malformed
    • eventsLatencyMs redirected to newlyOnboardedLatencyMs during grace
    • eventsLatencyMs fires (and newlyOnboardedLatencyMs does not) post-grace
    • Custom newStreamGracePeriodMs config override honoured
    • Deduped task with oldest stream past grace → metrics emit normally
    • Deduped task with all streams within grace → metrics suppressed/redirected uniformly
  • All 16 tests in TestEventProducer pass under JDK 8 (project's required toolchain):
    JAVA_HOME=<jdk8> ./gradlew :datastream-server:test --tests com.linkedin.datastream.server.TestEventProducer

…period for new CDC streams

- AbstractKafkaBasedConnectorTask: change handleNoOffsetForPartitionException to default
to EARLIEST instead of LATEST so CDC consumers don't lose bootstrap snapshot + change
events when a partition has no committed offset. Behaviour unchanged when
AUTO_OFFSET_RESET_CONFIG is set explicitly (e.g. KafkaMirrorMaker).
- EventProducer: add newStreamSlaGracePeriodMs (default 2h). New CDC streams have large
initial lag while draining the bootstrap window from EARLIEST, which would otherwise
trigger false SLA violations. SLA counters are suppressed during the grace window;
latency histograms and warning logs continue to fire so observability is preserved.
- Restrict the grace gate to CDC sources via _sourceDatabase != null (single-slash URI
scheme) so MirrorMaker streams report SLA from the first event.
- Use min(creationMs) across deduped datastreams on a task so a freshly deduped stream
cannot suppress SLA on long-running siblings.
- 9 new unit tests in TestEventProducer covering grace ON/OFF, MM exclusion, fail-open
on missing/malformed metadata, latency-still-fires-during-grace, custom config override,
and deduped-task min/max scenarios.
@mittalprince mittalprince force-pushed the datapipes-33203-cdc-bootstrap-sla-grace branch from b6c5c4f to 7912403 Compare April 28, 2026 02:47
@mittalprince mittalprince marked this pull request as ready for review April 28, 2026 02:48
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
- AbstractKafkaBasedConnectorTask: rewrite the EARLIEST-default comment to reflect the
actual reason (reusing existing snapshot/data when no committed offset is found), drop
the misleading "all records including bootstrap" line and the Jira ticket reference.
- EventProducer: gate stream-creation-time parsing on _sourceDatabase != null at
construction so BMM / Inlogs streams skip the work entirely. Move source-database
parsing forward in the constructor so it is available before the grace-period init.
- EventProducer: extract the grace-period check into isWithinSlaGracePeriod() and the
CREATION_MS parsing into parseStreamCreationTimeMs(task). Drop the redundant
_streamCreationTimeMs > 0 guard in the gate; the time arithmetic short-circuits to
false when creation time is 0.
- All 15 tests in TestEventProducer pass.
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
- isWithinSlaGracePeriod: explicit isCdcSource() check at the top so the CDC-only
semantics are stated in the method rather than implied by the constructor leaving
_streamCreationTimeMs at 0 for non-CDC sources.
- Drop the constructor gate around parseStreamCreationTimeMs; the method-level check
is now the single source of truth. Move _sourceDatabase parsing back to its
original position (no longer needs to be early).
- Reword "bootstrap-catch-up" / "initial bootstrap backlog" to "cdc-catch-up" /
"initial cdc backlog" — the topic is not log-compacted and the consumer is
reading CDC data, not a bootstrap snapshot.
- All 15 tests in TestEventProducer pass.
akshayrai
akshayrai previously approved these changes Apr 30, 2026
Per follow-up review discussion: this base class is no longer used by LinkedIn-internal
CDC connectors (migrated to AbstractXinfraBasedConnectorTask), so changing the OSS
default introduces churn for no behavioural benefit. Keep the original LATEST default
to avoid confusion for downstream OSS consumers; CDC's EARLIEST behaviour is handled
in the internal connector subclass. PR now scoped to the EventProducer SLA grace-period
feature only.
@mittalprince mittalprince changed the title Default to EARLIEST on no-offset for CDC bootstrap and add SLA grace period for new CDC streams Add SLA grace period for newly created CDC streams to avoid false violations during catch-up May 3, 2026
Suppress only the primary SLA pair while a CDC stream is in its grace window; the
alternate-SLA pair (eventsProducedWithinAlternateSla / eventsProducedOutsideAlternateSla,
default 3 minute threshold) keeps emitting throughout. Operators retain a coarser
end-to-end SLA signal during the initial CDC catch-up instead of seeing a gap on
both dashboards.
Updated tests testSlaGraceActiveForNewCdcStream, testLatencyHistogramStillFiresDuringGracePeriod,
and testSlaGraceDedupedTaskAllStreamsNew to assert the alternate-SLA counter is
created during grace while the primary remains suppressed.
During the CDC catch-up grace window, redirect the source-to-destination latency
histogram from eventsLatencyMs to slaExcludedLatencyMs. Lag alerts wired to the
former no longer fire on initial backlog drain; the catch-up curve remains
observable on the dedicated histogram.
Uses the same metric name as PR linkedin#1005 (Manish's per-datastream opt-out) so
operators have a single histogram for any latency intentionally excluded from
SLA scoring, regardless of trigger.
- Add SLA_EXCLUDED_LATENCY_MS_STRING constant.
- Pick the histogram name in reportMetrics based on isWithinSlaGracePeriod().
- Register slaExcludedLatencyMs in getMetricInfos with the same percentiles as
eventsLatencyMs.
- Updated testLatencyHistogramStillFiresDuringGracePeriod (renamed to
testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod) to assert the
redirect.
- Added testLatencyHistogramFiresOnPrimaryAfterGracePeriod to assert post-grace
observations go back to eventsLatencyMs.
…d shouldEmitSlaMetric wrapper

- During grace, suppress both the regular primary and alternate SLA counter pairs.
Alternate threshold evaluation is now redirected to slaExcludedWithinAlternateSla /
slaExcludedOutsideAlternateSla so operators can still track grace-period stats on a
dedicated counter pair without polluting the regular alternate-SLA dashboard.
- Add shouldEmitSlaMetric() wrapper around isWithinSlaGracePeriod() so callers gate
on a single function. Future SLA-suppression conditions (e.g. per-datastream opt-out
flags) can be ORed in here without touching call sites.
- reportMetrics now uses shouldEmitSlaMetric() to pick the latency histogram name and
to gate SLA counter emission. Else-branch routes alternate evaluation to the new
slaExcluded counters.
- Register slaExcluded{Within,Outside}AlternateSla in getMetricInfos().
- Update tests so they assert the new routing: regular alt-SLA counter must NOT exist
during grace; slaExcludedWithinAlternateSla must exist with count 1 (latency was
within alt threshold).
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
…la from grace-period names

- Rename slaExcludedLatencyMs to newlyOnboardedLatencyMs (review comment r3179982442).
This metric is conceptually about the newly-onboarded-stream catch-up window, not
about general SLA exclusion, so the name should reflect its actual role.
- Drop the Sla qualifier from the grace-period field, constants, config key, and
helper method (review comment r3179993056). The same grace window now drives both
SLA suppression and the latency-histogram redirect, so the name should not imply
SLA-only scope:
NEW_STREAM_SLA_GRACE_PERIOD_MS  -> NEW_STREAM_GRACE_PERIOD_MS
DEFAULT_NEW_STREAM_SLA_GRACE_PERIOD_MS -> DEFAULT_NEW_STREAM_GRACE_PERIOD_MS
_newStreamSlaGracePeriodMs       -> _newStreamGracePeriodMs
isWithinSlaGracePeriod()         -> isWithinGracePeriod()
shouldEmitSlaMetric()            -> shouldEmitMetric()
server.properties key newStreamSlaGracePeriodMs -> newStreamGracePeriodMs
- Drop the alternate-SLA tracking redirect during grace (the prior else-branch that
routed alt-SLA evaluation to slaExcluded* counters). Alternate SLA is now simply
suppressed during grace alongside primary, matching the simpler observability model.
- Remove now-unused SLA_EXCLUDED_WITHIN_ALTERNATE_SLA and SLA_EXCLUDED_OUTSIDE_ALTERNATE_SLA
constants and their getMetricInfos() registrations.
- Polish field, javadoc, and inline comments for accuracy now that the gate covers both
SLA and latency emission.
- Tests updated to assert simpler suppress-only behavior; renamed
testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod to
testLatencyHistogramRedirectedToNewlyOnboardedDuringGracePeriod.
- All 16 tests in TestEventProducer pass under JDK 8.
@mittalprince mittalprince changed the title Add SLA grace period for newly created CDC streams to avoid false violations during catch-up Add grace period for newly created CDC streams (suppress primary/alternate SLA, redirect latency to newlyOnboardedLatencyMs) May 4, 2026
Comment thread datastream-server/src/main/java/com/linkedin/datastream/server/EventProducer.java Outdated
Copy link
Copy Markdown
Collaborator

@khandelwal-ayush khandelwal-ayush left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now.

harshOSS
harshOSS previously approved these changes May 4, 2026
…kstyle

Open-source repo's checkstyle rule disallows referencing internal ticket names.
The TestEventProducer.java header comment had a leftover internal ticket tag
which slipped through. Removing it.
@mittalprince mittalprince dismissed stale reviews from harshOSS and khandelwal-ayush via 038de16 May 4, 2026 09:43
harshOSS
harshOSS previously approved these changes May 4, 2026
Histogram constant: NEWLY_ONBOARDED_LATENCY_MS_STRING -> EVENTS_LATENCY_MS_SLA_INELIGIBLE_STRING.
Histogram value: newlyOnboardedLatencyMs -> eventsLatencyMsSlaIneligible.
Test method renamed accordingly.
The new name describes the metric's role (latency observations the SLA path is ineligible to
score) rather than the trigger (newly onboarded). Aligns with internal naming preference; the
metric still serves the same purpose - holds redirected latency while shouldEmitMetric() is
suppressing primary/alternate SLA emission for newly created CDC streams.
@khandelwal-ayush khandelwal-ayush merged commit 49ef3be into linkedin:master May 4, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants