Add grace period for newly created CDC streams (suppress primary/alternate SLA, redirect latency to newlyOnboardedLatencyMs)#1004
Merged
khandelwal-ayush merged 10 commits intoMay 4, 2026
Conversation
…period for new CDC streams - AbstractKafkaBasedConnectorTask: change handleNoOffsetForPartitionException to default to EARLIEST instead of LATEST so CDC consumers don't lose bootstrap snapshot + change events when a partition has no committed offset. Behaviour unchanged when AUTO_OFFSET_RESET_CONFIG is set explicitly (e.g. KafkaMirrorMaker). - EventProducer: add newStreamSlaGracePeriodMs (default 2h). New CDC streams have large initial lag while draining the bootstrap window from EARLIEST, which would otherwise trigger false SLA violations. SLA counters are suppressed during the grace window; latency histograms and warning logs continue to fire so observability is preserved. - Restrict the grace gate to CDC sources via _sourceDatabase != null (single-slash URI scheme) so MirrorMaker streams report SLA from the first event. - Use min(creationMs) across deduped datastreams on a task so a freshly deduped stream cannot suppress SLA on long-running siblings. - 9 new unit tests in TestEventProducer covering grace ON/OFF, MM exclusion, fail-open on missing/malformed metadata, latency-still-fires-during-grace, custom config override, and deduped-task min/max scenarios.
b6c5c4f to
7912403
Compare
akshayrai
reviewed
Apr 28, 2026
- AbstractKafkaBasedConnectorTask: rewrite the EARLIEST-default comment to reflect the actual reason (reusing existing snapshot/data when no committed offset is found), drop the misleading "all records including bootstrap" line and the Jira ticket reference. - EventProducer: gate stream-creation-time parsing on _sourceDatabase != null at construction so BMM / Inlogs streams skip the work entirely. Move source-database parsing forward in the constructor so it is available before the grace-period init. - EventProducer: extract the grace-period check into isWithinSlaGracePeriod() and the CREATION_MS parsing into parseStreamCreationTimeMs(task). Drop the redundant _streamCreationTimeMs > 0 guard in the gate; the time arithmetic short-circuits to false when creation time is 0. - All 15 tests in TestEventProducer pass.
akshayrai
reviewed
Apr 30, 2026
- isWithinSlaGracePeriod: explicit isCdcSource() check at the top so the CDC-only semantics are stated in the method rather than implied by the constructor leaving _streamCreationTimeMs at 0 for non-CDC sources. - Drop the constructor gate around parseStreamCreationTimeMs; the method-level check is now the single source of truth. Move _sourceDatabase parsing back to its original position (no longer needs to be early). - Reword "bootstrap-catch-up" / "initial bootstrap backlog" to "cdc-catch-up" / "initial cdc backlog" — the topic is not log-compacted and the consumer is reading CDC data, not a bootstrap snapshot. - All 15 tests in TestEventProducer pass.
akshayrai
previously approved these changes
Apr 30, 2026
Per follow-up review discussion: this base class is no longer used by LinkedIn-internal CDC connectors (migrated to AbstractXinfraBasedConnectorTask), so changing the OSS default introduces churn for no behavioural benefit. Keep the original LATEST default to avoid confusion for downstream OSS consumers; CDC's EARLIEST behaviour is handled in the internal connector subclass. PR now scoped to the EventProducer SLA grace-period feature only.
Suppress only the primary SLA pair while a CDC stream is in its grace window; the alternate-SLA pair (eventsProducedWithinAlternateSla / eventsProducedOutsideAlternateSla, default 3 minute threshold) keeps emitting throughout. Operators retain a coarser end-to-end SLA signal during the initial CDC catch-up instead of seeing a gap on both dashboards. Updated tests testSlaGraceActiveForNewCdcStream, testLatencyHistogramStillFiresDuringGracePeriod, and testSlaGraceDedupedTaskAllStreamsNew to assert the alternate-SLA counter is created during grace while the primary remains suppressed.
During the CDC catch-up grace window, redirect the source-to-destination latency histogram from eventsLatencyMs to slaExcludedLatencyMs. Lag alerts wired to the former no longer fire on initial backlog drain; the catch-up curve remains observable on the dedicated histogram. Uses the same metric name as PR linkedin#1005 (Manish's per-datastream opt-out) so operators have a single histogram for any latency intentionally excluded from SLA scoring, regardless of trigger. - Add SLA_EXCLUDED_LATENCY_MS_STRING constant. - Pick the histogram name in reportMetrics based on isWithinSlaGracePeriod(). - Register slaExcludedLatencyMs in getMetricInfos with the same percentiles as eventsLatencyMs. - Updated testLatencyHistogramStillFiresDuringGracePeriod (renamed to testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod) to assert the redirect. - Added testLatencyHistogramFiresOnPrimaryAfterGracePeriod to assert post-grace observations go back to eventsLatencyMs.
…d shouldEmitSlaMetric wrapper
- During grace, suppress both the regular primary and alternate SLA counter pairs.
Alternate threshold evaluation is now redirected to slaExcludedWithinAlternateSla /
slaExcludedOutsideAlternateSla so operators can still track grace-period stats on a
dedicated counter pair without polluting the regular alternate-SLA dashboard.
- Add shouldEmitSlaMetric() wrapper around isWithinSlaGracePeriod() so callers gate
on a single function. Future SLA-suppression conditions (e.g. per-datastream opt-out
flags) can be ORed in here without touching call sites.
- reportMetrics now uses shouldEmitSlaMetric() to pick the latency histogram name and
to gate SLA counter emission. Else-branch routes alternate evaluation to the new
slaExcluded counters.
- Register slaExcluded{Within,Outside}AlternateSla in getMetricInfos().
- Update tests so they assert the new routing: regular alt-SLA counter must NOT exist
during grace; slaExcludedWithinAlternateSla must exist with count 1 (latency was
within alt threshold).
…la from grace-period names - Rename slaExcludedLatencyMs to newlyOnboardedLatencyMs (review comment r3179982442). This metric is conceptually about the newly-onboarded-stream catch-up window, not about general SLA exclusion, so the name should reflect its actual role. - Drop the Sla qualifier from the grace-period field, constants, config key, and helper method (review comment r3179993056). The same grace window now drives both SLA suppression and the latency-histogram redirect, so the name should not imply SLA-only scope: NEW_STREAM_SLA_GRACE_PERIOD_MS -> NEW_STREAM_GRACE_PERIOD_MS DEFAULT_NEW_STREAM_SLA_GRACE_PERIOD_MS -> DEFAULT_NEW_STREAM_GRACE_PERIOD_MS _newStreamSlaGracePeriodMs -> _newStreamGracePeriodMs isWithinSlaGracePeriod() -> isWithinGracePeriod() shouldEmitSlaMetric() -> shouldEmitMetric() server.properties key newStreamSlaGracePeriodMs -> newStreamGracePeriodMs - Drop the alternate-SLA tracking redirect during grace (the prior else-branch that routed alt-SLA evaluation to slaExcluded* counters). Alternate SLA is now simply suppressed during grace alongside primary, matching the simpler observability model. - Remove now-unused SLA_EXCLUDED_WITHIN_ALTERNATE_SLA and SLA_EXCLUDED_OUTSIDE_ALTERNATE_SLA constants and their getMetricInfos() registrations. - Polish field, javadoc, and inline comments for accuracy now that the gate covers both SLA and latency emission. - Tests updated to assert simpler suppress-only behavior; renamed testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod to testLatencyHistogramRedirectedToNewlyOnboardedDuringGracePeriod. - All 16 tests in TestEventProducer pass under JDK 8.
khandelwal-ayush
previously approved these changes
May 4, 2026
Collaborator
khandelwal-ayush
left a comment
There was a problem hiding this comment.
Looks good now.
harshOSS
previously approved these changes
May 4, 2026
…kstyle Open-source repo's checkstyle rule disallows referencing internal ticket names. The TestEventProducer.java header comment had a leftover internal ticket tag which slipped through. Removing it.
038de16
harshOSS
previously approved these changes
May 4, 2026
Histogram constant: NEWLY_ONBOARDED_LATENCY_MS_STRING -> EVENTS_LATENCY_MS_SLA_INELIGIBLE_STRING. Histogram value: newlyOnboardedLatencyMs -> eventsLatencyMsSlaIneligible. Test method renamed accordingly. The new name describes the metric's role (latency observations the SLA path is ineligible to score) rather than the trigger (newly onboarded). Aligns with internal naming preference; the metric still serves the same purpose - holds redirected latency while shouldEmitMetric() is suppressing primary/alternate SLA emission for newly created CDC streams.
akshayrai
approved these changes
May 4, 2026
khandelwal-ayush
approved these changes
May 4, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
newStreamGracePeriodMsconfig inEventProducer(default 2h, operator-overridable viabrooklin.server.eventProducer.newStreamGracePeriodMsinserver.properties). While newly created CDC streams drain their initial CDC backlog from EARLIEST, source-to-destination latencies are inherently large;without a grace window every event would be reported as primary-SLA-violating and lag alerts wired to
eventsLatencyMswould fire.eventsProducedWithinSla/eventsProducedOutsideSla) is suppressed.eventsProducedWithinAlternateSla/eventsProducedOutsideAlternateSla) is suppressed.eventsLatencyMsis redirected toeventsLatencyMsSlaIneligibleso lag alerts wired toeventsLatencyMsdo not fire on initial backlog drain. The catch-up curve stays observable on the dedicated histogram.isCdcSource()check (single-slash-URI heuristic, the same one used by the throughput-attribution metrics). MirrorMaker (kafka://) and other non-CDC sources are unaffected — neither SLA suppression nor the latency-histogram redirect applies.DatastreamTaskvia dedup, use the oldest creation time across the task so a freshly deduped stream cannot suppress metrics on long-running siblings.shouldEmitMetric()wrapper. Future suppression conditions (e.g. per-datastream opt-out flags) should be ORed into that one method so call sites stay simple.Notes / scope
AbstractKafkaBasedConnectorTaskfrom LATEST to EARLIEST, but that change has been reverted (390132ee). LinkedIn-internal CDC connectors no longer use this base class (migrated toAbstractXinfraBasedConnectorTask); CDC's EARLIESTbehaviour is handled in the internal connector subclass.
slaExcludedLatencyMsfor a per-datastream opt-out flag (permanent, manual). This PR introducesnewlyOnboardedLatencyMsfor an automatic time-bounded grace window for new CDC streams. Different histograms, different triggers, no merge conflict.Test plan
TestEventProducercovering:CREATION_MSCREATION_MSis missing or malformedeventsLatencyMsredirected tonewlyOnboardedLatencyMsduring graceeventsLatencyMsfires (andnewlyOnboardedLatencyMsdoes not) post-gracenewStreamGracePeriodMsconfig override honouredTestEventProducerpass under JDK 8 (project's required toolchain):JAVA_HOME=<jdk8> ./gradlew :datastream-server:test --tests com.linkedin.datastream.server.TestEventProducer