Add per-datastream SLA metric opt-out via system.disableSlaMetric #1005
Merged
manishGoyalCode merged 11 commits intoMay 5, 2026
Conversation
added 2 commits
April 29, 2026 11:03
akshayrai
reviewed
Apr 29, 2026
akshayrai
reviewed
Apr 29, 2026
Return false up front when the task hosts multiple datastreams, instead of parsing metadata and then conditionally overriding. Drops the warning log per review feedback for readability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
akshayrai
previously approved these changes
Apr 29, 2026
mittalprince
added a commit
to mittalprince/brooklin
that referenced
this pull request
May 4, 2026
During the CDC catch-up grace window, redirect the source-to-destination latency histogram from eventsLatencyMs to slaExcludedLatencyMs. Lag alerts wired to the former no longer fire on initial backlog drain; the catch-up curve remains observable on the dedicated histogram. Uses the same metric name as PR linkedin#1005 (Manish's per-datastream opt-out) so operators have a single histogram for any latency intentionally excluded from SLA scoring, regardless of trigger. - Add SLA_EXCLUDED_LATENCY_MS_STRING constant. - Pick the histogram name in reportMetrics based on isWithinSlaGracePeriod(). - Register slaExcludedLatencyMs in getMetricInfos with the same percentiles as eventsLatencyMs. - Updated testLatencyHistogramStillFiresDuringGracePeriod (renamed to testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod) to assert the redirect. - Added testLatencyHistogramFiresOnPrimaryAfterGracePeriod to assert post-grace observations go back to eventsLatencyMs.
2 tasks
khandelwal-ayush
pushed a commit
that referenced
this pull request
May 4, 2026
…rnate SLA, redirect latency to newlyOnboardedLatencyMs) (#1004) * Default to EARLIEST on no-offset for CDC bootstrap and add SLA grace period for new CDC streams - AbstractKafkaBasedConnectorTask: change handleNoOffsetForPartitionException to default to EARLIEST instead of LATEST so CDC consumers don't lose bootstrap snapshot + change events when a partition has no committed offset. Behaviour unchanged when AUTO_OFFSET_RESET_CONFIG is set explicitly (e.g. KafkaMirrorMaker). - EventProducer: add newStreamSlaGracePeriodMs (default 2h). New CDC streams have large initial lag while draining the bootstrap window from EARLIEST, which would otherwise trigger false SLA violations. SLA counters are suppressed during the grace window; latency histograms and warning logs continue to fire so observability is preserved. - Restrict the grace gate to CDC sources via _sourceDatabase != null (single-slash URI scheme) so MirrorMaker streams report SLA from the first event. - Use min(creationMs) across deduped datastreams on a task so a freshly deduped stream cannot suppress SLA on long-running siblings. - 9 new unit tests in TestEventProducer covering grace ON/OFF, MM exclusion, fail-open on missing/malformed metadata, latency-still-fires-during-grace, custom config override, and deduped-task min/max scenarios. * Address review feedback - AbstractKafkaBasedConnectorTask: rewrite the EARLIEST-default comment to reflect the actual reason (reusing existing snapshot/data when no committed offset is found), drop the misleading "all records including bootstrap" line and the Jira ticket reference. - EventProducer: gate stream-creation-time parsing on _sourceDatabase != null at construction so BMM / Inlogs streams skip the work entirely. Move source-database parsing forward in the constructor so it is available before the grace-period init. - EventProducer: extract the grace-period check into isWithinSlaGracePeriod() and the CREATION_MS parsing into parseStreamCreationTimeMs(task). Drop the redundant _streamCreationTimeMs > 0 guard in the gate; the time arithmetic short-circuits to false when creation time is 0. - All 15 tests in TestEventProducer pass. * Address follow-up review feedback - isWithinSlaGracePeriod: explicit isCdcSource() check at the top so the CDC-only semantics are stated in the method rather than implied by the constructor leaving _streamCreationTimeMs at 0 for non-CDC sources. - Drop the constructor gate around parseStreamCreationTimeMs; the method-level check is now the single source of truth. Move _sourceDatabase parsing back to its original position (no longer needs to be early). - Reword "bootstrap-catch-up" / "initial bootstrap backlog" to "cdc-catch-up" / "initial cdc backlog" — the topic is not log-compacted and the consumer is reading CDC data, not a bootstrap snapshot. - All 15 tests in TestEventProducer pass. * Revert AbstractKafkaBasedConnectorTask change Per follow-up review discussion: this base class is no longer used by LinkedIn-internal CDC connectors (migrated to AbstractXinfraBasedConnectorTask), so changing the OSS default introduces churn for no behavioural benefit. Keep the original LATEST default to avoid confusion for downstream OSS consumers; CDC's EARLIEST behaviour is handled in the internal connector subclass. PR now scoped to the EventProducer SLA grace-period feature only. * Keep alternate-SLA emission active during grace period Suppress only the primary SLA pair while a CDC stream is in its grace window; the alternate-SLA pair (eventsProducedWithinAlternateSla / eventsProducedOutsideAlternateSla, default 3 minute threshold) keeps emitting throughout. Operators retain a coarser end-to-end SLA signal during the initial CDC catch-up instead of seeing a gap on both dashboards. Updated tests testSlaGraceActiveForNewCdcStream, testLatencyHistogramStillFiresDuringGracePeriod, and testSlaGraceDedupedTaskAllStreamsNew to assert the alternate-SLA counter is created during grace while the primary remains suppressed. * Redirect eventsLatencyMs to slaExcludedLatencyMs during grace period During the CDC catch-up grace window, redirect the source-to-destination latency histogram from eventsLatencyMs to slaExcludedLatencyMs. Lag alerts wired to the former no longer fire on initial backlog drain; the catch-up curve remains observable on the dedicated histogram. Uses the same metric name as PR #1005 (Manish's per-datastream opt-out) so operators have a single histogram for any latency intentionally excluded from SLA scoring, regardless of trigger. - Add SLA_EXCLUDED_LATENCY_MS_STRING constant. - Pick the histogram name in reportMetrics based on isWithinSlaGracePeriod(). - Register slaExcludedLatencyMs in getMetricInfos with the same percentiles as eventsLatencyMs. - Updated testLatencyHistogramStillFiresDuringGracePeriod (renamed to testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod) to assert the redirect. - Added testLatencyHistogramFiresOnPrimaryAfterGracePeriod to assert post-grace observations go back to eventsLatencyMs. * Route alternate-SLA emission to slaExcluded counters during grace, add shouldEmitSlaMetric wrapper - During grace, suppress both the regular primary and alternate SLA counter pairs. Alternate threshold evaluation is now redirected to slaExcludedWithinAlternateSla / slaExcludedOutsideAlternateSla so operators can still track grace-period stats on a dedicated counter pair without polluting the regular alternate-SLA dashboard. - Add shouldEmitSlaMetric() wrapper around isWithinSlaGracePeriod() so callers gate on a single function. Future SLA-suppression conditions (e.g. per-datastream opt-out flags) can be ORed in here without touching call sites. - reportMetrics now uses shouldEmitSlaMetric() to pick the latency histogram name and to gate SLA counter emission. Else-branch routes alternate evaluation to the new slaExcluded counters. - Register slaExcluded{Within,Outside}AlternateSla in getMetricInfos(). - Update tests so they assert the new routing: regular alt-SLA counter must NOT exist during grace; slaExcludedWithinAlternateSla must exist with count 1 (latency was within alt threshold). * Address review feedback: rename to newlyOnboardedLatencyMs and drop Sla from grace-period names - Rename slaExcludedLatencyMs to newlyOnboardedLatencyMs (review comment r3179982442). This metric is conceptually about the newly-onboarded-stream catch-up window, not about general SLA exclusion, so the name should reflect its actual role. - Drop the Sla qualifier from the grace-period field, constants, config key, and helper method (review comment r3179993056). The same grace window now drives both SLA suppression and the latency-histogram redirect, so the name should not imply SLA-only scope: NEW_STREAM_SLA_GRACE_PERIOD_MS -> NEW_STREAM_GRACE_PERIOD_MS DEFAULT_NEW_STREAM_SLA_GRACE_PERIOD_MS -> DEFAULT_NEW_STREAM_GRACE_PERIOD_MS _newStreamSlaGracePeriodMs -> _newStreamGracePeriodMs isWithinSlaGracePeriod() -> isWithinGracePeriod() shouldEmitSlaMetric() -> shouldEmitMetric() server.properties key newStreamSlaGracePeriodMs -> newStreamGracePeriodMs - Drop the alternate-SLA tracking redirect during grace (the prior else-branch that routed alt-SLA evaluation to slaExcluded* counters). Alternate SLA is now simply suppressed during grace alongside primary, matching the simpler observability model. - Remove now-unused SLA_EXCLUDED_WITHIN_ALTERNATE_SLA and SLA_EXCLUDED_OUTSIDE_ALTERNATE_SLA constants and their getMetricInfos() registrations. - Polish field, javadoc, and inline comments for accuracy now that the gate covers both SLA and latency emission. - Tests updated to assert simpler suppress-only behavior; renamed testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod to testLatencyHistogramRedirectedToNewlyOnboardedDuringGracePeriod. - All 16 tests in TestEventProducer pass under JDK 8. * Drop internal ticket reference from TestEventProducer to satisfy checkstyle Open-source repo's checkstyle rule disallows referencing internal ticket names. The TestEventProducer.java header comment had a leftover internal ticket tag which slipped through. Removing it. * Rename newlyOnboardedLatencyMs to eventsLatencyMsSlaIneligible Histogram constant: NEWLY_ONBOARDED_LATENCY_MS_STRING -> EVENTS_LATENCY_MS_SLA_INELIGIBLE_STRING. Histogram value: newlyOnboardedLatencyMs -> eventsLatencyMsSlaIneligible. Test method renamed accordingly. The new name describes the metric's role (latency observations the SLA path is ineligible to score) rather than the trigger (newly onboarded). Aligns with internal naming preference; the metric still serves the same purpose - holds redirected latency while shouldEmitMetric() is suppressing primary/alternate SLA emission for newly created CDC streams. --------- Co-authored-by: mittalprince <26388073+mittalprince@users.noreply.github.com>
akshayrai
previously approved these changes
May 4, 2026
akshayrai
approved these changes
May 5, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
system.disableSlaMetricthat lets individual streams opt out of SLA metric emission.EventProducersuppresses both primary and alternate SLA counters (eventsProducedWithinSla/eventsProducedOutsideSla) and redirects the latency histogram fromeventsLatencyMstoeventsLatencyMsSlaIneligible, so lagalerts wired to
eventsLatencyMsdo not fire for opted-out streams while their latency curve remains observable.getDisableSlaMetric) and gated through the existingshouldEmitMetric()chokepoint, keeping all suppression conditions (grace period + opt-out) in one place.falseso one stream's opt-out cannot suppresseventsLatencyMsfor every stream sharing theEventProducer.Test plan
false.system.disableSlaMetric=truethateventsLatencyMs/eventsProducedWithinSla/eventsProducedOutsideSlaare not emitted andeventsLatencyMsSlaIneligibleis.