Skip to content

Add per-datastream SLA metric opt-out via system.disableSlaMetric #1005

Merged
manishGoyalCode merged 11 commits into
linkedin:masterfrom
manishGoyalCode:mg/IntroduceSLAExcludeDS
May 5, 2026
Merged

Add per-datastream SLA metric opt-out via system.disableSlaMetric #1005
manishGoyalCode merged 11 commits into
linkedin:masterfrom
manishGoyalCode:mg/IntroduceSLAExcludeDS

Conversation

@manishGoyalCode
Copy link
Copy Markdown
Collaborator

@manishGoyalCode manishGoyalCode commented Apr 29, 2026

Summary

  • Introduces a new datastream metadata flag system.disableSlaMetric that lets individual streams opt out of SLA metric emission.
  • When enabled, EventProducer suppresses both primary and alternate SLA counters (eventsProducedWithinSla / eventsProducedOutsideSla) and redirects the latency histogram from eventsLatencyMs to eventsLatencyMsSlaIneligible, so lag
    alerts wired to eventsLatencyMs do not fire for opted-out streams while their latency curve remains observable.
  • The flag is read once at construction (getDisableSlaMetric) and gated through the existing shouldEmitMetric() chokepoint, keeping all suppression conditions (grace period + opt-out) in one place.
  • Honored only for non-deduped tasks (single datastream owners). Deduped tasks short-circuit to false so one stream's opt-out cannot suppress eventsLatencyMs for every stream sharing the EventProducer.

Test plan

  • Unit tests cover: opt-out suppresses SLA counters and redirects latency histogram; deduped tasks ignore the flag; default behavior unchanged when flag is absent or false.
  • Verify on a canary stream with system.disableSlaMetric=true that eventsLatencyMs / eventsProducedWithinSla / eventsProducedOutsideSla are not emitted and eventsLatencyMsSlaIneligible is.
  • Verify a regular stream alongside it still emits the standard SLA metrics.

@manishGoyalCode manishGoyalCode changed the title Added sla excluded ds support Add per-datastream opt-out for eventsLatencyMs SLA metric Apr 29, 2026
@manishGoyalCode manishGoyalCode marked this pull request as ready for review April 29, 2026 05:44
Return false up front when the task hosts multiple datastreams, instead
of parsing metadata and then conditionally overriding. Drops the warning
log per review feedback for readability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
akshayrai
akshayrai previously approved these changes Apr 29, 2026
mittalprince added a commit to mittalprince/brooklin that referenced this pull request May 4, 2026
During the CDC catch-up grace window, redirect the source-to-destination latency
histogram from eventsLatencyMs to slaExcludedLatencyMs. Lag alerts wired to the
former no longer fire on initial backlog drain; the catch-up curve remains
observable on the dedicated histogram.
Uses the same metric name as PR linkedin#1005 (Manish's per-datastream opt-out) so
operators have a single histogram for any latency intentionally excluded from
SLA scoring, regardless of trigger.
- Add SLA_EXCLUDED_LATENCY_MS_STRING constant.
- Pick the histogram name in reportMetrics based on isWithinSlaGracePeriod().
- Register slaExcludedLatencyMs in getMetricInfos with the same percentiles as
eventsLatencyMs.
- Updated testLatencyHistogramStillFiresDuringGracePeriod (renamed to
testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod) to assert the
redirect.
- Added testLatencyHistogramFiresOnPrimaryAfterGracePeriod to assert post-grace
observations go back to eventsLatencyMs.
khandelwal-ayush pushed a commit that referenced this pull request May 4, 2026
…rnate SLA, redirect latency to newlyOnboardedLatencyMs) (#1004)

* Default to EARLIEST on no-offset for CDC bootstrap and add SLA grace period for new CDC streams

- AbstractKafkaBasedConnectorTask: change handleNoOffsetForPartitionException to default
to EARLIEST instead of LATEST so CDC consumers don't lose bootstrap snapshot + change
events when a partition has no committed offset. Behaviour unchanged when
AUTO_OFFSET_RESET_CONFIG is set explicitly (e.g. KafkaMirrorMaker).
- EventProducer: add newStreamSlaGracePeriodMs (default 2h). New CDC streams have large
initial lag while draining the bootstrap window from EARLIEST, which would otherwise
trigger false SLA violations. SLA counters are suppressed during the grace window;
latency histograms and warning logs continue to fire so observability is preserved.
- Restrict the grace gate to CDC sources via _sourceDatabase != null (single-slash URI
scheme) so MirrorMaker streams report SLA from the first event.
- Use min(creationMs) across deduped datastreams on a task so a freshly deduped stream
cannot suppress SLA on long-running siblings.
- 9 new unit tests in TestEventProducer covering grace ON/OFF, MM exclusion, fail-open
on missing/malformed metadata, latency-still-fires-during-grace, custom config override,
and deduped-task min/max scenarios.

* Address review feedback

- AbstractKafkaBasedConnectorTask: rewrite the EARLIEST-default comment to reflect the
actual reason (reusing existing snapshot/data when no committed offset is found), drop
the misleading "all records including bootstrap" line and the Jira ticket reference.
- EventProducer: gate stream-creation-time parsing on _sourceDatabase != null at
construction so BMM / Inlogs streams skip the work entirely. Move source-database
parsing forward in the constructor so it is available before the grace-period init.
- EventProducer: extract the grace-period check into isWithinSlaGracePeriod() and the
CREATION_MS parsing into parseStreamCreationTimeMs(task). Drop the redundant
_streamCreationTimeMs > 0 guard in the gate; the time arithmetic short-circuits to
false when creation time is 0.
- All 15 tests in TestEventProducer pass.

* Address follow-up review feedback

- isWithinSlaGracePeriod: explicit isCdcSource() check at the top so the CDC-only
semantics are stated in the method rather than implied by the constructor leaving
_streamCreationTimeMs at 0 for non-CDC sources.
- Drop the constructor gate around parseStreamCreationTimeMs; the method-level check
is now the single source of truth. Move _sourceDatabase parsing back to its
original position (no longer needs to be early).
- Reword "bootstrap-catch-up" / "initial bootstrap backlog" to "cdc-catch-up" /
"initial cdc backlog" — the topic is not log-compacted and the consumer is
reading CDC data, not a bootstrap snapshot.
- All 15 tests in TestEventProducer pass.

* Revert AbstractKafkaBasedConnectorTask change

Per follow-up review discussion: this base class is no longer used by LinkedIn-internal
CDC connectors (migrated to AbstractXinfraBasedConnectorTask), so changing the OSS
default introduces churn for no behavioural benefit. Keep the original LATEST default
to avoid confusion for downstream OSS consumers; CDC's EARLIEST behaviour is handled
in the internal connector subclass. PR now scoped to the EventProducer SLA grace-period
feature only.

* Keep alternate-SLA emission active during grace period

Suppress only the primary SLA pair while a CDC stream is in its grace window; the
alternate-SLA pair (eventsProducedWithinAlternateSla / eventsProducedOutsideAlternateSla,
default 3 minute threshold) keeps emitting throughout. Operators retain a coarser
end-to-end SLA signal during the initial CDC catch-up instead of seeing a gap on
both dashboards.
Updated tests testSlaGraceActiveForNewCdcStream, testLatencyHistogramStillFiresDuringGracePeriod,
and testSlaGraceDedupedTaskAllStreamsNew to assert the alternate-SLA counter is
created during grace while the primary remains suppressed.

* Redirect eventsLatencyMs to slaExcludedLatencyMs during grace period

During the CDC catch-up grace window, redirect the source-to-destination latency
histogram from eventsLatencyMs to slaExcludedLatencyMs. Lag alerts wired to the
former no longer fire on initial backlog drain; the catch-up curve remains
observable on the dedicated histogram.
Uses the same metric name as PR #1005 (Manish's per-datastream opt-out) so
operators have a single histogram for any latency intentionally excluded from
SLA scoring, regardless of trigger.
- Add SLA_EXCLUDED_LATENCY_MS_STRING constant.
- Pick the histogram name in reportMetrics based on isWithinSlaGracePeriod().
- Register slaExcludedLatencyMs in getMetricInfos with the same percentiles as
eventsLatencyMs.
- Updated testLatencyHistogramStillFiresDuringGracePeriod (renamed to
testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod) to assert the
redirect.
- Added testLatencyHistogramFiresOnPrimaryAfterGracePeriod to assert post-grace
observations go back to eventsLatencyMs.

* Route alternate-SLA emission to slaExcluded counters during grace, add shouldEmitSlaMetric wrapper

- During grace, suppress both the regular primary and alternate SLA counter pairs.
Alternate threshold evaluation is now redirected to slaExcludedWithinAlternateSla /
slaExcludedOutsideAlternateSla so operators can still track grace-period stats on a
dedicated counter pair without polluting the regular alternate-SLA dashboard.
- Add shouldEmitSlaMetric() wrapper around isWithinSlaGracePeriod() so callers gate
on a single function. Future SLA-suppression conditions (e.g. per-datastream opt-out
flags) can be ORed in here without touching call sites.
- reportMetrics now uses shouldEmitSlaMetric() to pick the latency histogram name and
to gate SLA counter emission. Else-branch routes alternate evaluation to the new
slaExcluded counters.
- Register slaExcluded{Within,Outside}AlternateSla in getMetricInfos().
- Update tests so they assert the new routing: regular alt-SLA counter must NOT exist
during grace; slaExcludedWithinAlternateSla must exist with count 1 (latency was
within alt threshold).

* Address review feedback: rename to newlyOnboardedLatencyMs and drop Sla from grace-period names

- Rename slaExcludedLatencyMs to newlyOnboardedLatencyMs (review comment r3179982442).
This metric is conceptually about the newly-onboarded-stream catch-up window, not
about general SLA exclusion, so the name should reflect its actual role.
- Drop the Sla qualifier from the grace-period field, constants, config key, and
helper method (review comment r3179993056). The same grace window now drives both
SLA suppression and the latency-histogram redirect, so the name should not imply
SLA-only scope:
NEW_STREAM_SLA_GRACE_PERIOD_MS  -> NEW_STREAM_GRACE_PERIOD_MS
DEFAULT_NEW_STREAM_SLA_GRACE_PERIOD_MS -> DEFAULT_NEW_STREAM_GRACE_PERIOD_MS
_newStreamSlaGracePeriodMs       -> _newStreamGracePeriodMs
isWithinSlaGracePeriod()         -> isWithinGracePeriod()
shouldEmitSlaMetric()            -> shouldEmitMetric()
server.properties key newStreamSlaGracePeriodMs -> newStreamGracePeriodMs
- Drop the alternate-SLA tracking redirect during grace (the prior else-branch that
routed alt-SLA evaluation to slaExcluded* counters). Alternate SLA is now simply
suppressed during grace alongside primary, matching the simpler observability model.
- Remove now-unused SLA_EXCLUDED_WITHIN_ALTERNATE_SLA and SLA_EXCLUDED_OUTSIDE_ALTERNATE_SLA
constants and their getMetricInfos() registrations.
- Polish field, javadoc, and inline comments for accuracy now that the gate covers both
SLA and latency emission.
- Tests updated to assert simpler suppress-only behavior; renamed
testLatencyHistogramRedirectedToSlaExcludedDuringGracePeriod to
testLatencyHistogramRedirectedToNewlyOnboardedDuringGracePeriod.
- All 16 tests in TestEventProducer pass under JDK 8.

* Drop internal ticket reference from TestEventProducer to satisfy checkstyle

Open-source repo's checkstyle rule disallows referencing internal ticket names.
The TestEventProducer.java header comment had a leftover internal ticket tag
which slipped through. Removing it.

* Rename newlyOnboardedLatencyMs to eventsLatencyMsSlaIneligible

Histogram constant: NEWLY_ONBOARDED_LATENCY_MS_STRING -> EVENTS_LATENCY_MS_SLA_INELIGIBLE_STRING.
Histogram value: newlyOnboardedLatencyMs -> eventsLatencyMsSlaIneligible.
Test method renamed accordingly.
The new name describes the metric's role (latency observations the SLA path is ineligible to
score) rather than the trigger (newly onboarded). Aligns with internal naming preference; the
metric still serves the same purpose - holds redirected latency while shouldEmitMetric() is
suppressing primary/alternate SLA emission for newly created CDC streams.

---------

Co-authored-by: mittalprince <26388073+mittalprince@users.noreply.github.com>
@manishGoyalCode manishGoyalCode changed the title Add per-datastream opt-out for eventsLatencyMs SLA metric Add per-datastream SLA metric opt-out via system.disableSlaMetric May 4, 2026
akshayrai
akshayrai previously approved these changes May 4, 2026
Copy link
Copy Markdown
Collaborator

@kanishkjaiswal2015 kanishkjaiswal2015 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@manishGoyalCode manishGoyalCode merged commit 5ef2d40 into linkedin:master May 5, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants