Skip to content

lib,src,test: fix race during tracing toggles#441

Open
santigimeno wants to merge 2 commits intonode-v24.x-nsolid-v6.xfrom
santi/tracing_crash_enable_disable
Open

lib,src,test: fix race during tracing toggles#441
santigimeno wants to merge 2 commits intonode-v24.x-nsolid-v6.xfrom
santi/tracing_crash_enable_disable

Conversation

@santigimeno
Copy link
Member

@santigimeno santigimeno commented Mar 19, 2026

Keep JS tracing decisions tied to the env-local tracing state materialized by the native toggle path instead of rereading live flags in multiple places.

Tracing flags can be updated from any thread, so reading them directly during span creation can observe a mid-flight toggle. That could make Tracer.startSpan() return a NonRecordingSpan after the caller had already passed an internal tracing gate, leading to crashes like TypeError: span._pushSpanDataString is not a function.

Pass the full tracing bitmask from EnvList::update_tracing_flags() into JS, cache it in lib/internal/nsolid_trace.js, and make the internal OTel code consume that cached per-thread state. This removes the split-brain behavior where callback binding/unbinding followed flagsUpdated but Tracer.startSpan() could independently observe a later disable and return a NonRecordingSpan.

For internal spans, stop Tracer.startSpan() from rechecking the current trace flags after the caller has already crossed an internal tracing gate. That keeps internal span creation locally consistent while tracing is being enabled and disabled and avoids crashes in code paths that expect a real N|Solid span object.

Also add targeted repro coverage for the toggle race:

  • extend the tracing test addon with configurable setupTracing(flags), stopTracing(), and skipExpectedTracesCheck()
  • add a deterministic fetch-based repro that toggles HTTP client tracing while concurrent fetch() traffic is in flight
  • add a grpc-based repro harness for tracing reconfiguration and cover both fetch and http traffic
  • teach the grpc agent test client how to generate fetch transactions

Together these changes make trace flags materialize per env/thread in JS, preserve the caller's local tracing decision for internal spans, and add regression coverage for the tracing enable/disable race.

Summary by CodeRabbit

  • Refactor

    • Improved trace flag caching so tracing toggles now propagate updated flags to runtime components.
    • Adjusted span creation logic so internal spans bypass global trace-disable gating.
  • Tests

    • Added race condition tests for concurrent HTTP fetch and gRPC tracing scenarios.
    • Enhanced test tooling with fetch transaction support and runtime trace control utilities.

@santigimeno santigimeno requested a review from RafaelGSS March 19, 2026 15:38
@santigimeno santigimeno self-assigned this Mar 19, 2026
@coderabbitai
Copy link

coderabbitai bot commented Mar 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f2b824bc-7cfd-49ad-9595-02144bd6d02c

📥 Commits

Reviewing files that changed from the base of the PR and between 03ccece and c6ab0ff.

📒 Files selected for processing (9)
  • lib/internal/nsolid_trace.js
  • lib/internal/otel/core.js
  • lib/internal/otel/trace.js
  • src/nsolid/nsolid_api.cc
  • test/addons/nsolid-tracing/binding.cc
  • test/addons/nsolid-tracing/test-otel-fetch-enable-disable-race.js
  • test/agents/test-grpc-tracing-race.mjs
  • test/common/nsolid-grpc-agent/client.js
  • test/parallel/test-nsolid-file-handle-count.js
🚧 Files skipped from review as they are similar to previous changes (5)
  • lib/internal/nsolid_trace.js
  • test/agents/test-grpc-tracing-race.mjs
  • test/addons/nsolid-tracing/test-otel-fetch-enable-disable-race.js
  • test/addons/nsolid-tracing/binding.cc
  • lib/internal/otel/trace.js

Walkthrough

The diff centralizes trace flags into a cached currentTraceFlags in lib/internal/nsolid_trace.js, exposes getTraceFlags(), changes the native binding toggle callback to pass a uint32 flags value, and updates OTEL/tracer code and tests to use the cached flags and handle flags-updated events with the flags payload.

Changes

Cohort / File(s) Summary
Trace Flags Cache
lib/internal/nsolid_trace.js
Add module-level currentTraceFlags (initialized from binding), use it in span generation, export getTraceFlags(), update toggle callback to accept flags and emit flagsUpdated with payload.
OTEL Integration
lib/internal/otel/core.js, lib/internal/otel/trace.js
Replace direct reads of binding trace flags with getTraceFlags(). Update flagsUpdated handlers to accept flags arg and enable/disable behavior based on it; adjust early-return logic for span creation to consult cached flags and respect options.internal.
Native Binding Change
src/nsolid/nsolid_api.cc
Pass Uint32::New(isolate, flags) into env->nsolid_toggle_tracing_fn() (uint32) instead of a boolean.
Test Binding Infrastructure
test/addons/nsolid-tracing/binding.cc
setupTracing accepts optional uint32 flags, prevent duplicate init, register atexit once, conditional at_exit_cb early-return based on check_expected_traces_; add exported stopTracing() and skipExpectedTracesCheck().
Race Tests (HTTP/fetch & gRPC)
test/addons/nsolid-tracing/test-otel-fetch-enable-disable-race.js, test/agents/test-grpc-tracing-race.mjs
New tests that repeatedly toggle tracing on/off while issuing concurrent fetch/HTTP or gRPC tracing requests to surface race conditions.
gRPC Agent: Fetch Path
test/common/nsolid-grpc-agent/client.js
Add execFetchTransaction() and a msg.kind === 'fetch' path to run many concurrent fetch POST requests as part of tracing tests.
Minor Test Adjustment
test/parallel/test-nsolid-file-handle-count.js
Wrap resolved value handler with explicit lambda to ensure closePromiseFd receives the file handle argument.

Sequence Diagram

sequenceDiagram
    participant NB as NativeBinding
    participant NT as nsolid_trace
    participant OC as OTEL_Core
    participant TR as Tracer

    NB->>NT: toggleTracingFn(flags: uint32)
    activate NT
    NT->>NT: currentTraceFlags = flags
    NT->>OC: emit flagsUpdated(flags)
    deactivate NT

    activate OC
    alt flags > 0
        OC->>OC: enable API/instrumentation
    else flags == 0
        OC->>OC: disable API/instrumentation
    end
    deactivate OC

    TR->>NT: getTraceFlags()
    activate NT
    NT-->>TR: currentTraceFlags
    deactivate NT

    alt !options.internal && flags == 0
        TR->>TR: return invalidSpanContext
    else
        TR->>TR: create span using flags bitmask
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested reviewers

  • RafaelGSS

Poem

🐰 A twitch and a thunk, the flags now stay near,

I cache them with care so the tracer can hear.
Toggles that sprint from the native-side land,
Events carry numbers now—steady and planned.
Fetches and GRPC dance; tests watch the lair. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main objective of the PR: fixing a race condition that occurs during tracing toggles across multiple files (lib, src, test).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch santi/tracing_crash_enable_disable
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/common/nsolid-grpc-agent/client.js`:
- Around line 127-129: The CLI top-level trace handling is missing support for
the "fetch" trace, so invoking the helper with args.values.trace === 'fetch'
no-ops; update the CLI flow that checks args.values.trace (the direct path
starting around where args.values.trace is evaluated) to handle 'fetch' by
invoking the same logic as handleTrace() does for fetch (i.e., call
execFetchTransaction() or delegate to handleTrace('fetch')), ensuring the code
paths that currently only handle other trace values also include a branch for
'fetch' so -t fetch triggers the expected behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f5e7f60a-979e-47db-a74a-f6aa8b1ec80b

📥 Commits

Reviewing files that changed from the base of the PR and between d41bff4 and 03ccece.

📒 Files selected for processing (8)
  • lib/internal/nsolid_trace.js
  • lib/internal/otel/core.js
  • lib/internal/otel/trace.js
  • src/nsolid/nsolid_api.cc
  • test/addons/nsolid-tracing/binding.cc
  • test/addons/nsolid-tracing/test-otel-fetch-enable-disable-race.js
  • test/agents/test-grpc-tracing-race.mjs
  • test/common/nsolid-grpc-agent/client.js

Keep JS tracing decisions tied to the env-local tracing state
materialized by the native toggle path instead of rereading live flags
in multiple places.

Tracing flags can be updated from any thread, so reading them directly
during span creation can observe a mid-flight toggle. That could make
Tracer.startSpan() return a NonRecordingSpan after the caller had
already passed an internal tracing gate, leading to crashes like
TypeError: span._pushSpanDataString is not a function.

Pass the full tracing bitmask from EnvList::update_tracing_flags() into
JS, cache it in lib/internal/nsolid_trace.js, and make the internal
OTel code consume that cached per-thread state. This removes the
split-brain behavior where callback binding/unbinding followed
flagsUpdated but Tracer.startSpan() could independently observe a later
disable and return a NonRecordingSpan.

For internal spans, stop Tracer.startSpan() from rechecking the current
trace flags after the caller has already crossed an internal tracing
gate. That keeps internal span creation locally consistent while
tracing is being enabled and disabled and avoids crashes in code paths
that expect a real N|Solid span object.

Also add targeted repro coverage for the toggle race:
- extend the tracing test addon with configurable setupTracing(flags),
  stopTracing(), and skipExpectedTracesCheck()
- add a deterministic fetch-based repro that toggles HTTP client
  tracing while concurrent fetch() traffic is in flight
- add a grpc-based repro harness for tracing reconfiguration and cover
  both fetch and http traffic
- teach the grpc agent test client how to generate fetch transactions

Together these changes make trace flags materialize per env/thread in
JS, preserve the caller's local tracing decision for internal spans,
and add regression coverage for the tracing enable/disable race.
@santigimeno santigimeno force-pushed the santi/tracing_crash_enable_disable branch from 03ccece to c6ab0ff Compare March 19, 2026 16:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant