Skip to content

Conversation

@peytonr18
Copy link
Contributor

@peytonr18 peytonr18 commented Oct 15, 2025

This PR improves azure-init's tracing telemetry by ensuring proper span context is preserved in Hyper-V KVP diagnostics and refining span names for better observability. It also introduces a new locking mechanism in the event of concurrent KVP writing.

Background: Spans vs Events in azure-init

In our tracing architecture:

  • Spans represent units of work (typically functions) and create hierarchical context. They're created via the #[instrument] macro on functions or manually via tracing::info_span!().
  • Events are individual log messages emitted within spans using tracing::info!(), tracing::warn!(), etc.
    When a span is active, all events emitted within it inherit the span's context, creating a hierarchical trace. This allows support engineers to understand the call stack and execution flow when diagnosing issues.

The Problem

Our KVP filter was inadvertently filtering out span context, causing telemetry to lose important hierarchical information:

// Before (in default_kvp_filter)
[
   "WARN",
   "azure_init=INFO",  // Only this
   "libazureinit::config::success",
   // ... specific targets
]

Key realization: Tracing filters apply to both spans AND events. The filter azure_init=INFO allowed INFO-level events from the azure_init crate, but libazureinit=INFO was missing, which caused:

  1. Spans from libazureinit functions (created by #[instrument]) to be filtered out.
  2. Loss of span context for events, even if those specific events matched their custom target (e.g., libazureinit::config::success). This resulted in events being associated with the root span rather than the span created for a given function via the #[instrument] macro.

Example:

azure-init-0.1.1|026a3880-22aa-4f11-a31a-81f6e0c67cbb|INFO|root|e3704361-cbb1-4118-b91a-621bd79c6d64,Time: 2025-10-13T21:11:35.423Z | Event: message=Provisioning complete. File created: /var/lib/azure-init/026a3880-22aa-4f11-a31a-81f6e0c67cbb.provisioned`

now, we see mark_provisioning_complete and it's span_id instead of root and it's id:

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|mark_provisioning_complete|f2a495bf-b1f2-402a-aa02-665b428e9039Time: 2025-10-16T00:42:55.685Z | Event: message=Provisioning complete. File created: /var/lib/azure-init/4dc8e7f4-4f34-448c-bfd4-c0e2bba28629.provisioned

By adding libazureinit=INFO to the filter, we now:

  • Preserve span hierarchy and span metadata in telemetry logs.
  • Still filter out events unless explicitly listed in the default_kvp_filter while keeping INFO-level spans from libazureinit functions.

Additional Changes:

  • Added missing #[instrument] macro to functions throughout libazureinit with tracing! calls.
  • Manually renamed some spans to better describe their behavior (ex: get --> empty_http_get
  • Renamed the root span to match its function name and library name azure_init_provision.
  • Created short-lived azure_init_run spans for pre-provision telemetry to keep azure_init_provision as the root / top hierarchy span.
  • Updated both stderr and the azure-init.log file layers use FmtSpan::NEW | FmtSpan::CLOSE, which emits a log entry when a span is created (function entered) and a span is closed (function returns) and provides extra visibility into the call flow and timing.

Locking Functionality:

Concept:

Inside azure-init, we funnel all telemetry through a single UnboundedReceiver, so one task serializes every append and we never have in-process contention.

Currently, there is a gap once external crates or other tools start writing to the same KVP pool. The .append doesn't guarantee whole-record writes across processes, so two writers could still corrupt entries. Adding the advisory lock forces every cooperating writer to take a turn writing to the file, giving the ability for atomic writes even when multiple processes are active.

Changes:

  • All telemetry messages still flow through a single UnboundedReceiver, so only one task inside this process ever appends to the KVP file; batching is only about reducing lock churn, not fixing in-process contention.
  • A new write_kvps helper collects whatever is pending, takes the advisory lock once, writes the entire batch, flushes once, and releases the lock. Both the steady-state loop and the shutdown drain call into it.
  • The advisory lock defends against other concurrent writers interleaving records, keeping the file consistent even with append-only semantics.
  • If writing or flushing fails, it logs any unlock error for diagnostics but still propagate the original I/O failure, ensuring the file is never stranded in a locked state.

Related to #239

Proof of Changes

Section fromkvp_pool_1:

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|useradd|01cfff31-9205-4794-9bca-e47fc98e1c00Time: 2025-10-16T00:42:55.643Z | Event: message=Creating user with username: 'azureuser'

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|libazureinit_subprocess|f34a35d8-6876-4970-aeb6-77ff24372ad9Start: 2025-10-16T00:42:55.643Z | End: 2025-10-16T00:42:55.669Z

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|useradd|1fe5cf04-df55-499e-a0c9-e369384f1cf9Start: 2025-10-16T00:42:55.641Z | End: 2025-10-16T00:42:55.669Z

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|libazureinit_subprocess|ecdb4ca4-dacb-4cda-923d-7122bf34ab8aStart: 2025-10-16T00:42:55.670Z | End: 2025-10-16T00:42:55.676Z

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|lock_user|164cdd9c-60d5-4044-95ba-4d970de54822Time: 2025-10-16T00:42:55.676Z | Event: message=Locked account via passwd -l, username=azureuser

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|lock_user|65df7542-3b6c-4666-b585-f11786b1338aStart: 2025-10-16T00:42:55.670Z | End: 2025-10-16T00:42:55.676Z

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|passwd|7612e9d7-f8f0-48e6-8275-6499764fa97bStart: 2025-10-16T00:42:55.670Z | End: 2025-10-16T00:42:55.676Z

@peytonr18 peytonr18 force-pushed the probertson-clean-kvp branch from 24ac9ef to 9168733 Compare October 15, 2025 22:11
@peytonr18 peytonr18 force-pushed the probertson-clean-kvp branch from 9168733 to 21e48ca Compare October 15, 2025 22:15
…filter from filtering out INFO spans and therefore missing the current span context.
@peytonr18 peytonr18 force-pushed the probertson-clean-kvp branch 2 times, most recently from 31873ff to afda077 Compare October 16, 2025 00:37
@peytonr18 peytonr18 force-pushed the probertson-clean-kvp branch from afda077 to 8645526 Compare October 16, 2025 00:54
@peytonr18 peytonr18 marked this pull request as ready for review October 16, 2025 01:07
Copy link
Contributor

@cadejacobson cadejacobson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your changes look awesome! When reading the logs, I had a bit of a pause when seeing the word provision and thought it was from another location. This made me realize that there are two provision functions, which may get confusing in the future.

Also, great explanation in the PR description!!

@KsenijaS
Copy link

PR descriptions suggests that KVP filter is not working as intended, will there be any changes made to the KVP filter?

@peytonr18
Copy link
Contributor Author

PR descriptions suggests that KVP filter is not working as intended, will there be any changes made to the KVP filter?

Yes! Technically, the default_kvp_filter was working a little too well and filtering out too much. I added azure_init=INFO to the filter, which allows all INFO spans through. That was the missing piece - not allowing those spans through to the EmitKVPLayer.

Copy link
Contributor

@cadejacobson cadejacobson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks awesome! LGTM

)
}

#[instrument(name = "load_config", skip_all)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I second Cade's appreciation for the thorough explanation in the PR description.

The only thing that confuses me a bit is the naming scheme. Any reason why not to just have the names match the path to the functions? The surrounding span context should provide enough clues for more ambiguous function names when looking at a trace?

It indicates to me that we could do a better job naming functions...?

Copy link
Contributor Author

@peytonr18 peytonr18 Oct 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, the full path to the function doesn't show up in the spans, which was partially due to wanting to keep additional information as short as possible with the character limit. I renamed some of the spans because I felt like they were too vague to really be helpful without some added context.

ie: it shows get instead of http::get in kvp_pool_1. The azure-init logs give the full path though, so I know it's possible to do.

I didn't want to do too much in renaming functions but... I wouldn't be opposed to doing that instead of having to manually rename the spans! Or, I could try to get the full path as the span name instead of just the function name, whichever you think is best.

Copy link
Contributor

@anhvoms anhvoms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@peytonr18 peytonr18 force-pushed the probertson-clean-kvp branch 3 times, most recently from 1ea66e0 to 9362c2a Compare October 28, 2025 23:13
@peytonr18 peytonr18 force-pushed the probertson-clean-kvp branch from 4fd1f4f to a07652a Compare October 28, 2025 23:26
@cjp256
Copy link
Contributor

cjp256 commented Oct 29, 2025

Looks great, just one minor nit before merging.

…l data corruption from concurrent access. The lock is acquired before each

write operation and released after flushing.
// close the channel and drain remaining messages.
events.close();
while let Some(encoded_kvp) = events.recv().await {
if let Err(e) = FileExt::lock_exclusive(&file) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the duplicate pattern here I suspect there should be write_kvps() method that handles draining available kvps while only taking the lock and calling flush once

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

otherwise everything LGTM

@peytonr18 peytonr18 merged commit 80a939b into Azure:main Nov 13, 2025
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants