Audit KVP Tracing and Improve Telemetry #255

peytonr18 · 2025-10-15T22:06:09Z

This PR improves azure-init's tracing telemetry by ensuring proper span context is preserved in Hyper-V KVP diagnostics and refining span names for better observability. It also introduces a new locking mechanism in the event of concurrent KVP writing.

Background: Spans vs Events in azure-init

In our tracing architecture:

Spans represent units of work (typically functions) and create hierarchical context. They're created via the #[instrument] macro on functions or manually via tracing::info_span!().
Events are individual log messages emitted within spans using tracing::info!(), tracing::warn!(), etc.
When a span is active, all events emitted within it inherit the span's context, creating a hierarchical trace. This allows support engineers to understand the call stack and execution flow when diagnosing issues.

The Problem

Our KVP filter was inadvertently filtering out span context, causing telemetry to lose important hierarchical information:

// Before (in default_kvp_filter)
[
   "WARN",
   "azure_init=INFO",  // Only this
   "libazureinit::config::success",
   // ... specific targets
]

Key realization: Tracing filters apply to both spans AND events. The filter azure_init=INFO allowed INFO-level events from the azure_init crate, but libazureinit=INFO was missing, which caused:

Spans from libazureinit functions (created by #[instrument]) to be filtered out.
Loss of span context for events, even if those specific events matched their custom target (e.g., libazureinit::config::success). This resulted in events being associated with the root span rather than the span created for a given function via the #[instrument] macro.

Example:

azure-init-0.1.1|026a3880-22aa-4f11-a31a-81f6e0c67cbb|INFO|root|e3704361-cbb1-4118-b91a-621bd79c6d64,Time: 2025-10-13T21:11:35.423Z | Event: message=Provisioning complete. File created: /var/lib/azure-init/026a3880-22aa-4f11-a31a-81f6e0c67cbb.provisioned`

now, we see mark_provisioning_complete and it's span_id instead of root and it's id:

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|mark_provisioning_complete|f2a495bf-b1f2-402a-aa02-665b428e9039Time: 2025-10-16T00:42:55.685Z | Event: message=Provisioning complete. File created: /var/lib/azure-init/4dc8e7f4-4f34-448c-bfd4-c0e2bba28629.provisioned

By adding libazureinit=INFO to the filter, we now:

Preserve span hierarchy and span metadata in telemetry logs.
Still filter out events unless explicitly listed in the default_kvp_filter while keeping INFO-level spans from libazureinit functions.

Additional Changes:

Added missing #[instrument] macro to functions throughout libazureinit with tracing! calls.
Manually renamed some spans to better describe their behavior (ex: get --> empty_http_get
Renamed the root span to match its function name and library name azure_init_provision.
Created short-lived azure_init_run spans for pre-provision telemetry to keep azure_init_provision as the root / top hierarchy span.
Updated both stderr and the azure-init.log file layers use FmtSpan::NEW | FmtSpan::CLOSE, which emits a log entry when a span is created (function entered) and a span is closed (function returns) and provides extra visibility into the call flow and timing.

Locking Functionality:

Concept:

Inside azure-init, we funnel all telemetry through a single UnboundedReceiver, so one task serializes every append and we never have in-process contention.

Currently, there is a gap once external crates or other tools start writing to the same KVP pool. The .append doesn't guarantee whole-record writes across processes, so two writers could still corrupt entries. Adding the advisory lock forces every cooperating writer to take a turn writing to the file, giving the ability for atomic writes even when multiple processes are active.

Changes:

All telemetry messages still flow through a single UnboundedReceiver, so only one task inside this process ever appends to the KVP file; batching is only about reducing lock churn, not fixing in-process contention.
A new write_kvps helper collects whatever is pending, takes the advisory lock once, writes the entire batch, flushes once, and releases the lock. Both the steady-state loop and the shutdown drain call into it.
The advisory lock defends against other concurrent writers interleaving records, keeping the file consistent even with append-only semantics.
If writing or flushing fails, it logs any unlock error for diagnostics but still propagate the original I/O failure, ensuring the file is never stranded in a locked state.

Related to #239

Proof of Changes

Section fromkvp_pool_1:

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|useradd|01cfff31-9205-4794-9bca-e47fc98e1c00Time: 2025-10-16T00:42:55.643Z | Event: message=Creating user with username: 'azureuser'

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|libazureinit_subprocess|f34a35d8-6876-4970-aeb6-77ff24372ad9Start: 2025-10-16T00:42:55.643Z | End: 2025-10-16T00:42:55.669Z

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|useradd|1fe5cf04-df55-499e-a0c9-e369384f1cf9Start: 2025-10-16T00:42:55.641Z | End: 2025-10-16T00:42:55.669Z

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|libazureinit_subprocess|ecdb4ca4-dacb-4cda-923d-7122bf34ab8aStart: 2025-10-16T00:42:55.670Z | End: 2025-10-16T00:42:55.676Z

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|lock_user|164cdd9c-60d5-4044-95ba-4d970de54822Time: 2025-10-16T00:42:55.676Z | Event: message=Locked account via passwd -l, username=azureuser

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|lock_user|65df7542-3b6c-4666-b585-f11786b1338aStart: 2025-10-16T00:42:55.670Z | End: 2025-10-16T00:42:55.676Z

azure-init-0.1.1|4dc8e7f4-4f34-448c-bfd4-c0e2bba28629|INFO|passwd|7612e9d7-f8f0-48e6-8275-6499764fa97bStart: 2025-10-16T00:42:55.670Z | End: 2025-10-16T00:42:55.676Z

… in the telemetry and not just root

…filter from filtering out INFO spans and therefore missing the current span context.

…ected by adding a minimal span to main

cadejacobson

Your changes look awesome! When reading the logs, I had a bit of a pause when seeing the word provision and thought it was from another location. This made me realize that there are two provision functions, which may get confusing in the future.

Also, great explanation in the PR description!!

src/main.rs

…inguishable

KsenijaS · 2025-10-21T22:13:53Z

PR descriptions suggests that KVP filter is not working as intended, will there be any changes made to the KVP filter?

peytonr18 · 2025-10-21T22:28:10Z

PR descriptions suggests that KVP filter is not working as intended, will there be any changes made to the KVP filter?

Yes! Technically, the default_kvp_filter was working a little too well and filtering out too much. I added azure_init=INFO to the filter, which allows all INFO spans through. That was the missing piece - not allowing those spans through to the EmitKVPLayer.

cadejacobson

Looks awesome! LGTM

cjp256 · 2025-10-22T14:37:33Z

libazureinit/src/config.rs

        )
    }

+    #[instrument(name = "load_config", skip_all)]


I second Cade's appreciation for the thorough explanation in the PR description.

The only thing that confuses me a bit is the naming scheme. Any reason why not to just have the names match the path to the functions? The surrounding span context should provide enough clues for more ambiguous function names when looking at a trace?

It indicates to me that we could do a better job naming functions...?

Currently, the full path to the function doesn't show up in the spans, which was partially due to wanting to keep additional information as short as possible with the character limit. I renamed some of the spans because I felt like they were too vague to really be helpful without some added context.

ie: it shows get instead of http::get in kvp_pool_1. The azure-init logs give the full path though, so I know it's possible to do.

I didn't want to do too much in renaming functions but... I wouldn't be opposed to doing that instead of having to manually rename the spans! Or, I could try to get the full path as the span name instead of just the function name, whichever you think is best.

anhvoms

LGTM!

…telemetry

libazureinit/src/http.rs

cjp256 · 2025-10-29T12:39:23Z

Looks great, just one minor nit before merging.

…l data corruption from concurrent access. The lock is acquired before each write operation and released after flushing.

cjp256 · 2025-11-04T21:55:32Z

libazureinit/src/kvp.rs

                    // close the channel and drain remaining messages.
                    events.close();
                    while let Some(encoded_kvp) = events.recv().await {
+                        if let Err(e) = FileExt::lock_exclusive(&file) {


given the duplicate pattern here I suspect there should be write_kvps() method that handles draining available kvps while only taking the lock and calling flush once

otherwise everything LGTM

…d ensure we always unlock even after write failures.

peytonr18 force-pushed the probertson-clean-kvp branch from 24ac9ef to 9168733 Compare October 15, 2025 22:11

Telemetry audit with the goal of ensuring function names are included…

21e48ca

… in the telemetry and not just root

peytonr18 force-pushed the probertson-clean-kvp branch from 9168733 to 21e48ca Compare October 15, 2025 22:15

peytonr18 added 2 commits October 15, 2025 16:15

Testing filters for events vs. spans; adjusting logic to prevent KVP …

0dedd83

…filter from filtering out INFO spans and therefore missing the current span context.

Rename spans to better describe functionality

54aeb61

peytonr18 force-pushed the probertson-clean-kvp branch 2 times, most recently from 31873ff to afda077 Compare October 16, 2025 00:37

Renaming subprocess span and ensuring pre-provision telemetry is coll…

8645526

…ected by adding a minimal span to main

peytonr18 force-pushed the probertson-clean-kvp branch from afda077 to 8645526 Compare October 16, 2025 00:54

peytonr18 marked this pull request as ready for review October 16, 2025 01:07

cadejacobson approved these changes Oct 16, 2025

View reviewed changes

src/main.rs Show resolved Hide resolved

peytonr18 and others added 2 commits October 21, 2025 14:21

Update main.rs provision span to azure_init_provision to make it dist…

5da7803

…inguishable

Merge branch 'main' into probertson-clean-kvp

e4e8615

KsenijaS approved these changes Oct 21, 2025

View reviewed changes

cadejacobson approved these changes Oct 22, 2025

View reviewed changes

cjp256 reviewed Oct 22, 2025

View reviewed changes

anhvoms approved these changes Oct 22, 2025

View reviewed changes

peytonr18 force-pushed the probertson-clean-kvp branch 3 times, most recently from 1ea66e0 to 9362c2a Compare October 28, 2025 23:13

refactor: standardize span naming with module:function format in KVP …

a07652a

…telemetry

peytonr18 force-pushed the probertson-clean-kvp branch from 4fd1f4f to a07652a Compare October 28, 2025 23:26

Merge branch 'main' into probertson-clean-kvp

fc75c8b

cjp256 reviewed Oct 29, 2025

View reviewed changes

libazureinit/src/http.rs Outdated Show resolved Hide resolved

peytonr18 added 2 commits October 29, 2025 15:34

Revert http_request span name

2615e0f

Add exclusive file locking around KVP file writes to prevent potentia…

411c86a

…l data corruption from concurrent access. The lock is acquired before each write operation and released after flushing.

cjp256 reviewed Nov 4, 2025

View reviewed changes

peytonr18 and others added 2 commits November 5, 2025 13:11

Batch KVP writes under one lock, reuse the helper during shutdown, an…

8679aea

…d ensure we always unlock even after write failures.

Merge branch 'main' into probertson-clean-kvp

1c0154a

peytonr18 requested a review from cadejacobson November 5, 2025 22:23

cadejacobson approved these changes Nov 6, 2025

View reviewed changes

peytonr18 merged commit 80a939b into Azure:main Nov 13, 2025
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Audit KVP Tracing and Improve Telemetry #255

Audit KVP Tracing and Improve Telemetry #255

Uh oh!

peytonr18 commented Oct 15, 2025 •

edited

Loading

Uh oh!

cadejacobson left a comment •

edited

Loading

Uh oh!

Uh oh!

KsenijaS commented Oct 21, 2025

Uh oh!

peytonr18 commented Oct 21, 2025

Uh oh!

cadejacobson left a comment

Uh oh!

cjp256 Oct 22, 2025

Uh oh!

peytonr18 Oct 22, 2025 •

edited

Loading

Uh oh!

anhvoms left a comment

Uh oh!

Uh oh!

cjp256 commented Oct 29, 2025

Uh oh!

cjp256 Nov 4, 2025

Uh oh!

cjp256 Nov 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Audit KVP Tracing and Improve Telemetry #255

Audit KVP Tracing and Improve Telemetry #255

Uh oh!

Conversation

peytonr18 commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Background: Spans vs Events in azure-init

The Problem

Additional Changes:

Locking Functionality:

Concept:

Changes:

Proof of Changes

Uh oh!

cadejacobson left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

KsenijaS commented Oct 21, 2025

Uh oh!

peytonr18 commented Oct 21, 2025

Uh oh!

cadejacobson left a comment

Choose a reason for hiding this comment

Uh oh!

cjp256 Oct 22, 2025

Choose a reason for hiding this comment

Uh oh!

peytonr18 Oct 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

anhvoms left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

cjp256 commented Oct 29, 2025

Uh oh!

cjp256 Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

cjp256 Nov 4, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

peytonr18 commented Oct 15, 2025 •

edited

Loading

cadejacobson left a comment •

edited

Loading

peytonr18 Oct 22, 2025 •

edited

Loading