-
Notifications
You must be signed in to change notification settings - Fork 15
Audit KVP Tracing and Improve Telemetry #255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
24ac9ef to
9168733
Compare
… in the telemetry and not just root
9168733 to
21e48ca
Compare
…filter from filtering out INFO spans and therefore missing the current span context.
31873ff to
afda077
Compare
…ected by adding a minimal span to main
afda077 to
8645526
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your changes look awesome! When reading the logs, I had a bit of a pause when seeing the word provision and thought it was from another location. This made me realize that there are two provision functions, which may get confusing in the future.
Also, great explanation in the PR description!!
|
PR descriptions suggests that KVP filter is not working as intended, will there be any changes made to the KVP filter? |
Yes! Technically, the |
cadejacobson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks awesome! LGTM
libazureinit/src/config.rs
Outdated
| ) | ||
| } | ||
|
|
||
| #[instrument(name = "load_config", skip_all)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second Cade's appreciation for the thorough explanation in the PR description.
The only thing that confuses me a bit is the naming scheme. Any reason why not to just have the names match the path to the functions? The surrounding span context should provide enough clues for more ambiguous function names when looking at a trace?
It indicates to me that we could do a better job naming functions...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, the full path to the function doesn't show up in the spans, which was partially due to wanting to keep additional information as short as possible with the character limit. I renamed some of the spans because I felt like they were too vague to really be helpful without some added context.
ie: it shows get instead of http::get in kvp_pool_1. The azure-init logs give the full path though, so I know it's possible to do.
I didn't want to do too much in renaming functions but... I wouldn't be opposed to doing that instead of having to manually rename the spans! Or, I could try to get the full path as the span name instead of just the function name, whichever you think is best.
anhvoms
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
1ea66e0 to
9362c2a
Compare
4fd1f4f to
a07652a
Compare
|
Looks great, just one minor nit before merging. |
…l data corruption from concurrent access. The lock is acquired before each write operation and released after flushing.
libazureinit/src/kvp.rs
Outdated
| // close the channel and drain remaining messages. | ||
| events.close(); | ||
| while let Some(encoded_kvp) = events.recv().await { | ||
| if let Err(e) = FileExt::lock_exclusive(&file) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given the duplicate pattern here I suspect there should be write_kvps() method that handles draining available kvps while only taking the lock and calling flush once
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
otherwise everything LGTM
…d ensure we always unlock even after write failures.
This PR improves azure-init's tracing telemetry by ensuring proper span context is preserved in Hyper-V KVP diagnostics and refining span names for better observability. It also introduces a new locking mechanism in the event of concurrent KVP writing.
Background: Spans vs Events in azure-init
In our tracing architecture:
#[instrument]macro on functions or manually viatracing::info_span!().tracing::info!(),tracing::warn!(), etc.When a span is active, all events emitted within it inherit the span's context, creating a hierarchical trace. This allows support engineers to understand the call stack and execution flow when diagnosing issues.
The Problem
Our KVP filter was inadvertently filtering out span context, causing telemetry to lose important hierarchical information:
Key realization: Tracing filters apply to both spans AND events. The filter
azure_init=INFOallowed INFO-level events from theazure_initcrate, butlibazureinit=INFOwas missing, which caused:#[instrument]) to be filtered out.libazureinit::config::success). This resulted in events being associated with therootspan rather than the span created for a given function via the#[instrument]macro.Example:
now, we see
mark_provisioning_completeand it's span_id instead ofrootand it's id:By adding
libazureinit=INFOto the filter, we now:default_kvp_filterwhile keeping INFO-level spans from libazureinit functions.Additional Changes:
#[instrument]macro to functions throughoutlibazureinitwithtracing!calls.get-->empty_http_getrootspan to match its function name and library nameazure_init_provision.azure_init_runspans for pre-provision telemetry to keepazure_init_provisionas the root / top hierarchy span.stderrand theazure-init.logfile layers useFmtSpan::NEW | FmtSpan::CLOSE, which emits a log entry when a span is created (function entered) and a span is closed (function returns) and provides extra visibility into the call flow and timing.Locking Functionality:
Concept:
Inside
azure-init, we funnel all telemetry through a singleUnboundedReceiver, so one task serializes every append and we never have in-process contention.Currently, there is a gap once external crates or other tools start writing to the same KVP pool. The
.appenddoesn't guarantee whole-record writes across processes, so two writers could still corrupt entries. Adding the advisory lock forces every cooperating writer to take a turn writing to the file, giving the ability for atomic writes even when multiple processes are active.Changes:
UnboundedReceiver, so only one task inside this process ever appends to the KVP file; batching is only about reducing lock churn, not fixing in-process contention.write_kvpshelper collects whatever is pending, takes the advisory lock once, writes the entire batch, flushes once, and releases the lock. Both the steady-state loop and the shutdown drain call into it.Related to #239
Proof of Changes
Section from
kvp_pool_1: