OTEP: Process Context: Sharing Resource Attributes with External Readers #4719
Conversation
This OTEP introduces a standard mechanism for OpenTelemetry SDKs to publish process-level resource attributes for access by out-of-process readers such as the OpenTelemetry eBPF Profiler.

External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. We propose a mechanism for OpenTelemetry SDKs to publish process-level resource attributes, through a standard format based on Linux anonymous memory mappings. When an SDK initializes (or updates its resource attributes) it publishes this information to a small, fixed-size memory region that external processes can discover and read. The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from a given process.

_I'm opening this PR as a draft with the intention of sharing with the Profiling SIG for an extra round of feedback before asking for a wider review._

_This OTEP is based on [Sharing Process-Level Resource Attributes with the OpenTelemetry eBPF Profiler](https://docs.google.com/document/d/1-4jo29vWBZZ0nKKAOG13uAQjRcARwmRc4P313LTbPOE/edit?tab=t.0), big thanks to everyone that provided feedback and helped refine the idea so far._
Marking as ready for review!
So this would be a new requirement for eBPF profiler implementations? My issue is the lack of safe support for Erlang/Elixir to do this, while something that could just be accessed as a file or socket wouldn't have that issue. We'd have to pull in (or implement ourselves) a third-party library that is a NIF to make these calls, and that brings in instability many would rather not have, when the goal of our SDK is to not be able to bring down a user's program if the SDK crashes -- unless they specifically configure it to do so.
No, a hard requirement should not be the goal: for starters, this is Linux-only (for now), so right out of the gate this means it's not going to be available everywhere. Having this discussion is exactly why it was included as one of the open questions in the doc 👍 Our view is that we should go for recommended to implement and recommended to enable by default. In languages/runtimes where it's easy to do so (Go, Rust, Java 22+, possibly Ruby, ...etc?) we should be able to deliver this experience. For others, such as Erlang/Elixir or Java 8-21 (which requires a native library, similar to Erlang/Elixir), the goal would be to make it very easy to enable/use for users that want it, but still optional so as to not impact anyone that is not interested. We should probably record the above guidance in the OTEP, if/once we're happy with it 🤔
cc @open-telemetry/specs-entities-approvers for extra eyes
This PR was marked stale due to lack of activity. It will be closed in 7 days.
oteps/profiles/4719-process-ctx.md
Outdated
> External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. This creates several problems:
>
> - **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate profiles with other telemetry (such as traces and spans!) from the same service instance (especially in runtimes that employ multiple processes).
Suggested change:

> - **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate various signals with each other.
What do you think about keeping the comment about the runtimes with multiple processes? I think that's one good use-case where it's especially hard to map what multiple pids seen from the outside actually are.
I've tweaked the description here in b1583c6
oteps/profiles/4719-process-ctx.md
Outdated
> | Field | Type | Description |
> |-------------------|-----------|----------------------------------------------------------------------|
> | `signature` | `char[8]` | Set to `"OTEL_CTX"` when the payload is ready (written last) |
> | `version` | `uint32` | Format version. Currently `2` (`1` was used for development) |
Development versions should not matter at this point as this OTEP is the point of introduction. All previous work is just for experimentation.
Suggested change:

> | `version` | `uint32` | Format version. Currently `1`. |
Starting at 2 would make it really easy to distinguish from the earlier experiments that we deployed in a lot of spots already...
Since there's space for uint32 different versions, do you see starting at 2 as a big blocker? (I can still remove the comment explaining what 1 was, I agree it's TMI)
Starting at 2 is not a blocker to me. It just feels strange that this OTel protocol starts at 2.
Yeah, it's slightly annoying that in most cases v0 is the development one, but in this case we are reserving 0 for "not filled in yet", which is why 1 ended up being the development version.
> ### Publication Protocol
>
> Publishing the context should follow these steps:
As context sharing also provides an opportunity for others: what is the idea for OSes other than Linux (or, more generally, OSes that don't have an mmap syscall)?
For Windows, we've experimented at Datadog with using an in-memory file. For macOS it's a bit more nebulous: we can still use mmap, and maybe combine it with mach_vm_region to discover the region?
While this mechanism can be extended to other OSes in the future, our thinking so far was that since the eBPF profiler is Linux-only, the main focus should be on getting Linux support in really amazing shape and then later extending as needed.
oteps/profiles/4719-process-ctx.md
Outdated
> 8. **Set read-only**: Apply `mprotect(..., PROT_READ)` to mark the mapping as read-only
> 9. **Name mapping** (Linux ≥5.17): Use `prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ..., "OTEL_CTX")` to name the mapping
>
> The signature MUST be written last to ensure readers never observe incomplete or invalid data. Once the signature is present and the mapping set to read-only, the entire mapping is considered valid and immutable.
Would it simplify the publication protocol to require the writer to set published_at_ns to a time in the future, when writing the data is guaranteed to be finished?
I don't think so. In theory a "malicious"/buggy/overloaded scheduler could always schedule out the thread after writing the timestamp and before it finished the rest of the steps...
One really nice property is that the pages are zeroed out by the kernel, so it shouldn't be possible to observe anything other than zeroes or valid data.
Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>
oteps/profiles/4719-process-ctx.md
Outdated
> When an SDK initializes (or updates its resource attributes) it publishes this information to a small, fixed-size memory region that external processes can discover and read.
>
> The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from a given process.
Could you please describe how it would/could/(or won't) work when an application is instrumented with OBI (https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation)?
Thanks for this question!
I researched this and my conclusion is that right now this won't work with OBI.
From what I'm seeing, while it's possible for eBPF programs to write into userspace using bpf_probe_write_user (and this is already used by OBI to support Go tracing), I don't see a way to do the other things listed in the publication protocol, such as allocating (small amounts of) memory, or invoking system calls to set up the naming and the inheritance permissions.
That said, I don't think this would necessarily be a blocker for OBI-to-OTEL eBPF Profiler communication, since we could introduce a specific out-of-band channel between them using the existing kernel eBPF primitives; but given the current limitations of eBPF I don't think we can get OBI to implement this specification on behalf of an instrumented application.
Can you please document it in the OTEP?
Added in 9c8d9ed
Following discussion so far, we can probably avoid having our home-grown `OtelProcessCtx` and instead use the common OTEL `Resource` message.
This PR adds an experimental C/C++ implementation for the "Process Context" OTEP being proposed in open-telemetry/opentelemetry-specification#4719.

This implementation previously lived in https://github.com/ivoanjo/proc-level-demo/tree/main/anonmapping-clib and as discussed during the OTEL profiling SIG meeting we want to add it to this repository so it becomes easier to find and contribute to.

I've made sure to include a README explaining how to use it. Here's the ultra-quick start (Linux-only):

```bash
$ ./build.sh
$ ./build/example_ctx --keep-running
Published: service=my-service, instance=123d8444-2c7e-46e3-89f6-6217880f7123, env=prod, version=4.5.6, sdk=example_ctx.c/c/1.2.3, resources=resource.key1=resource.value1,resource.key2=resource.value2
Continuing forever, to exit press ctrl+c...
TIP: You can now `sudo ./otel_process_ctx_dump.sh 267023` to see the context

# In another shell
$ sudo ./otel_process_ctx_dump.sh 267023 # Update this to match the PID from above
Found OTEL context for PID 267023
Start address: 756f28ce1000
00000000  4f 54 45 4c 5f 43 54 58 02 00 00 00 0b 68 55 47  |OTEL_CTX.....hUG|
00000010  70 24 7d 18 50 01 00 00 a0 82 6d 7e 6a 5f 00 00  |p$}.P.....m~j_..|
00000020
Parsed struct:
  otel_process_ctx_signature       : "OTEL_CTX"
  otel_process_ctx_version         : 2
  otel_process_ctx_published_at_ns : 1764606693650819083 (2025-12-01 16:31:33 GMT)
  otel_process_payload_size        : 336
  otel_process_payload             : 0x00005f6a7e6d82a0

Payload dump (336 bytes):
00000000  0a 25 0a 1b 64 65 70 6c 6f 79 6d 65 6e 74 2e 65  |.%..deployment.e|
00000010  6e 76 69 72 6f 6e 6d 65 6e 74 2e 6e 61 6d 65 12  |nvironment.name.|
...

Protobuf decode:
attributes {
  key: "deployment.environment.name"
  value { string_value: "prod" }
}
attributes {
  key: "service.instance.id"
  value { string_value: "123d8444-2c7e-46e3-89f6-6217880f7123" }
}
attributes {
  key: "service.name"
  value { string_value: "my-service" }
}
...
```

Note that because the upstream OTEP is still under discussion, this implementation is experimental and may need changes to match up with the final version of the OTEP.
As pointed out during review, these don't necessarily exist for some resources so let's streamline the spec for now.
> option go_package = "go.opentelemetry.io/proto/otlp/resource/v1";
>
> // Resource information.
> message Resource {
Sorry for the late question - but this just popped into my mind:
What is the idea, going forward, of using `message Resource` for sharing thread state information or more process internals?
Iirc this approach should also be used later on to provide more information about process internals. But `Resource.attributes` only holds information covered by OTel Semantic Conventions.
What is the idea of going forward using message Resource for sharing thread state information or more process internals?
I suspect protobuf will be a bit too heavy/awkward for the thread state payload format BUT my thinking is that anything we put there should otherwise map to/from attributes.
But Resource.attributes only holds information covered by OTel Semantic Convention.
Actually I don't think that's the case? I've seen a lot of prior art for custom attributes, so anything we don't think should end up in semantic conventions could stay as a custom attribute. I think? 👀
On a second pass, inspired by https://opentelemetry.io/docs/concepts/resources/#custom-resources I've added a note about custom attributes in 17ec933
> - **Inconsistent resource attributes across signals**: Because they run in different scopes, configuration such as `service.name`, `deployment.environment.name`, and `service.version` is not always available, or does not resolve consistently, between the OpenTelemetry SDKs and external readers, leading to configuration drift and inconsistent tagging.
>
> - **Correlation is dependent on process activity**: If a service is blocked (such as when doing slow I/O, or when threads are actually deadlocked) and not emitting other signals, external readers have difficulty identifying it, since resource attributes or identifiers are only sent along when signals are reported.
Is this relevant to the issue we're trying to solve with this OTEP meaning isn't this problem still going to exist with eBPF profiler even if we adopt the proposed mechanism? Maybe add a clarification that for eBPF profiler this behavior is unaffected by the proposed mechanism?
(I don't think we should remove it as it's contextual information but as it's currently listed in Motivation there's room for misunderstanding)
If there's something else you had in mind re: different external reader, feel free to clarify.
The thinking behind this point is two-fold:
- **off-cpu/wall-time profiling**

  My thinking is that since the OTEL eBPF profiler already supports off-cpu profiles, for such samples we would add support for including the process context as well.

  +1 that indeed "can read even when there's no activity" would not impact CPU profiling, since CPU profiling is only concerned with activity.

  If, in the future, wall-time profiling (e.g. a combination of on-cpu and off-cpu) was added to the OTEL eBPF profiler, that would be another use-case for this mechanism.

- **non-reliance on mechanisms that require activity from the application**

  If we were to try to solve the process context problem by having the application call something from time to time (or once/a few times, after handshaking with the reader), such a solution would be fragile in the presence of applications that are blocked/stuck, if the application for some reason stops performing those calls.

  The current solution is not affected by this since the process context setup is intended to be performed once at application start, in a fire-and-forget way, independently of what the reader is doing.
oteps/profiles/4719-process-ctx.md
Outdated
> Publishing the context should follow these steps:
>
> 1. **Drop existing mapping**: If a previous context was published, unmap/free it
Do we need to drop the existing mapping? If we keep it fixed, the reader may cache the address for the target process which simplifies checking if the data has been updated (no overhead of re-parsing mappings, this can also help with higher-frequency updates).
Since the payload pointer can point to anywhere in target process memory, we'll never be limited by the two pages fixed mapping size (meaning we don't need to grow this mapping to span more pages either during process runtime or in the future).
Do we need to drop the existing mapping?
Not strictly?
For the existing approach, it's possible to avoid polling mappings to figure out the address by:

- Checking that `published_at_ns` can be read and hasn't changed, and/or
- Hooking on prctl calls

Reusing the mapping instead of dropping it does not conflict with the above approaches, but... I think it would complicate concurrency control on the reader. That is, having this invariant allows the reader to know that while the mapping is up, the payload is valid and consistent as far as the writer is concerned.
If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).
But the point I'm making is more general: The current update protocol mentions that the "previous mapping should be removed" before publishing new ones. If we assume that most implementors abide by this, then the overhead of parsing mappings will be there. For a reader like eBPF profiler that may have to manage hundreds of processes as a worst case, that overhead of constantly hitting /proc could be significant.
Can we examine the concurrency control edge cases in more detail? It should be possible to provide the same guarantees as now while keeping a fixed mapping.
We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)
If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).
I think it is! Let me try to convince you ;)
After a context gets dropped one of two things happens:
a) The mapping becomes invalid. This would make reads return an error, which would be a clear indication of not valid.
b) A new mapping (otel or not) gets put in its place. Reads to the old location of published_at_ns would return whatever's there now. Note that this would not be published_at_ns, because the kernel zeroes out memory before mapping it (e.g. this is not regular malloc/free) and thus I don't think it's possible for leftover garbage to exist to confuse the reader. (Edit: And thus the reader will know what it read is not valid)
For a reader like eBPF profiler that may have to manage hundreds of processes as a worst case, that overhead of constantly hitting /proc could be significant.
The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking prctl.
(The reader may even choose to do time-based caching, e.g. read the context and reuse it for the next N seconds/minutes, rather than trying to always have the latest up-to-date version if it wants to even save more reads)
We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)
To be clear, I believe prctl is not needed at all to be able to follow invalidation of existing contexts/creation of new ones, it's a fully optional possibility.
We could even completely omit references to hooking on prctl in the current spec -- but I think it's an interesting feature to document in the spec for readers that want to use it.
> The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking `prctl`.

True, that speeds up detection, but the overhead of parsing mappings before fetching an update is still there. It also makes for a more complicated update protocol, maybe limiting update operation frequency.
Advantages for keeping the mapping fixed:

- Simpler publisher logic
- Simpler reader logic
- Minimal (non-existent after the mapping is first detected in the reader) `/proc` access and processing overhead
- Scales to thousands of processes
- Scales to higher-frequency updates, minimizing the possibility of stale data

Can we clarify the disadvantages?
> Yeah I agree this or something like this can be made to work. (I didn't quite get the reference to `mprotect`? Do you mean just "I'm omitting the mprotect parts?")

Yeah, to keep the focus on the lock-free data exchange part.
I got some numbers to ground my assumptions in actuality, and yes, it does seem like the extra condition will make a difference (so flipping RO status or a different way to cut down on the number of mappings would be needed).

We did experiment at the beginning with making the permissions no-read, no-write, execute, which is a really odd combination (and thus very, very rare) and almost got away with it, but discovered that there are two paths for reading memory from another process in the Linux kernel, and `process_vm_readv` actually goes through the path that respects page permissions, which made this approach a bit more awkward.

Another option that avoids mprotect can use MAP_FIXED and an address-generating scheme based on a deterministic pattern. We have terabytes of mostly unused address space to play with.
I've been staring at my notebook for a while and maybe have an idea for making "reusing the mapping" work.
Considering the current "publication protocol" in the spec:

> Publishing the context should follow these steps:
>
> 1. **Drop existing mapping**: If a previous context was published, unmap/free it
> 2. **Allocate new mapping**: Create a 2-page anonymous mapping via `mmap()` (These pages are always zeroed by Linux)
> 3. **Prevent fork inheritance**: Apply `madvise(..., MADV_DONTFORK)` to prevent child processes from inheriting stale data
> 4. **Encode payload**: Serialize the payload message using protobuf (storing it either following the header OR in a separate memory allocation)
> 5. **Write header fields**: Populate `version`, `published_at_ns`, `payload_size`, `payload`
> 6. **Memory barrier**: Use language/compiler-specific techniques to ensure all previous writes complete before proceeding
> 7. **Write signature**: Write `OTEL_CTX` to the signature field last
> 8. **Set read-only**: Apply `mprotect(..., PROT_READ)` to mark the mapping as read-only
> 9. **Name mapping**: Use `prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ..., "OTEL_CTX")` to name the mapping. This step should be done unconditionally, although naming mappings is not always supported by the kernel.
We can document the update protocol as not being "destroy old mapping + create new one", as you suggested, and instead doing an in-place update as described below.
To support it, we could go through the publishing steps in reverse. That is, a process that wants to update its mappings goes like this:
i. Undo 9 -- removes name
ii. Undo 8 -- sets memory back to R/W
iii. Undo 7 -- zeroes signature
iv. Barrier, like in 6
v. Undo 5 -- zero fields
vi. Barrier
vii. Then start again from 4 as if this was a new mapping
I believe in this case the reading protocol would not need to change, since it already says:

> 1. **Validate signature and version**:
>    - Read the header and verify the first 8 bytes match `OTEL_CTX`
>    - Read the version field and verify it is supported (currently `2`)
>    - If either check fails, skip this mapping
> 2. **Read payload**: Read `payload_size` bytes starting after the header
> 3. **Re-read header**: If the header has not changed, the read of header + payload is consistent. This ensures there were no concurrent changes to the process context. If the header changed, restart at 1.
A reader that observes the original mapping or the fully updated one will work as expected.
A reader that is trying to locate the mapping will find an invalid mapping because some of the fields of the mapping will be zero, and it's not valid for them to be zero. So it'll skip the mapping.
A reader that already located the mapping and is polling for updates:

- If it observes step i, it reads the old content correctly
- If it observes step ii, it reads the old content correctly
- If it observes step iii/iv, it will find the header to be invalid (zeroed signature) and will know an update is ongoing/the context is not valid
- If it observes step v/vi, the signature is still invalid
- As the usual publish protocol goes down, only after step 7 will the mapping be valid again

Furthermore, we already state that the header gets read twice, and this would make sure that if the reader reads the full old header, and the new update starts while it's reading the payload, then when the reader looks at the header again it'll see that it's not the same as the old one (fields are either a different value or zero).

(This approach supports the payload-after-header layout because the payload only starts being modified after zeroing out the header, and thus the reader can tell, when it re-reads the header, that it's no longer valid.)
Thoughts?
Isn't the counter example a lot simpler for publish/update/read as:
- We don't need to modify the signature (no zero out and write again cost)
- We don't need to modify the name (no remove and write again cost)
- We don't need to zero out the fields (timestamp)
- We only need to read the counter again (on the publisher side: very cheap 64bit write operation) to establish if fetching the update was complete / not interrupted
Also if memfd_create is viable, we could (in addition to the above) get rid of the mprotect operations. This would give us a protocol that we could also leverage / build upon for (possibly higher-frequency) thread context updates.
> Isn't the counter example a lot simpler for publish/update/read as:
>
> - We don't need to modify the signature (no zero out and write again cost)
> - We don't need to modify the name (no remove and write again cost)
> - We don't need to zero out the fields (timestamp)
> - We only need to read the counter again (on the publisher side: very cheap 64bit write operation) to establish if fetching the update was complete / not interrupted
Compared to the counter approach, the "do a few things backwards approach":
- Allows us to end up with the expected mprotect flags at the end, meaning finding the context is still as cheap as in the current OTEP doc
- It still allows hooking on prctl to detect updates
- It does not require changes to readers -- the existing spec covers these kind of operations too
For these reasons, I think the "do a few things backwards approach" fits a bit better, but happy to discuss/flesh it out if you're not convinced.
> Also if `memfd_create` is viable, we could (in addition to the above) get rid of the `mprotect` operations. This would give us a protocol that we could also leverage / build upon for (possibly higher-frequency) thread context updates.

I'll comment on memfd as an alternative separately; there's a different set of trade-offs for that one.
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
…n Linux 5.17+ See open-telemetry/sig-profiling#23 for a wider discussion of this.
This PR adds an experimental C/C++ implementation for the "Process Context" OTEP being proposed in open-telemetry/opentelemetry-specification#4719 (full PR description quoted earlier in this thread).
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
felixge left a comment
Left a comment, but overall LGTM. Happy to approve once the open discussion threads have been resolved.
oteps/profiles/4719-process-ctx.md
Outdated
5. **Write header fields**: Populate `version`, `published_at_ns`, `payload_size`, `payload`
6. **Memory barrier**: Use language/compiler-specific techniques to ensure all previous writes complete before proceeding
7. **Write signature**: Write `OTEL_CTX` to the signature field last
8. **Set read-only**: Apply `mprotect(..., PROT_READ)` to mark the mapping as read-only
Why does the signature need to be written after the memory barrier? Shouldn't the transition to PROT_READ status be atomic? If that's guaranteed to be ordered after all writes to the map, we should be good?
In practice, it should be as you say. In theory... the mprotect is done only on the mapping, not on the payload, so a given language's memory model might be ambiguous on "if you observe the read-only mapping, will the payload be there too" and thus it seems worth it to slightly over-specify it here?
Not an extremely strong reason, I agree -- we could probably simplify this if needed.
In light of our discussion here, I ran some more experiments. Listing possible fallbacks/alternatives to read-only mapping for when
Example code for the latter (may need more investigation, but it looks promising and should be widely available?). EDIT: Using
👋 So funny thing you mention memfd 😀. My colleagues at Datadog actually have previously built something close to what this OTEP proposes, using memfd, although in a slightly different way than your gist (of note, not using mmap together with memfd). The main reasons why we moved away from it for this OTEP were:
I think both approaches are quite similar, especially when involving mmap. (E.g. I suspect we could just as well mmap the region into the reader with the current approach in the OTEP. I do think we'd need some kind of cleanup mechanism to detect when the owner of a mapping has gone away, as otherwise I suspect the reader will keep the mapping alive?). So in a way, it's more a question of which combination of building blocks we want to use (or mix) 🤔:
We at Datadog spent some time exploring the solution space, and tried to come up with a combination of the above that seemed reasonable, given the constraints (and tried to document that as well). But yeah, I won't say it's not possible to do any of the above in a slightly different way, especially given most options have trade-offs and there's not been a very clear above-the-rest winner on most points. 😅
I think we can come up with something that's flexible but also remains simple for the simple use-cases. The main advantage of To support wakeups instead of polling, without hooking I think that the scheme we end up with in OTel should at least meet the following three criteria:
Based on all the options we laid out, I think that's doable. Optionally (we probably need to expand the scope to thread context to figure out requirements / pick through the following):
[1] Regarding forking, there's a race condition between
[2] If we allow
Update for extra context: @ivoanjo and I had a Zoom sync today where we talked about simplifying the current proposal by:
For future discussion:
… find mappings

After discussion in the PR and great suggestions/experiments from @christos68k, the specification has been updated as such:

* Instead of always using an anonymous mapping, first try to create a memfd and create a mapping from the memfd. If memfd is not available due to security restrictions, fall back to an anonymous mapping instead.
* Remove probing as a fallback for when naming a mapping fails. Because the name of a memfd also shows up in `/proc/<pid>/maps`, we expect that having `memfd` naming as a fallback for when `prctl` is not available is enough.
* Drop the requirement for 2-page size and read-only permissions on the header memory pages. These were intended to support "probing as a fallback for naming failure", so they are no longer needed.
* Document an "Updating Protocol" for in-place updates to the process context. This allows efficient updates; in particular, it makes it easier for the reader to detect updates and avoids reparsing `/proc/<pid>/maps`.
I've pushed 3caecfb with the changes described/discussed with @christos68k above. I'm preparing a PR to update the reference C/C++ implementation to match this change; I'll share that one shortly. |
The update to the reference C/C++ implementation is in open-telemetry/sig-profiling#34. As a final quick note, I (and possibly other folks) will be out for the holidays over the next few weeks, so expect discussions to slow down for a bit until we're back in full force in January!
Changes
External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. We propose a mechanism for OpenTelemetry SDKs to publish process-level resource attributes, through a standard format based on Linux anonymous memory mappings.
When an SDK initializes (or updates its resource attributes) it publishes this information to a small, fixed-size memory region that external processes can discover and read. The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from a given process.
Why open as draft:
I'm opening this PR as a draft with the intention of sharing with the Profiling SIG for an extra round of feedback before asking for a wider review.

This OTEP is based on *Sharing Process-Level Resource Attributes with the OpenTelemetry eBPF Profiler*; big thanks to everyone who provided feedback and helped refine the idea so far.
`CHANGELOG.md` file updated for non-trivial changes