
Conversation


@ivoanjo ivoanjo commented Oct 31, 2025

Changes

External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. We propose a mechanism for OpenTelemetry SDKs to publish process-level resource attributes, through a standard format based on Linux anonymous memory mappings.

When an SDK initializes (or updates its resource attributes), it publishes this information to a small, fixed-size memory region that external processes can discover and read. The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from that process.

Why open as draft: I'm opening this PR as a draft with the intention of sharing with the Profiling SIG for an extra round of feedback before asking for a wider review.

This OTEP is based on [Sharing Process-Level Resource Attributes with the OpenTelemetry eBPF Profiler](https://docs.google.com/document/d/1-4jo29vWBZZ0nKKAOG13uAQjRcARwmRc4P313LTbPOE/edit?tab=t.0); big thanks to everyone who provided feedback and helped refine the idea so far.


ivoanjo commented Nov 5, 2025

Marking as ready for review!

@ivoanjo ivoanjo marked this pull request as ready for review November 5, 2025 12:19
@ivoanjo ivoanjo requested review from a team as code owners November 5, 2025 12:19
@tsloughter
Member

So this would be a new requirement for eBPF profiler implementations?

My issue is the lack of safe support for Erlang/Elixir to do this, while something that could just be accessed as a file or socket wouldn't have that issue. We'd have to pull in a third-party library (or implement one ourselves) that is a NIF to make these calls, and that brings in instability many would rather not have, when the goal of our SDK is to not be able to bring down a user's program if the SDK crashes -- unless they specifically configure it to do so.


ivoanjo commented Nov 6, 2025

So this would be a new requirement for eBPF profiler implementations?

No, a hard requirement should not be the goal: for starters, this is Linux-only (for now), so right out of the gate it's not going to be available everywhere.

Having this discussion is exactly why it was included as one of the open questions in the doc 👍


Our view is that we should go for recommended to implement and recommended to enable by default.

In languages/runtimes where it's easy to do so (Go, Rust, Java 22+, possibly Ruby, ...etc?) we should be able to deliver this experience.

For others, such as Erlang/Elixir and Java 8-21 (which require a native library), the goal would be to make it very easy to enable/use for users that want it, but still optional, so as to not impact anyone that is not interested.

We should probably record the above guidance on the OTEP, if/once we're happy with it 🤔

@carlosalberto
Contributor

cc @open-telemetry/specs-entities-approvers for extra eyes

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 15, 2025

External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. This creates several problems:

- **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate profiles with other telemetry (such as traces and spans!) from the same service instance (especially in runtimes that employ multiple processes).
Contributor


Suggested change
- **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate profiles with other telemetry (such as traces and spans!) from the same service instance (especially in runtimes that employ multiple processes).
- **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate various signals with each other.

Author


What do you think about keeping the comment about the runtimes with multiple processes? I think that's one good use-case where it's especially hard to map what the multiple PIDs seen from the outside actually are.

Author


I've tweaked the description here in b1583c6

| Field | Type | Description |
|-------------------|-----------|----------------------------------------------------------------------|
| `signature` | `char[8]` | Set to `"OTEL_CTX"` when the payload is ready (written last) |
| `version` | `uint32` | Format version. Currently `2` (`1` was used for development) |
Contributor


Development versions should not matter at this point as this OTEP is the point of introduction. All previous work is just for experimentation.

Suggested change
| `version` | `uint32` | Format version. Currently `2` (`1` was used for development) |
| `version` | `uint32` | Format version. Currently `1`. |

Author


Starting at 2 would make it really easy to distinguish from the earlier experiments that we deployed in a lot of spots already...

Since a `uint32` leaves space for plenty of different versions, do you see starting at 2 as a big blocker? (I can still remove the comment explaining what 1 was; I agree it's TMI)

Contributor


Starting at 2 is not a blocker to me. It just feels strange that this OTel protocol starts at 2.

Author


Yeah, it's slightly annoying that in most cases v0 is the development one, but in this case we are reserving 0 to mean "not filled in yet", which is why 1 ended up being the development version.
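
For illustration, a rough C sketch of the header layout implied by the table above and by the 32-byte header shown in the reference implementation dump later in this thread (the packing and exact field types are assumptions; the OTEP text is the authoritative definition):

```c
#include <stdint.h>

/* Hypothetical header layout (packed, little-endian), inferred from the
 * reference dump: 8-byte signature, 4-byte version, 8-byte timestamp,
 * 4-byte payload size, 8-byte pointer to the encoded payload = 32 bytes. */
typedef struct __attribute__((packed)) {
    char     signature[8];    /* "OTEL_CTX", written last                 */
    uint32_t version;         /* format version                           */
    uint64_t published_at_ns; /* publication timestamp, in nanoseconds    */
    uint32_t payload_size;    /* size of the protobuf-encoded payload     */
    uint64_t payload;         /* address of the payload in process memory */
} otel_process_ctx_header;
```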


### Publication Protocol

Publishing the context should follow these steps:
Contributor


As context sharing also provides an opportunity for others, what is the idea for OSes other than Linux (or, more generally, OSes that don't have an mmap syscall)?

Author


For Windows, we've experimented at Datadog with using an in-memory file. For macOS it's a bit more nebulous: we can still use mmap, and maybe combine it with mach_vm_region to discover the region?

While this mechanism can be extended to other OSes in the future, our thinking so far was that since the eBPF profiler is Linux-only, the main focus should be on getting Linux support in really amazing shape and then extending as needed later.

8. **Set read-only**: Apply `mprotect(..., PROT_READ)` to mark the mapping as read-only
9. **Name mapping** (Linux ≥5.17): Use `prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ..., "OTEL_CTX")` to name the mapping

The signature MUST be written last to ensure readers never observe incomplete or invalid data. Once the signature is present and the mapping set to read-only, the entire mapping is considered valid and immutable.
Contributor


Would it simplify the publication protocol to require the writer to set published_at_ns to a time in the future, when writing the data is guaranteed to be finished?

Author


I don't think so. In theory a "malicious"/buggy/overloaded scheduler could always schedule out the thread after writing the timestamp and before it finished the rest of the steps...

One really nice property is that the pages are zeroed out by the kernel, so it shouldn't be possible to observe anything other than zeroes or valid data.
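
To make the ordering concrete, here's a minimal C sketch of the publication steps being discussed (the full 9-step list appears later in this thread); error handling is omitted, the packed header layout is the hypothetical one from the earlier sketch, and the barrier is just one language-specific option:

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#define PR_SET_VMA_ANON_NAME 0
#endif

typedef struct __attribute__((packed)) {
    char     signature[8];
    uint32_t version;
    uint64_t published_at_ns;
    uint32_t payload_size;
    uint64_t payload;
} otel_hdr;

static void *publish_context(const uint8_t *encoded, uint32_t encoded_size, uint64_t now_ns) {
    size_t len = 2 * (size_t)sysconf(_SC_PAGESIZE);
    otel_hdr *hdr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  /* pages arrive zeroed */
    if (hdr == MAP_FAILED) return NULL;
    madvise(hdr, len, MADV_DONTFORK);           /* children must not inherit stale data */

    hdr->version         = 2;
    hdr->published_at_ns = now_ns;
    hdr->payload_size    = encoded_size;
    hdr->payload         = (uint64_t)(uintptr_t)encoded;

    atomic_thread_fence(memory_order_release);  /* all fields visible before the signature */
    memcpy(hdr->signature, "OTEL_CTX", 8);      /* signature written last */

    mprotect(hdr, len, PROT_READ);                                 /* set read-only */
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, hdr, len, "OTEL_CTX"); /* Linux >= 5.17 */
    return hdr;
}
```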

@github-actions github-actions bot removed the Stale label Nov 18, 2025
Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>

When an SDK initializes (or updates its resource attributes) it publishes this information to a small, fixed-size memory region that external processes can discover and read.

The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from a given process.
Member


Could you please describe how it would/could/(or won't) work when an application is instrumented with OBI (https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation)?

Author


Thanks for this question!

I researched this and my conclusion is that right now this won't work with OBI.

From what I'm seeing, while it's possible for eBPF programs to write into userspace using bpf_probe_write_user (and this is already used by OBI to support Go tracing), I don't see a way to do the other things listed in the publication protocol, such as allocating (small amounts of) memory, or invoking system calls to set up the naming and the inheritance permissions.

That said, I don't think this would necessarily be a blocker for OBI-to-OTEL eBPF Profiler communication, since we could introduce a specific out-of-band channel between them using the existing kernel eBPF primitives; but given the current limitations of eBPF I don't think we can get OBI to implement this specification on behalf of an instrumented application.

Member


Can you please document it in the OTEP?

Author


Added in 9c8d9ed

Following discussion so far, we can probably avoid having our home-grown
`OtelProcessCtx` and instead use the common OTEL `Resource` message.
ivoanjo added a commit to ivoanjo/sig-profiling that referenced this pull request Dec 1, 2025
This PR adds an experimental C/C++ implementation for the "Process
Context" OTEP being proposed in
open-telemetry/opentelemetry-specification#4719

This implementation previously lived in
https://github.com/ivoanjo/proc-level-demo/tree/main/anonmapping-clib
and as discussed during the OTEL profiling SIG meeting we want to add
it to this repository so it becomes easier to find and contribute to.

I've made sure to include a README explaining how to use it. Here's
the ultra-quick start (Linux-only):

```bash
$ ./build.sh
$ ./build/example_ctx --keep-running
Published: service=my-service, instance=123d8444-2c7e-46e3-89f6-6217880f7123, env=prod, version=4.5.6, sdk=example_ctx.c/c/1.2.3, resources=resource.key1=resource.value1,resource.key2=resource.value2
Continuing forever, to exit press ctrl+c...
TIP: You can now `sudo ./otel_process_ctx_dump.sh 267023` to see the context

 # In another shell
$ sudo ./otel_process_ctx_dump.sh 267023 # Update this to match the PID from above
Found OTEL context for PID 267023
Start address: 756f28ce1000
00000000  4f 54 45 4c 5f 43 54 58  02 00 00 00 0b 68 55 47  |OTEL_CTX.....hUG|
00000010  70 24 7d 18 50 01 00 00  a0 82 6d 7e 6a 5f 00 00  |p$}.P.....m~j_..|
00000020
Parsed struct:
  otel_process_ctx_signature       : "OTEL_CTX"
  otel_process_ctx_version         : 2
  otel_process_ctx_published_at_ns : 1764606693650819083 (2025-12-01 16:31:33 GMT)
  otel_process_payload_size        : 336
  otel_process_payload             : 0x00005f6a7e6d82a0
Payload dump (336 bytes):
00000000  0a 25 0a 1b 64 65 70 6c  6f 79 6d 65 6e 74 2e 65  |.%..deployment.e|
00000010  6e 76 69 72 6f 6e 6d 65  6e 74 2e 6e 61 6d 65 12  |nvironment.name.|
...
Protobuf decode:
attributes {
  key: "deployment.environment.name"
  value {
    string_value: "prod"
  }
}
attributes {
  key: "service.instance.id"
  value {
    string_value: "123d8444-2c7e-46e3-89f6-6217880f7123"
  }
}
attributes {
  key: "service.name"
  value {
    string_value: "my-service"
  }
}
...
```

Note that because the upstream OTEP is still under discussion, this
implementation is experimental and may need changes to match up with
the final version of the OTEP.
As pointed out during review, these don't necessarily exist for some
resources so let's streamline the spec for now.
option go_package = "go.opentelemetry.io/proto/otlp/resource/v1";

// Resource information.
message Resource {
Contributor


Sorry for the late question - but this just popped into my mind:

What is the idea of going forward using message Resource for sharing thread state information or more process internals?

IIRC this approach should also be used later on to provide more information about process internals. But Resource.attributes only holds information covered by the OTel Semantic Conventions.

Author


What is the idea of going forward using message Resource for sharing thread state information or more process internals?

I suspect protobuf will be a bit too heavy/awkward for the thread state payload format BUT my thinking is that anything we put there should otherwise map to/from attributes.

But Resource.attributes only holds information covered by the OTel Semantic Conventions.

Actually I don't think that's the case? I've seen a lot of prior art for custom attributes, so anything we don't think should end up in semantic conventions could stay as a custom attribute. I think? 👀

Author


On a second pass, inspired by https://opentelemetry.io/docs/concepts/resources/#custom-resources I've added a note about custom attributes in 17ec933


- **Inconsistent resource attributes across signals**: Running in different scopes, configuration such as `service.name`, `deployment.environment.name`, and `service.version` is not always available or resolved consistently between the OpenTelemetry SDKs and external readers, leading to configuration drift and inconsistent tagging.

- **Correlation is dependent on process activity**: If a service is blocked (such as when doing slow I/O, or when threads are deadlocked) and not emitting other signals, external readers have difficulty identifying it, since resource attributes or identifiers are only sent along when signals are reported.
Member


Is this relevant to the issue we're trying to solve with this OTEP? Meaning, isn't this problem still going to exist with the eBPF profiler even if we adopt the proposed mechanism? Maybe add a clarification that for the eBPF profiler this behavior is unaffected by the proposed mechanism?

(I don't think we should remove it as it's contextual information but as it's currently listed in Motivation there's room for misunderstanding)

If there's something else you had in mind re: different external reader, feel free to clarify.

Author


The thinking behind this point is two-fold:

  1. off-cpu/wall-time profiling

    My thinking is that since the OTEL eBPF profiler already supports off-cpu profiles, for such samples we would add support for including the process context as well.

    +1 that indeed "can read even when there's no activity" would not impact CPU profiling, since CPU profiling is only concerned about activity.

    If, in the future, wall-time profiling (e.g. a combination of on-cpu and off-cpu) was added to the OTEL eBPF profiler, that would be another use-case for this mechanism.

  2. non-reliance on mechanisms that require activity from the application

    If we were to try to solve the process context problem with an approach of having the application call something from time to time (or once/a few times, after handshaking with the reader), such a solution would be fragile in the presence of applications that are blocked/stuck, or that for some reason stop performing those calls.

    The current solution is not affected by this since the process context setup is intended to be performed once at application start, in a fire-and-forget way, independently of what the reader is doing.


Publishing the context should follow these steps:

1. **Drop existing mapping**: If a previous context was published, unmap/free it
Member

@christos68k christos68k Dec 5, 2025


Do we need to drop the existing mapping? If we keep it fixed, the reader may cache the address for the target process which simplifies checking if the data has been updated (no overhead of re-parsing mappings, this can also help with higher-frequency updates).

Since the payload pointer can point anywhere in target process memory, we'll never be limited by the two-page fixed mapping size (meaning we don't need to grow this mapping to span more pages, either during process runtime or in the future).

Author


Do we need to drop the existing mapping?

Not strictly?

For the existing approach, it's possible to avoid polling mappings to figure out the address by:

  • Checking that published_at_ns can be read and hasn't changed and/or
  • Hooking on prctl calls

Reusing the mapping instead of dropping it does not conflict with the above approaches, but... I think it would complicate concurrency control on the reader. That is, having this invariant allows the reader to know that while the mapping is up, the payload is valid and consistent as far as the writer is concerned.

Member

@christos68k christos68k Dec 8, 2025


If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).

But the point I'm making is more general: The current update protocol mentions that the "previous mapping should be removed" before publishing a new one. If we assume that most implementors abide by this, then the overhead of parsing mappings will be there. For a reader like the eBPF profiler that may have to manage hundreds of processes as a worst case, the overhead of constantly hitting /proc could be significant.

Can we examine the concurrency control edge cases in more detail? It should be possible to provide the same guarantees as now while keeping a fixed mapping.

We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)

Author

@ivoanjo ivoanjo Dec 8, 2025


If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).

I think it is! Let me try to convince you ;)

After a context gets dropped one of two things happens:

a) The mapping becomes invalid. This would make reads return an error, which would be a clear indication of not valid.

b) A new mapping (OTel or not) gets put in its place. Reads to the old location of published_at_ns would return whatever's there now. Note that this would not be published_at_ns, because the kernel zeroes out memory before mapping it (e.g. this is not regular malloc/free), and thus I don't think it's possible for leftover garbage to exist to confuse the reader. (Edit: And thus the reader will know what it read is not valid)

For a reader like eBPF profiler that may have to manage hundreds of processes as a worst case, that overhead of constantly hitting /proc could be significant.

The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking prctl.

(The reader may even choose to do time-based caching, e.g. read the context and reuse it for the next N seconds/minutes, rather than trying to always have the latest up-to-date version if it wants to even save more reads)

We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)

To be clear, I believe prctl is not needed at all to be able to follow invalidation of existing contexts/creation of new ones, it's a fully optional possibility.

We could even completely omit references to hooking on prctl in the current spec -- but I think it's an interesting feature to document in the spec for readers that want to use it.
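
For illustration, a hedged C sketch of the cheap re-check described above (the helper name and the `published_at_ns` offset are assumptions based on the header layout discussed earlier in the thread):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Once the reader has discovered a context, it can cheaply re-check it by
 * reading only published_at_ns at the cached remote address (offset 12 in
 * the assumed packed header). A failed read means the mapping was unmapped;
 * a zero or changed value means the cached context is no longer the one that
 * was read before, so /proc/<pid>/maps needs to be re-scanned. */
static int context_still_valid(pid_t pid, uint64_t remote_hdr_addr, uint64_t last_seen_ns) {
    uint64_t now_ns = 0;
    struct iovec local  = { .iov_base = &now_ns, .iov_len = sizeof(now_ns) };
    struct iovec remote = { .iov_base = (void *)(uintptr_t)(remote_hdr_addr + 12),
                            .iov_len  = sizeof(now_ns) };
    if (process_vm_readv(pid, &local, 1, &remote, 1, 0) != (ssize_t)sizeof(now_ns))
        return 0;
    return now_ns != 0 && now_ns == last_seen_ns;
}
```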

Member

@christos68k christos68k Dec 8, 2025


The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking prctl.

True, that speeds up detection but the overhead of parsing mappings before fetching an update is still there. It also makes for a more complicated update protocol, maybe limiting update operation frequency.

Advantages for keeping the mapping fixed:

  1. Simpler publisher logic
  2. Simpler reader logic
  3. Minimal (non-existent after mapping is first detected in the reader) accessing and processing /proc/ overhead
  4. Scales to thousands of processes
  5. Scales to higher frequency updates, minimizing possibility of stale data

Can we clarify the disadvantages?

Member


Yeah I agree this or something like this can be made to work. (I didn't quite get the reference to mprotect? Do you mean just "I'm omitting the mprotect parts?")

Yeah, to keep the focus on the lock-free data exchange part.

Member


I got some numbers to ground my assumptions in actuality, and yes, it does seem like the extra condition will make a difference (so flipping the RO status, or a different way to cut down on the number of mappings, would be needed):

We did experiment at the beginning with making the permissions no-read, no-write, execute, which is a really odd combination (and thus very, very rare), and almost got away with it, but discovered that there are two paths for reading memory from another process in the Linux kernel, and process_vm_readv actually goes through the path that respects page permissions, which made this approach a bit more awkward.

Another option that avoids mprotect can use MAP_FIXED and an address generating scheme based on a deterministic pattern. We have terabytes of mostly unused address space to play with.

Author

@ivoanjo ivoanjo Dec 11, 2025


I've been staring at my notebook for a while and maybe have an idea for making "reusing the mapping" work.

Considering the current "publication protocol" in the spec:

Publishing the context should follow these steps:

  1. Drop existing mapping: If a previous context was published, unmap/free it
  2. Allocate new mapping: Create a 2-page anonymous mapping via mmap() (These pages are always zeroed by Linux)
  3. Prevent fork inheritance: Apply madvise(..., MADV_DONTFORK) to prevent child processes from inheriting stale data
  4. Encode payload: Serialize the payload message using protobuf (storing it either following the header OR in a separate memory allocation)
  5. Write header fields: Populate version, published_at_ns, payload_size, payload
  6. Memory barrier: Use language/compiler-specific techniques to ensure all previous writes complete before proceeding
  7. Write signature: Write OTEL_CTX to the signature field last
  8. Set read-only: Apply mprotect(..., PROT_READ) to mark the mapping as read-only
  9. Name mapping: Use prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ..., "OTEL_CTX") to name the mapping. This step should be done unconditionally, although naming mappings is not always supported by the kernel.

We can document the update protocol as not being "destroy old mapping + create new one", as you suggested. Update: AND, instead, do an in-place update as described below.

To support it, we could go through the publishing steps in reverse. That is, a process that wants to update its mappings goes like this:

i. Undo 9 -- removes name
ii. Undo 8 -- sets memory back to R/W
iii. Undo 7 -- zeroes signature
iv. Barrier, like in 6
v. Undo 5 -- zero fields
vi. Barrier
vii. Then start again from 4 as if this was a new mapping

I believe in this case the reading protocol would not need to change, since it already says

  1. Validate signature and version:

    • Read the header and verify first 8 bytes matches OTEL_CTX
    • Read the version field and verify it is supported (currently 2)
    • If either check fails, skip this mapping
  2. Read payload: Read payload_size bytes starting after the header

  3. Re-read header: If the header has not changed, the read of header + payload is consistent. This ensures there were no concurrent changes to the process context. If the header changed, restart at 1.

A reader that observes the original mapping or the fully updated one will work as expected.

A reader that is trying to locate the mapping will find an invalid mapping because some of the fields of the mapping will be zero, and it's not valid for them to be zero. So it'll skip the mapping.

A reader that already located the mapping and is polling for updates:

  • If it observes the state at step i, it reads the old content correctly
  • If it observes the state at step ii, it reads the old content correctly
  • If it observes the state at step iii/iv, it will find the header to be invalid (zeroed signature) and will know an update is ongoing/the context is not valid
  • If it observes the state at step v/vi, the signature is still invalid
  • As the usual publish protocol proceeds, only after step 7 will the mapping be valid again

Furthermore, we already state that the header gets read twice, and this would make sure that if the reader reads the full old header, and then the new update starts while it's reading the payload, then when the reader looks at the header again it'll see that it's not the same as the old one (fields either have a different value or are zero).

(This approach supports the payload-after-header because the payload only starts being modified after zeroing out the header and thus the reader can tell, when it re-reads the header, that it's no longer valid)

Thoughts?

Member

@christos68k christos68k Dec 12, 2025


Isn't the counter-based example a lot simpler for publish/update/read, as:

  1. We don't need to modify the signature (no zero out and write again cost)
  2. We don't need to modify the name (no remove and write again cost)
  3. We don't need to zero out the fields (timestamp)
  4. We only need to read the counter again (on the publisher side: very cheap 64bit write operation) to establish if fetching the update was complete / not interrupted

Also if memfd_create is viable, we could (in addition to the above) get rid of the mprotect operations. This would give us a protocol that we could also leverage / build upon for (possibly higher-frequency) thread context updates.

Author


Isn't the counter-based example a lot simpler for publish/update/read, as:

  1. We don't need to modify the signature (no zero out and write again cost)
  2. We don't need to modify the name (no remove and write again cost)
  3. We don't need to zero out the fields (timestamp)
  4. We only need to read the counter again (on the publisher side: very cheap 64bit write operation) to establish if fetching the update was complete / not interrupted

Compared to the counter approach, the "do a few things backwards approach":

  1. Allows us to end up with the expected mprotect flags at the end, meaning finding the context is still as cheap as in the current OTEP doc
  2. It still allows hooking on prctl to detect updates
  3. It does not require changes to readers -- the existing spec covers these kinds of operations too

For these reasons, I think the "do a few things backwards approach" fits a bit better, but happy to discuss/flesh it out if you're not convinced.

Also if memfd_create is viable, we could (in addition to the above) get rid of the mprotect operations. This would give us a protocol that we could also leverage / build upon for (possibly higher-frequency) thread context updates.

I'll comment on memfd as an alternative separately, there's a different set of trade-offs for that one.
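
For completeness, a rough reader-side C sketch of the reading protocol quoted above (validate the header, read the payload, re-read the header), assuming the remote header address is already known and the hypothetical packed header layout from earlier in the thread:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical packed header layout, as sketched earlier in the thread. */
typedef struct __attribute__((packed)) {
    char     signature[8];
    uint32_t version;
    uint64_t published_at_ns;
    uint32_t payload_size;
    uint64_t payload;
} otel_hdr;

static int read_remote(pid_t pid, uint64_t addr, void *buf, size_t len) {
    struct iovec local  = { .iov_base = buf, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)(uintptr_t)addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0) == (ssize_t)len ? 0 : -1;
}

/* Returns a malloc'd copy of the payload (caller frees) or NULL. */
static uint8_t *read_context(pid_t pid, uint64_t hdr_addr, uint32_t *size_out) {
    for (int attempt = 0; attempt < 3; attempt++) {
        otel_hdr before, after;
        if (read_remote(pid, hdr_addr, &before, sizeof(before)) != 0) return NULL;
        if (memcmp(before.signature, "OTEL_CTX", 8) != 0 || before.version != 2) return NULL;

        uint8_t *payload = malloc(before.payload_size);
        if (payload == NULL) return NULL;
        if (read_remote(pid, before.payload, payload, before.payload_size) != 0 ||
            read_remote(pid, hdr_addr, &after, sizeof(after)) != 0 ||
            memcmp(&before, &after, sizeof(before)) != 0) {
            free(payload);   /* header changed mid-read (or became unreadable): retry */
            continue;
        }
        *size_out = before.payload_size;
        return payload;      /* header + payload were read consistently */
    }
    return NULL;
}
```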

felixge pushed a commit to open-telemetry/sig-profiling that referenced this pull request Dec 10, 2025
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
Member

@felixge felixge left a comment


Left a comment, but overall LGTM. Happy to approve once the open discussion threads have been resolved.

5. **Write header fields**: Populate `version`, `published_at_ns`, `payload_size`, `payload`
6. **Memory barrier**: Use language/compiler-specific techniques to ensure all previous writes complete before proceeding
7. **Write signature**: Write `OTEL_CTX` to the signature field last
8. **Set read-only**: Apply `mprotect(..., PROT_READ)` to mark the mapping as read-only
Member


Why does the signature need to be written after the memory barrier? Shouldn't the transition to PROT_READ status be atomic? If that's guaranteed to be ordered after all writes to the map, we should be good?

Author


In practice, it should be as you say. In theory... the mprotect is done only on the mapping, not on the payload, so a given language's memory model might be ambiguous on "if you observe the read-only mapping, will the payload be there too" and thus it seems worth it to slightly over-specify it here?

Not an extremely strong reason, I agree -- we could probably simplify this if needed.


christos68k commented Dec 11, 2025

In light of our discussion here, I ran some more experiments. Listing possible fallbacks/alternatives to read-only mapping for when PR_SET_VMA_ANON_NAME is not available:

  • Deterministic address generation scheme (fast check based on address pattern). Relies on MAP_FIXED_NOREPLACE [Kernel 4.17+] (see here for something based on a similar premise I did a long time ago)
  • memfd_create

Example code for the latter (may need more investigation but it looks promising and should be widely available?).

$ ./memfd
PID: 52591 MMAP at: 0x7fd86f5d5000
Forking...
Child PID: 52592
OTEL reader: OTEL publisher data
7fd86f5d5000-7fd86f5d7000 rw-p 00000000 00:01 6178                       /memfd:OTELCTX (deleted)

EDIT: Using memfd_create and an inline payload, also allows a reader process to (optionally) mmap the target region into its own address space. See updated example code here.

$ ./memfd-mmap 
[Writer] PID: 98784 FD: 3 MMAP at: 0x7fe176430000
Forking...
[Reader] PID: 98785 FD_NUM: 3
[Reader] process_vm_readv: OTEL publisher data
[Reader] mmap: OTEL publisher data
[parent]
7fe176430000-7fe176432000 rw-s 00000000 00:01 6187                       /memfd:OTELCTX (deleted)
[child]
7fe176430000-7fe176432000 r--s 00000000 00:01 6187                       /memfd:OTELCTX (deleted)
[Reader] mmap: OTEL publisher data
[Reader] mmap: OTEL publisher data
[Writer] exit
$ [Reader] mmap: OTEL publisher updated data
[Reader] mmap: OTEL publisher updated data
[Reader] mmap: OTEL publisher updated data
[Reader] exit


ivoanjo commented Dec 15, 2025

In light of our discussion here, I ran some more experiments. Listing possible fallbacks/alternatives to read-only mapping for when PR_SET_VMA_ANON_NAME is not available:

  • memfd_create

Example code for the latter (may need more investigation but it looks promising and should be widely available?).

👋 So funny that you mention memfd 😀. My colleagues at Datadog have actually previously built something close to what this OTEP proposes, using memfd, although in a slightly different way than your gist (of note, not using mmap together with memfd).

The main reasons why we moved away from it for this OTEP were:

  • We were concerned that custom seccomp profiles for containery things can/could block memfd
  • Dealing with forks; although maybe with the "mmap over memfd" that would no longer be an issue (?)
  • Dealing with reading: We actually used /proc/pid/fd to find the context, not /proc/pid/maps
  • Dealing with updates: again since we didn't use "mmap over memfd" updating was different

I think both approaches are quite similar, especially when involving mmap. (E.g. I suspect we could as well mmap the region into the reader with the current approach in the OTEP. I do think we'd need some kind of cleanup mechanism to detect when the owner of a mapping has gone away, as otherwise I suspect the reader will keep the mapping alive?).

So in a way it's more a question of which combination of building blocks we want to use (or mix) 🤔👋:

  • Start from an anonymous mapping or start from a memfd? or mix/fallback from one to another?
  • Use mmap or not
  • How to find the context: name in maps? property of pages? look at fds? (or combination/fallback)
  • Recreate to update or mutate in place
  • If mutating in place, how exactly does that mutation work
  • Dealing with some of the concurrency/forks/some of the other details

We at Datadog spent some time exploring the solution space, and tried to come up with a combination of the above that seemed reasonable, given the constraints (and tried to document that as well).

But yeah, I won't say it's not possible to do any of the above in a slightly different way, especially given most have trade-offs and there's not been a very clear above-the-rest winner on most points. 😅


christos68k commented Dec 15, 2025

But yeah, I won't say it's not possible to do any of the above in a slightly different way, especially given most have trade-offs and there's not been a very clear above-the-rest winner on most points. 😅

I think we can come up with something that's flexible but also remains simple for the simple use-cases. The main advantage of memfd_create is that we won't need a read-only page-permission (or other page-property) fallback, as it's available on all kernel versions we care about. Secondarily, it allows for easy mmap in a reader process (alternative ways to do that are an actual file on the filesystem, which is tricky with containers, or shared memory of some sort, whether System V or POSIX).

To support wakeups instead of polling, without hooking prctl, we can use eventfd or even futex (these are also not tied to memfd_create, can be used with any of the other options we discussed).

I think that the scheme we end up with in OTel should at least meet the following three criteria:

  • As simple as it gets
  • Doesn't recreate the mapping on every update (allows readers to cache mapping address and skip /proc after context is first established)
  • Doesn't rely on hooking prctl for one-to-many reader wakeups (but also, doesn't require polling)

Based on all the options we laid out, I think that's doable.

Optionally (we probably need to expand the scope to thread context to figure out requirements / pick through the following):

  • Allow for reader to mmap the region (also means mapping needs to be MAP_SHARED instead of MAP_PRIVATE [2])
  • Inline payload updates (if we allow mmap this becomes a requirement)
  • Allow for variable (high/low) frequency of updates

[1] Regarding forking, there's a race condition between mmap and madvise(MADV_DONTFORK) which may infrequently manifest, as we don't control the code running inside the publisher process (meaning, we can not avoid a fork taking place after our mmap and before our madvise). However, if we start with a MAP_PRIVATE mapping and only write the fixed header after madvise, we can guarantee that the inherited mapping will never pass verification in a forked process and thus will be skipped by readers.

[2] If we allow mmap, we need to use a MAP_SHARED mapping. AFAIK it's not possible in Linux to start with a private mapping, call madvise, switch to a shared mapping and have madvise take effect for the latter mapping (unmapping the first mapping will destroy the VMA that madvise affected). Instead, we can add a PID field to the fixed header that readers can use during verification to skip the mapping.


christos68k commented Dec 18, 2025

Update for extra context: @ivoanjo and I had a Zoom sync today where we talked about simplifying the current proposal by:

  • Defaulting to memfd_create and having PR_SET_VMA_ANON_NAME as a fallback (@ivoanjo discovered that prctl can still be used with memfd mappings which means we can have a consistent name through both approaches)
  • Removing the search for the mapping based on read-only status
  • Removing the need to flip mapping between read-only and read-write
  • Keeping the mapping fixed in memory instead of recreating it on each update resulting in different address
  • For lock-free updates that maintain payload consistency, use a counter scheme. To avoid introducing an extra field just for the counter, use the existing timestamp (TODO: unixtime isn't strictly monotonic but this shouldn't affect the scheme).
  • Keep prctl as a notification method (it's not strictly needed on the part of the readers which can choose to ignore it and poll at their own frequency)

For future discussion:

  • Clarify if we need/want to allow reader mmap (introduces inline payload requirement)
  • Clarify if we need/want alternative one-to-many userspace-only notification mechanism (e.g. eventfd)
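
For illustration, a rough C sketch of the timestamp-as-counter idea (a seqlock-like pattern), shown for the simple case where the region is directly addressable (e.g. the reader mmaps the memfd); whether the timestamp gets zeroed during updates and how the reader accesses the memory are still open details:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a single shared region as a reader would see it if it mapped the
 * memfd into its own address space; sizes here are illustrative only. */
typedef struct {
    _Atomic uint64_t published_at_ns;
    uint32_t payload_size;
    uint8_t  payload[1024];
} shared_ctx;

/* Writer: clear the timestamp while updating, write it back only once the
 * payload is consistent again. */
static void update_ctx(shared_ctx *ctx, const uint8_t *data, uint32_t size, uint64_t now_ns) {
    atomic_store_explicit(&ctx->published_at_ns, 0, memory_order_release); /* "update in progress" */
    atomic_thread_fence(memory_order_release);
    memcpy(ctx->payload, data, size);
    ctx->payload_size = size;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ctx->published_at_ns, now_ns, memory_order_release);
}

/* Reader: accept the payload only if the timestamp is non-zero and unchanged
 * across the read; `out` is assumed large enough, and the caller retries on -1. */
static int read_ctx(shared_ctx *ctx, uint8_t *out, uint32_t *size_out) {
    uint64_t before = atomic_load_explicit(&ctx->published_at_ns, memory_order_acquire);
    if (before == 0) return -1;                 /* not published yet, or mid-update */
    uint32_t size = ctx->payload_size;
    memcpy(out, ctx->payload, size);
    atomic_thread_fence(memory_order_acquire);
    uint64_t after = atomic_load_explicit(&ctx->published_at_ns, memory_order_acquire);
    if (after != before) return -1;             /* concurrent update: retry */
    *size_out = size;
    return 0;
}
```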

… find mappings

After discussion in the PR and great suggestions/experiments from
@christos68k, the specification has been updated as such:

* Instead of always using an anonymous mapping, try first to
  create a memfd and create a mapping from the memfd.

  If due to security restrictions memfd is not available, fall
  back to an anonymous mapping instead.

* Remove probing as a fallback for when naming a mapping fails.

  Because the name of a memfd also shows up in `/proc/<pid>/maps`,
  we expect that having `memfd` naming as a fallback for when
  `prctl` is not available is enough.

* Drop requirement for 2-page size and read-only permissions on
  the header memory pages.

  These were intended to support the "probing as a fallback for
  naming failure", so they are no longer needed.

* Document "Updating Protocol" for in-place updates to process
  context.

  This allows efficient updates. In particular, it makes it easier
  for the reader to detect updates and avoids reparsing
  `/proc/<pid>/maps` for updates.
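
A minimal C sketch of the updated creation flow described above (memfd first, anonymous mapping as a fallback, naming attempted via prctl); names, flags, and error handling are simplified assumptions rather than the final spec:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#define PR_SET_VMA_ANON_NAME 0
#endif

/* Try memfd first (the name shows up in /proc/<pid>/maps as
 * "/memfd:OTEL_CTX (deleted)"); if memfd is blocked, e.g. by seccomp, fall
 * back to a plain anonymous mapping. Naming via prctl is attempted
 * unconditionally and its failure is ignored on kernels < 5.17. */
static void *create_ctx_mapping(size_t len) {
    void *mem = MAP_FAILED;
    int fd = memfd_create("OTEL_CTX", MFD_CLOEXEC);     /* glibc >= 2.27 */
    if (fd >= 0 && ftruncate(fd, (off_t)len) == 0) {
        mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }
    if (fd >= 0) close(fd);                             /* the mapping keeps the memfd alive */
    if (mem == MAP_FAILED) {
        mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); /* fallback */
        if (mem == MAP_FAILED) return NULL;
    }
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, mem, len, "OTEL_CTX");
    return mem;
}
```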

ivoanjo commented Dec 18, 2025

I've pushed 3caecfb with the changes described/discussed with @christos68k above.

I'm preparing a PR to update the reference C/C++ implementation to match this change; I'll share that one shortly.


ivoanjo commented Dec 18, 2025

The update to the reference C/C++ implementation is in open-telemetry/sig-profiling#34 .

As a final quick note, I (and possibly other folks) will be out for the holidays over the next few weeks, so expect discussions to slow down for a bit until we're back in full force in January!
