
Conversation


@ivoanjo ivoanjo commented Oct 31, 2025

Changes

External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. We propose a mechanism for OpenTelemetry SDKs to publish process-level resource attributes, through a standard format based on Linux anonymous memory mappings.

When an SDK initializes (or updates its resource attributes), it publishes this information to a small, fixed-size memory region that external processes can discover and read. The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from that process.

Why open as draft: I'm opening this PR as a draft with the intention of sharing with the Profiling SIG for an extra round of feedback before asking for a wider review.

This OTEP is based on [Sharing Process-Level Resource Attributes with the OpenTelemetry eBPF Profiler](https://docs.google.com/document/d/1-4jo29vWBZZ0nKKAOG13uAQjRcARwmRc4P313LTbPOE/edit?tab=t.0); big thanks to everyone who provided feedback and helped refine the idea so far.


ivoanjo commented Nov 5, 2025

Marking as ready for review!

@ivoanjo ivoanjo marked this pull request as ready for review November 5, 2025 12:19
@ivoanjo ivoanjo requested review from a team as code owners November 5, 2025 12:19
@tsloughter
Member

So this would be a new requirement for eBPF profiler implementations?

My issue is the lack of safe support for Erlang/Elixir to do this, while something that could just be accessed as a file or socket wouldn't have that issue. We'd have to pull in a third-party library (or implement one ourselves) that is a NIF to make these calls, and that brings in instability many would rather not have, when the goal of our SDK is to not be able to bring down a user's program if the SDK crashes -- unless they specifically configure it to do so.


ivoanjo commented Nov 6, 2025

So this would be a new requirement for eBPF profiler implementations?

No, a hard requirement should not be the goal: for starters, this is Linux-only (for now), so right out of the gate it's not going to be available everywhere.

Having this discussion is exactly why it was included as one of the open questions in the doc 👍


Our view is that we should go for recommended to implement and recommended to enable by default.

In languages/runtimes where it's easy to do so (Go, Rust, Java 22+, possibly Ruby, ...etc?) we should be able to deliver this experience.

For others, such as Erlang/Elixir and Java 8-21 (which require a native library), the goal would be to make it very easy to enable/use for users that want it, but still optional, so as to not impact anyone that is not interested.

We should probably record the above guidance on the OTEP, if/once we're happy with it 🤔

@carlosalberto
Contributor

cc @open-telemetry/specs-entities-approvers for extra eyes

@github-actions

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 15, 2025

External readers like the OpenTelemetry eBPF Profiler operate outside the instrumented process and cannot access resource attributes configured within OpenTelemetry SDKs. This creates several problems:

- **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate profiles with other telemetry (such as traces and spans!) from the same service instance (especially in runtimes that employ multiple processes).
Contributor


Suggested change
- **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate profiles with other telemetry (such as traces and spans!) from the same service instance (especially in runtimes that employ multiple processes).
- **Missing cross-signal correlation identifiers**: Runtime-generated attributes ([`service.instance.id`](https://opentelemetry.io/docs/specs/semconv/registry/attributes/service/#service-instance-id) being a key example) are often inaccessible to external readers, making it hard to correlate various signals with each other.

Author


What do you think about keeping the comment about the runtimes with multiple processes? I think that's one good use-case where it's especially hard to map what the multiple PIDs seen from the outside actually are.

Author


I've tweaked the description here in b1583c6

| Field | Type | Description |
|-------------------|-----------|----------------------------------------------------------------------|
| `signature` | `char[8]` | Set to `"OTEL_CTX"` when the payload is ready (written last) |
| `version` | `uint32` | Format version. Currently `2` (`1` was used for development) |
Contributor


Development versions should not matter at this point as this OTEP is the point of introduction. All previous work is just for experimentation.

Suggested change
| `version` | `uint32` | Format version. Currently `2` (`1` was used for development) |
| `version` | `uint32` | Format version. Currently `1`. |

Author


Starting at 2 would make it really easy to distinguish from the earlier experiments that we deployed in a lot of spots already...

Since a `uint32` leaves space for plenty of different versions, do you see starting at 2 as a big blocker? (I can still remove the comment explaining what 1 was; I agree it's TMI)

Contributor


Starting at 2 is not a blocker to me. It just feels strange that this OTel protocol starts at 2.

Author


Yeah, it's slightly annoying that in most cases v0 is the development one, but in this case we are reserving 0 to mean "not filled in yet", which is why 1 ended up being the development version.
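
For illustration, a rough C sketch of the header layout implied by the table above and by the 32-byte header shown in the reference implementation dump later in this thread (the packing and exact field types are assumptions; the OTEP text is the authoritative definition):

```c
#include <stdint.h>

/* Hypothetical header layout (packed, little-endian), inferred from the
 * reference dump: 8-byte signature, 4-byte version, 8-byte timestamp,
 * 4-byte payload size, 8-byte pointer to the encoded payload = 32 bytes. */
typedef struct __attribute__((packed)) {
    char     signature[8];    /* "OTEL_CTX", written last                 */
    uint32_t version;         /* format version                           */
    uint64_t published_at_ns; /* publication timestamp, in nanoseconds    */
    uint32_t payload_size;    /* size of the protobuf-encoded payload     */
    uint64_t payload;         /* address of the payload in process memory */
} otel_process_ctx_header;
```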


### Publication Protocol

Publishing the context should follow these steps:
Contributor


As context sharing also provides an opportunity for others, what is the idea for OSes other than Linux (or, more generally, OSes that don't have an mmap syscall)?

Author


For Windows, we've experimented at Datadog with using an in-memory file. For macOS it's a bit more nebulous: we can still use mmap, and maybe combine it with mach_vm_region to discover the region?

While this mechanism can be extended to other OSes in the future, our thinking so far was that since the eBPF profiler is Linux-only, the main focus should be on getting Linux support in really amazing shape and then extending as needed later.

8. **Set read-only**: Apply `mprotect(..., PROT_READ)` to mark the mapping as read-only
9. **Name mapping** (Linux ≥5.17): Use `prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ..., "OTEL_CTX")` to name the mapping

The signature MUST be written last to ensure readers never observe incomplete or invalid data. Once the signature is present and the mapping set to read-only, the entire mapping is considered valid and immutable.
Contributor


Would it simplify the publication protocol to require the writer to set published_at_ns to a time in the future, when writing the data is guaranteed to be finished?

Author


I don't think so. In theory a "malicious"/buggy/overloaded scheduler could always schedule out the thread after writing the timestamp and before it finished the rest of the steps...

One really nice property is that the pages are zeroed out by the kernel, so it shouldn't be possible to observe anything other than zeroes or valid data.
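
To make the ordering concrete, here's a minimal C sketch of the publication steps being discussed (the full 9-step list appears later in this thread); error handling is omitted, the packed header layout is the hypothetical one from the earlier sketch, and the barrier is just one language-specific option:

```c
#define _GNU_SOURCE
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <unistd.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#define PR_SET_VMA_ANON_NAME 0
#endif

typedef struct __attribute__((packed)) {
    char     signature[8];
    uint32_t version;
    uint64_t published_at_ns;
    uint32_t payload_size;
    uint64_t payload;
} otel_hdr;

static void *publish_context(const uint8_t *encoded, uint32_t encoded_size, uint64_t now_ns) {
    size_t len = 2 * (size_t)sysconf(_SC_PAGESIZE);
    otel_hdr *hdr = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);  /* pages arrive zeroed */
    if (hdr == MAP_FAILED) return NULL;
    madvise(hdr, len, MADV_DONTFORK);           /* children must not inherit stale data */

    hdr->version         = 2;
    hdr->published_at_ns = now_ns;
    hdr->payload_size    = encoded_size;
    hdr->payload         = (uint64_t)(uintptr_t)encoded;

    atomic_thread_fence(memory_order_release);  /* all fields visible before the signature */
    memcpy(hdr->signature, "OTEL_CTX", 8);      /* signature written last */

    mprotect(hdr, len, PROT_READ);                                 /* set read-only */
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, hdr, len, "OTEL_CTX"); /* Linux >= 5.17 */
    return hdr;
}
```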

@github-actions github-actions bot removed the Stale label Nov 18, 2025
Co-authored-by: Florian Lehner <florianl@users.noreply.github.com>

When an SDK initializes (or updates its resource attributes) it publishes this information to a small, fixed-size memory region that external processes can discover and read.

The OTEL eBPF profiler will then, upon observing a previously-unseen process, probe and read this information, associating it with any profiling samples taken from a given process.
Member


Could you please describe how it would/could/(or won't) work when an application is instrumented with OBI (https://github.com/open-telemetry/opentelemetry-ebpf-instrumentation)?

Author


Thanks for this question!

I researched this and my conclusion is that right now this won't work with OBI.

From what I'm seeing, while it's possible for eBPF programs to write into userspace using bpf_probe_write_user (and this is already used by OBI to support Go tracing), I don't see a way to do the other things listed in the publication protocol, such as allocating (small amounts of) memory, or invoking system calls to set up the naming and the inheritance permissions.

That said, I don't think this would necessarily be a blocker for OBI-to-OTEL eBPF Profiler communication, since we could introduce a specific out-of-band channel between them using the existing kernel eBPF primitives; but given the current limitations of eBPF I don't think we can get OBI to implement this specification on behalf of an instrumented application.

Member


Can you please document it in the OTEP?

Author


Added in 9c8d9ed

Following discussion so far, we can probably avoid having our home-grown
`OtelProcessCtx` and instead use the common OTEL `Resource` message.
ivoanjo added a commit to ivoanjo/sig-profiling that referenced this pull request Dec 1, 2025
This PR adds an experimental C/C++ implementation for the "Process
Context" OTEP being proposed in
open-telemetry/opentelemetry-specification#4719

This implementation previously lived in
https://github.com/ivoanjo/proc-level-demo/tree/main/anonmapping-clib
and as discussed during the OTEL profiling SIG meeting we want to add
it to this repository so it becomes easier to find and contribute to.

I've made sure to include a README explaining how to use it. Here's
the ultra-quick start (Linux-only):

```bash
$ ./build.sh
$ ./build/example_ctx --keep-running
Published: service=my-service, instance=123d8444-2c7e-46e3-89f6-6217880f7123, env=prod, version=4.5.6, sdk=example_ctx.c/c/1.2.3, resources=resource.key1=resource.value1,resource.key2=resource.value2
Continuing forever, to exit press ctrl+c...
TIP: You can now `sudo ./otel_process_ctx_dump.sh 267023` to see the context

 # In another shell
$ sudo ./otel_process_ctx_dump.sh 267023 # Update this to match the PID from above
Found OTEL context for PID 267023
Start address: 756f28ce1000
00000000  4f 54 45 4c 5f 43 54 58  02 00 00 00 0b 68 55 47  |OTEL_CTX.....hUG|
00000010  70 24 7d 18 50 01 00 00  a0 82 6d 7e 6a 5f 00 00  |p$}.P.....m~j_..|
00000020
Parsed struct:
  otel_process_ctx_signature       : "OTEL_CTX"
  otel_process_ctx_version         : 2
  otel_process_ctx_published_at_ns : 1764606693650819083 (2025-12-01 16:31:33 GMT)
  otel_process_payload_size        : 336
  otel_process_payload             : 0x00005f6a7e6d82a0
Payload dump (336 bytes):
00000000  0a 25 0a 1b 64 65 70 6c  6f 79 6d 65 6e 74 2e 65  |.%..deployment.e|
00000010  6e 76 69 72 6f 6e 6d 65  6e 74 2e 6e 61 6d 65 12  |nvironment.name.|
...
Protobuf decode:
attributes {
  key: "deployment.environment.name"
  value {
    string_value: "prod"
  }
}
attributes {
  key: "service.instance.id"
  value {
    string_value: "123d8444-2c7e-46e3-89f6-6217880f7123"
  }
}
attributes {
  key: "service.name"
  value {
    string_value: "my-service"
  }
}
...
```

Note that because the upstream OTEP is still under discussion, this
implementation is experimental and may need changes to match up with
the final version of the OTEP.
As pointed out during review, these don't necessarily exist for some
resources so let's streamline the spec for now.
option go_package = "go.opentelemetry.io/proto/otlp/resource/v1";

// Resource information.
message Resource {
Contributor


Sorry for the late question - but this just popped into my mind:

What is the idea of going forward using message Resource for sharing thread state information or more process internals?

IIRC this approach should also be used later on to provide more information about process internals. But Resource.attributes only holds information covered by the OTel Semantic Conventions.

Author


What is the idea of going forward using message Resource for sharing thread state information or more process internals?

I suspect protobuf will be a bit too heavy/awkward for the thread state payload format BUT my thinking is that anything we put there should otherwise map to/from attributes.

But Resource.attributes only holds information covered by the OTel Semantic Conventions.

Actually I don't think that's the case? I've seen a lot of prior art for custom attributes, so anything we don't think should end up in semantic conventions could stay as a custom attribute. I think? 👀

Author


On a second pass, inspired by https://opentelemetry.io/docs/concepts/resources/#custom-resources I've added a note about custom attributes in 17ec933


- **Inconsistent resource attributes across signals**: Running in different scopes, configuration such as `service.name`, `deployment.environment.name`, and `service.version` is not always available or resolved consistently between the OpenTelemetry SDKs and external readers, leading to configuration drift and inconsistent tagging.

- **Correlation is dependent on process activity**: If a service is blocked (such as when doing slow I/O, or when threads are deadlocked) and not emitting other signals, external readers have difficulty identifying it, since resource attributes or identifiers are only sent along when signals are reported.
Member


Is this relevant to the issue we're trying to solve with this OTEP? Meaning, isn't this problem still going to exist with the eBPF profiler even if we adopt the proposed mechanism? Maybe add a clarification that for the eBPF profiler this behavior is unaffected by the proposed mechanism?

(I don't think we should remove it as it's contextual information but as it's currently listed in Motivation there's room for misunderstanding)

If there's something else you had in mind re: different external reader, feel free to clarify.

Author


The thinking behind this point is two-fold:

  1. off-cpu/wall-time profiling

    My thinking is that since the OTEL eBPF profiler already supports off-cpu profiles, for such samples we would add support for including the process context as well.

    +1 that indeed "can read even when there's no activity" would not impact CPU profiling, since CPU profiling is only concerned about activity.

    If, in the future, wall-time profiling (e.g. a combination of on-cpu and off-cpu) was added to the OTEL eBPF profiler, that would be another use-case for this mechanism.

  2. non-reliance on mechanisms that require activity from the application

    If we were to try to solve the process context problem with an approach of having the application call something from time to time (or once/a few times, after handshaking with the reader), such a solution would be fragile in the presence of applications that are blocked/stuck, or that for some reason stop performing those calls.

    The current solution is not affected by this since the process context setup is intended to be performed once at application start, in a fire-and-forget way, independently of what the reader is doing.


Publishing the context should follow these steps:

1. **Drop existing mapping**: If a previous context was published, unmap/free it
Member

@christos68k christos68k Dec 5, 2025


Do we need to drop the existing mapping? If we keep it fixed, the reader may cache the address for the target process which simplifies checking if the data has been updated (no overhead of re-parsing mappings, this can also help with higher-frequency updates).

Since the payload pointer can point anywhere in target process memory, we'll never be limited by the two-page fixed mapping size (meaning we don't need to grow this mapping to span more pages, either during process runtime or in the future).

Author


Do we need to drop the existing mapping?

Not strictly?

For the existing approach, it's possible to avoid polling mappings to figure out the address by:

  • Checking that published_at_ns can be read and hasn't changed and/or
  • Hooking on prctl calls

Reusing the mapping instead of dropping it does not conflict with the above approaches, but... I think it would complicate concurrency control on the reader. That is, having this invariant allows the reader to know that while the mapping is up, the payload is valid and consistent as far as the writer is concerned.

Member

@christos68k christos68k Dec 8, 2025


If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).

But the point I'm making is more general: The current update protocol mentions that the "previous mapping should be removed" before publishing a new one. If we assume that most implementors abide by this, then the overhead of parsing mappings will be there. For a reader like the eBPF profiler that may have to manage hundreds of processes as a worst case, the overhead of constantly hitting /proc could be significant.

Can we examine the concurrency control edge cases in more detail? It should be possible to provide the same guarantees as now while keeping a fixed mapping.

We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)

Author

@ivoanjo ivoanjo Dec 8, 2025


If we allow the mapping address to change, then checking published_at_ns is not reliable with the existing protocol (for example, we'd need to overwrite with zeroes).

I think it is! Let me try to convince you ;)

After a context gets dropped one of two things happens:

a) The mapping becomes invalid. This would make reads return an error, which would be a clear indication of not valid.

b) A new mapping (OTel or not) gets put in its place. Reads to the old location of published_at_ns would return whatever's there now. Note that this would not be published_at_ns, because the kernel zeroes out memory before mapping it (e.g. this is not regular malloc/free), and thus I don't think it's possible for leftover garbage to exist to confuse the reader. (Edit: And thus the reader will know what it read is not valid)

For a reader like eBPF profiler that may have to manage hundreds of processes as a worst case, that overhead of constantly hitting /proc could be significant.

The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking prctl.

(The reader may even choose to do time-based caching, e.g. read the context and reuse it for the next N seconds/minutes, rather than trying to always have the latest up-to-date version if it wants to even save more reads)

We should avoid relying on hooking prctl IMO (it also doesn't solve the constant /proc access problem if most implementors change the mapping on every update)

To be clear, I believe prctl is not needed at all to be able to follow invalidation of existing contexts/creation of new ones, it's a fully optional possibility.

We could even completely omit references to hooking on prctl in the current spec -- but I think it's an interesting feature to document in the spec for readers that want to use it.
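
For illustration, a hedged C sketch of the cheap re-check described above (the helper name and the `published_at_ns` offset are assumptions based on the header layout discussed earlier in the thread):

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Once the reader has discovered a context, it can cheaply re-check it by
 * reading only published_at_ns at the cached remote address (offset 12 in
 * the assumed packed header). A failed read means the mapping was unmapped;
 * a zero or changed value means the cached context is no longer the one that
 * was read before, so /proc/<pid>/maps needs to be re-scanned. */
static int context_still_valid(pid_t pid, uint64_t remote_hdr_addr, uint64_t last_seen_ns) {
    uint64_t now_ns = 0;
    struct iovec local  = { .iov_base = &now_ns, .iov_len = sizeof(now_ns) };
    struct iovec remote = { .iov_base = (void *)(uintptr_t)(remote_hdr_addr + 12),
                            .iov_len  = sizeof(now_ns) };
    if (process_vm_readv(pid, &local, 1, &remote, 1, 0) != (ssize_t)sizeof(now_ns))
        return 0;
    return now_ns != 0 && now_ns == last_seen_ns;
}
```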

Member

@christos68k christos68k Dec 8, 2025


The strategy above means that once a process context is detected, we can continue to detect its presence by reading the same address in a cheap way, even without hooking prctl.

True, that speeds up detection but the overhead of parsing mappings before fetching an update is still there. It also makes for a more complicated update protocol, maybe limiting update operation frequency.

Advantages for keeping the mapping fixed:

  1. Simpler publisher logic
  2. Simpler reader logic
  3. Minimal (non-existent after mapping is first detected in the reader) accessing and processing /proc/ overhead
  4. Scales to thousands of processes
  5. Scales to higher frequency updates, minimizing possibility of stale data

Can we clarify the disadvantages?

Member


Yeah I agree this or something like this can be made to work. (I didn't quite get the reference to mprotect? Do you mean just "I'm omitting the mprotect parts?")

Yeah, to keep the focus on the lock-free data exchange part.

Member


I got some numbers to ground my assumptions in actuality, and yes, it does seem like the extra condition will make a difference (so flipping the RO status, or a different way to cut down on the number of mappings, would be needed):

We did experiment at the beginning with making the permissions no-read, no-write, execute, which is a really odd combination (and thus very, very rare), and almost got away with it, but discovered that there are two paths for reading memory from another process in the Linux kernel, and process_vm_readv actually goes through the path that respects page permissions, which made this approach a bit more awkward.

Another option that avoids mprotect can use MAP_FIXED and an address generating scheme based on a deterministic pattern. We have terabytes of mostly unused address space to play with.

Author

@ivoanjo ivoanjo Dec 11, 2025


I've been staring at my notebook for a while and maybe have an idea for making "reusing the mapping" work.

Considering the current "publication protocol" in the spec:

Publishing the context should follow these steps:

  1. Drop existing mapping: If a previous context was published, unmap/free it
  2. Allocate new mapping: Create a 2-page anonymous mapping via mmap() (These pages are always zeroed by Linux)
  3. Prevent fork inheritance: Apply madvise(..., MADV_DONTFORK) to prevent child processes from inheriting stale data
  4. Encode payload: Serialize the payload message using protobuf (storing it either following the header OR in a separate memory allocation)
  5. Write header fields: Populate version, published_at_ns, payload_size, payload
  6. Memory barrier: Use language/compiler-specific techniques to ensure all previous writes complete before proceeding
  7. Write signature: Write OTEL_CTX to the signature field last
  8. Set read-only: Apply mprotect(..., PROT_READ) to mark the mapping as read-only
  9. Name mapping: Use prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ..., "OTEL_CTX") to name the mapping. This step should be done unconditionally, although naming mappings is not always supported by the kernel.

We can document the update protocol as not being "destroy old mapping + create new one", as you suggested. Update: AND, instead, do an in-place update as described below.

To support it, we could go through the publishing steps in reverse. That is, a process that wants to update its mappings goes like this:

i. Undo 9 -- removes name
ii. Undo 8 -- sets memory back to R/W
iii. Undo 7 -- zeroes signature
iv. Barrier, like in 6
v. Undo 5 -- zero fields
vi. Barrier
vii. Then start again from 4 as if this was a new mapping

I believe in this case the reading protocol would not need to change, since it already says

  1. Validate signature and version:

    • Read the header and verify first 8 bytes matches OTEL_CTX
    • Read the version field and verify it is supported (currently 2)
    • If either check fails, skip this mapping
  2. Read payload: Read payload_size bytes starting after the header

  3. Re-read header: If the header has not changed, the read of header + payload is consistent. This ensures there were no concurrent changes to the process context. If the header changed, restart at 1.

A reader that observes the original mapping or the fully updated one will work as expected.

A reader that is trying to locate the mapping will find an invalid mapping because some of the fields of the mapping will be zero, and it's not valid for them to be zero. So it'll skip the mapping.

A reader that already located the mapping and is polling for updates:

  • If it observes the state at step i, it reads the old content correctly
  • If it observes the state at step ii, it reads the old content correctly
  • If it observes the state at step iii/iv, it will find the header to be invalid (zeroed signature) and will know an update is ongoing/the context is not valid
  • If it observes the state at step v/vi, the signature is still invalid
  • As the usual publish protocol proceeds, only after step 7 will the mapping be valid again

Furthermore, we already state that the header gets read twice, and this would make sure that if the reader reads the full old header, and then the new update starts while it's reading the payload, then when the reader looks at the header again it'll see that it's not the same as the old one (fields either have a different value or are zero).

(This approach supports the payload-after-header because the payload only starts being modified after zeroing out the header and thus the reader can tell, when it re-reads the header, that it's no longer valid)

Thoughts?

Member

@christos68k christos68k Dec 12, 2025


Isn't the counter-based example a lot simpler for publish/update/read, as:

  1. We don't need to modify the signature (no zero out and write again cost)
  2. We don't need to modify the name (no remove and write again cost)
  3. We don't need to zero out the fields (timestamp)
  4. We only need to read the counter again (on the publisher side: very cheap 64bit write operation) to establish if fetching the update was complete / not interrupted

Also if memfd_create is viable, we could (in addition to the above) get rid of the mprotect operations. This would give us a protocol that we could also leverage / build upon for (possibly higher-frequency) thread context updates.

Author


Isn't the counter-based example a lot simpler for publish/update/read, as:

  1. We don't need to modify the signature (no zero out and write again cost)
  2. We don't need to modify the name (no remove and write again cost)
  3. We don't need to zero out the fields (timestamp)
  4. We only need to read the counter again (on the publisher side: very cheap 64bit write operation) to establish if fetching the update was complete / not interrupted

Compared to the counter approach, the "do a few things backwards approach":

  1. Allows us to end up with the expected mprotect flags at the end, meaning finding the context is still as cheap as in the current OTEP doc
  2. It still allows hooking on prctl to detect updates
  3. It does not require changes to readers -- the existing spec covers these kinds of operations too

For these reasons, I think the "do a few things backwards approach" fits a bit better, but happy to discuss/flesh it out if you're not convinced.

Also if memfd_create is viable, we could (in addition to the above) get rid of the mprotect operations. This would give us a protocol that we could also leverage / build upon for (possibly higher-frequency) thread context updates.

I'll comment on memfd as an alternative separately, there's a different set of trade-offs for that one.
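
For completeness, a rough reader-side C sketch of the reading protocol quoted above (validate the header, read the payload, re-read the header), assuming the remote header address is already known and the hypothetical packed header layout from earlier in the thread:

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical packed header layout, as sketched earlier in the thread. */
typedef struct __attribute__((packed)) {
    char     signature[8];
    uint32_t version;
    uint64_t published_at_ns;
    uint32_t payload_size;
    uint64_t payload;
} otel_hdr;

static int read_remote(pid_t pid, uint64_t addr, void *buf, size_t len) {
    struct iovec local  = { .iov_base = buf, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)(uintptr_t)addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0) == (ssize_t)len ? 0 : -1;
}

/* Returns a malloc'd copy of the payload (caller frees) or NULL. */
static uint8_t *read_context(pid_t pid, uint64_t hdr_addr, uint32_t *size_out) {
    for (int attempt = 0; attempt < 3; attempt++) {
        otel_hdr before, after;
        if (read_remote(pid, hdr_addr, &before, sizeof(before)) != 0) return NULL;
        if (memcmp(before.signature, "OTEL_CTX", 8) != 0 || before.version != 2) return NULL;

        uint8_t *payload = malloc(before.payload_size);
        if (payload == NULL) return NULL;
        if (read_remote(pid, before.payload, payload, before.payload_size) != 0 ||
            read_remote(pid, hdr_addr, &after, sizeof(after)) != 0 ||
            memcmp(&before, &after, sizeof(before)) != 0) {
            free(payload);   /* header changed mid-read (or became unreadable): retry */
            continue;
        }
        *size_out = before.payload_size;
        return payload;      /* header + payload were read consistently */
    }
    return NULL;
}
```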

felixge pushed a commit to open-telemetry/sig-profiling that referenced this pull request Dec 10, 2025
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
Member

@felixge felixge left a comment


Left a comment, but overall LGTM. Happy to approve once the open discussion threads have been resolved.

5. **Write header fields**: Populate `version`, `published_at_ns`, `payload_size`, `payload`
6. **Memory barrier**: Use language/compiler-specific techniques to ensure all previous writes complete before proceeding
7. **Write signature**: Write `OTEL_CTX` to the signature field last
8. **Set read-only**: Apply `mprotect(..., PROT_READ)` to mark the mapping as read-only
Member


Why does the signature need to be written after the memory barrier? Shouldn't the transition to PROT_READ status be atomic? If that's guaranteed to be ordered after all writes to the map, we should be good?

Author


In practice, it should be as you say. In theory... the mprotect is done only on the mapping, not on the payload, so a given language's memory model might be ambiguous on "if you observe the read-only mapping, will the payload be there too" and thus it seems worth it to slightly over-specify it here?

Not an extremely strong reason, I agree -- we could probably simplify this if needed.


christos68k commented Dec 11, 2025

In light of our discussion here, I ran some more experiments. Listing possible fallbacks/alternatives to read-only mapping for when PR_SET_VMA_ANON_NAME is not available:

  • Deterministic address generation scheme (fast check based on address pattern). Relies on MAP_FIXED_NOREPLACE [Kernel 4.17+] (see here for something based on a similar premise I did a long time ago)
  • memfd_create

Example code for the latter (may need more investigation but it looks promising and should be widely available?).

$ ./memfd
PID: 52591 MMAP at: 0x7fd86f5d5000
Forking...
Child PID: 52592
OTEL reader: OTEL publisher data
7fd86f5d5000-7fd86f5d7000 rw-p 00000000 00:01 6178                       /memfd:OTELCTX (deleted)

EDIT: Using memfd_create and an inline payload, also allows a reader process to (optionally) mmap the target region into its own address space. See updated example code here.

$ ./memfd-mmap 
[Writer] PID: 98784 FD: 3 MMAP at: 0x7fe176430000
Forking...
[Reader] PID: 98785 FD_NUM: 3
[Reader] process_vm_readv: OTEL publisher data
[Reader] mmap: OTEL publisher data
[parent]
7fe176430000-7fe176432000 rw-s 00000000 00:01 6187                       /memfd:OTELCTX (deleted)
[child]
7fe176430000-7fe176432000 r--s 00000000 00:01 6187                       /memfd:OTELCTX (deleted)
[Reader] mmap: OTEL publisher data
[Reader] mmap: OTEL publisher data
[Writer] exit
$ [Reader] mmap: OTEL publisher updated data
[Reader] mmap: OTEL publisher updated data
[Reader] mmap: OTEL publisher updated data
[Reader] exit


ivoanjo commented Dec 15, 2025

In light of our discussion here, I ran some more experiments. Listing possible fallbacks/alternatives to read-only mapping for when PR_SET_VMA_ANON_NAME is not available:

  • memfd_create

Example code for the latter (may need more investigation but it looks promising and should be widely available?).

👋 So funny that you mention memfd 😀. My colleagues at Datadog have actually previously built something close to what this OTEP proposes, using memfd, although in a slightly different way than your gist (of note, not using mmap together with memfd).

The main reasons why we moved away from it for this OTEP were:

  • We were concerned that custom seccomp profiles for containery things can/could block memfd
  • Dealing with forks; although maybe with the "mmap over memfd" that would no longer be an issue (?)
  • Dealing with reading: We actually used /proc/pid/fd to find the context, not /proc/pid/maps
  • Dealing with updates: again since we didn't use "mmap over memfd" updating was different

I think both approaches are quite similar, especially when involving mmap. (E.g. I suspect we could as well mmap the region into the reader with the current approach in the OTEP. I do think we'd need some kind of cleanup mechanism to detect when the owner of a mapping has gone away, as otherwise I suspect the reader will keep the mapping alive?).

So in a way it's more a question of which combination of building blocks we want to use (or mix) 🤔👋:

  • Start from an anonymous mapping or start from a memfd? or mix/fallback from one to another?
  • Use mmap or not
  • How to find the context: name in maps? property of pages? look at fds? (or combination/fallback)
  • Recreate to update or mutate in place
  • If mutating in place, how exactly does that mutation work
  • Dealing with some of the concurrency/forks/some of the other details

We at Datadog spent some time exploring the solution space, and tried to come up with a combination of the above that seemed reasonable, given the constraints (and tried to document that as well).

But yeah, I won't say it's not possible to do any of the above in a slightly different way, especially given most have trade-offs and there's not been a very clear above-the-rest winner on most points. 😅


christos68k commented Dec 15, 2025

But yeah, I won't say it's not possible to do any of the above in a slightly different way, especially given most have trade-offs and there's not been a very clear above-the-rest winner on most points. 😅

I think we can come up with something that's flexible but also remains simple for the simple use-cases. The main advantage of memfd_create is that we won't need a read-only page-permission (or other page-property) fallback, as it's available on all kernel versions we care about. Secondarily, it allows for easy mmap in a reader process (alternative ways to do that are an actual file on the filesystem, which is tricky with containers, or shared memory of some sort, whether System V or POSIX).

To support wakeups instead of polling, without hooking prctl, we can use eventfd or even futex (these are also not tied to memfd_create, can be used with any of the other options we discussed).

I think that the scheme we end up with in OTel should at least meet the following three criteria:

  • As simple as it gets
  • Doesn't recreate the mapping on every update (allows readers to cache mapping address and skip /proc after context is first established)
  • Doesn't rely on hooking prctl for one-to-many reader wakeups (but also, doesn't require polling)

Based on all the options we laid out, I think that's doable.

Optionally (we probably need to expand the scope to thread context to figure out requirements / pick through the following):

  • Allow for reader to mmap the region (also means mapping needs to be MAP_SHARED instead of MAP_PRIVATE [2])
  • Inline payload updates (if we allow mmap this becomes a requirement)
  • Allow for variable (high/low) frequency of updates

[1] Regarding forking, there's a race condition between mmap and madvise(MADV_DONTFORK) which may infrequently manifest, as we don't control the code running inside the publisher process (meaning, we can not avoid a fork taking place after our mmap and before our madvise). However, if we start with a MAP_PRIVATE mapping and only write the fixed header after madvise, we can guarantee that the inherited mapping will never pass verification in a forked process and thus will be skipped by readers.

[2] If we allow mmap, we need to use a MAP_SHARED mapping. AFAIK it's not possible in Linux to start with a private mapping, call madvise, switch to a shared mapping and have madvise take effect for the latter mapping (unmapping the first mapping will destroy the VMA that madvise affected). Instead, we can add a PID field to the fixed header that readers can use during verification to skip the mapping.


christos68k commented Dec 18, 2025

Update for extra context: @ivoanjo and I had a Zoom sync today where we talked about simplifying the current proposal by:

  • Defaulting to memfd_create and having PR_SET_VMA_ANON_NAME as a fallback (@ivoanjo discovered that prctl can still be used with memfd mappings which means we can have a consistent name through both approaches)
  • Removing the search for the mapping based on read-only status
  • Removing the need to flip mapping between read-only and read-write
  • Keeping the mapping fixed in memory instead of recreating it on each update resulting in different address
  • For lock-free updates that maintain payload consistency, use a counter scheme. To avoid introducing an extra field just for the counter, use the existing timestamp (TODO: unixtime isn't strictly monotonic but this shouldn't affect the scheme).
  • Keep prctl as a notification method (it's not strictly needed on the part of the readers which can choose to ignore it and poll at their own frequency)

For future discussion:

  • Clarify if we need/want to allow reader mmap (introduces inline payload requirement)
  • Clarify if we need/want alternative one-to-many userspace-only notification mechanism (e.g. eventfd)
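
For illustration, a rough C sketch of the timestamp-as-counter idea (a seqlock-like pattern), shown for the simple case where the region is directly addressable (e.g. the reader mmaps the memfd); whether the timestamp gets zeroed during updates and how the reader accesses the memory are still open details:

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* Sketch of a single shared region as a reader would see it if it mapped the
 * memfd into its own address space; sizes here are illustrative only. */
typedef struct {
    _Atomic uint64_t published_at_ns;
    uint32_t payload_size;
    uint8_t  payload[1024];
} shared_ctx;

/* Writer: clear the timestamp while updating, write it back only once the
 * payload is consistent again. */
static void update_ctx(shared_ctx *ctx, const uint8_t *data, uint32_t size, uint64_t now_ns) {
    atomic_store_explicit(&ctx->published_at_ns, 0, memory_order_release); /* "update in progress" */
    atomic_thread_fence(memory_order_release);
    memcpy(ctx->payload, data, size);
    ctx->payload_size = size;
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&ctx->published_at_ns, now_ns, memory_order_release);
}

/* Reader: accept the payload only if the timestamp is non-zero and unchanged
 * across the read; `out` is assumed large enough, and the caller retries on -1. */
static int read_ctx(shared_ctx *ctx, uint8_t *out, uint32_t *size_out) {
    uint64_t before = atomic_load_explicit(&ctx->published_at_ns, memory_order_acquire);
    if (before == 0) return -1;                 /* not published yet, or mid-update */
    uint32_t size = ctx->payload_size;
    memcpy(out, ctx->payload, size);
    atomic_thread_fence(memory_order_acquire);
    uint64_t after = atomic_load_explicit(&ctx->published_at_ns, memory_order_acquire);
    if (after != before) return -1;             /* concurrent update: retry */
    *size_out = size;
    return 0;
}
```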

… find mappings

After discussion in the PR and great suggestions/experiments from
@christos68k, the specification has been updated as such:

* Instead of always using an anonymous mapping, try first to
  create a memfd and create a mapping from the memfd.

  If due to security restrictions memfd is not available, fall
  back to an anonymous mapping instead.

* Remove probing as a fallback for when naming a mapping fails.

  Because the name of a memfd also shows up in `/proc/<pid>/maps`,
  we expect that having `memfd` naming as a fallback for when
  `prctl` is not available is enough.

* Drop requirement for 2-page size and read-only permissions on
  the header memory pages.

  These were intended to support the "probing as a fallback for
  naming failure", so they are no longer needed.

* Document "Updating Protocol" for in-place updates to process
  context.

  This allows efficient updates. In particular, it makes it easier
  for the reader to detect updates and avoids reparsing
  `/proc/<pid>/maps` for updates.
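
A minimal C sketch of the updated creation flow described above (memfd first, anonymous mapping as a fallback, naming attempted via prctl); names, flags, and error handling are simplified assumptions rather than the final spec:

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef PR_SET_VMA
#define PR_SET_VMA 0x53564d41
#define PR_SET_VMA_ANON_NAME 0
#endif

/* Try memfd first (the name shows up in /proc/<pid>/maps as
 * "/memfd:OTEL_CTX (deleted)"); if memfd is blocked, e.g. by seccomp, fall
 * back to a plain anonymous mapping. Naming via prctl is attempted
 * unconditionally and its failure is ignored on kernels < 5.17. */
static void *create_ctx_mapping(size_t len) {
    void *mem = MAP_FAILED;
    int fd = memfd_create("OTEL_CTX", MFD_CLOEXEC);     /* glibc >= 2.27 */
    if (fd >= 0 && ftruncate(fd, (off_t)len) == 0) {
        mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }
    if (fd >= 0) close(fd);                             /* the mapping keeps the memfd alive */
    if (mem == MAP_FAILED) {
        mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); /* fallback */
        if (mem == MAP_FAILED) return NULL;
    }
    prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, mem, len, "OTEL_CTX");
    return mem;
}
```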

ivoanjo commented Dec 18, 2025

I've pushed 3caecfb with the changes described/discussed with @christos68k above.

I'm preparing a PR to update the reference C/C++ implementation to match this change; I'll share that one shortly.


ivoanjo commented Dec 18, 2025

The update to the reference C/C++ implementation is in open-telemetry/sig-profiling#34 .

As a final quick note, I (and possibly other folks) will be out for the holidays over the next few weeks, so expect discussions to slow down for a bit until we're back in full force in January!
