lib: add node:metrics module#58874

Open
Qard wants to merge 18 commits into nodejs:main from Qard:metrics-provider

Conversation

@Qard
Member

@Qard Qard commented Jun 28, 2025

cc @nodejs/diagnostics

This is an experiment to see if it would make sense to include a metrics capturing interface in Node.js core. It's only supported with the node: prefix, as this generic name almost certainly conflicts with ecosystem modules. We could keep the prefix requirement or rename it; I'm not set on either option.

This takes ideas from various ecosystem metrics modules along with some ideas from the Swift metrics provider. The idea is to provide a generic metrics interface that publishes diagnostics_channel events, allowing consumers to listen to those events and either report them directly or aggregate them in some way.

@Qard Qard self-assigned this Jun 28, 2025
@Qard Qard added lib / src Issues and PRs related to general changes in the lib or src directory. diagnostics_channel Issues and PRs related to diagnostics channel labels Jun 28, 2025
@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/loaders
  • @nodejs/startup

@nodejs-github-bot nodejs-github-bot added the needs-ci PRs that need a full CI run. label Jun 28, 2025
@Qard Qard force-pushed the metrics-provider branch 2 times, most recently from 644c06d to b9ba7e3 Compare June 28, 2025 18:12
Member

@jasnell jasnell left a comment


I'm generally +1 to adding these APIs, but maybe hanging it off perf_hooks would be better so that it can become a backwards-portable semver-minor. We already have Histogram there.


```js
// Process request...

timer.stop();
```
Member


Just for clarity, how closely are these expected to be modeled against OTel metrics?

Member Author

@Qard Qard Jun 28, 2025


It's more aiming for compatibility with everything else. It supports output in statsd, dogstatsd, prometheus, and graphite formats, all of which have a fairly similar data model. OpenTelemetry is somewhat different though and has a bunch more data requirements, like including metric descriptions (in addition to the name) and unit names. It'd be possible to adapt metric reports, assuming the data OTel requires is present.

It's possible we could expand the required inputs to cover all the requirements of OTel, but that's going to make use of this fairly verbose.

They also have some odd structural requirements, like the spec says you can only make a counter from a meter, so we'd basically have to contort our object model to match theirs, which would have some not great performance implications. 😐
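The line protocols mentioned above are simple text formats. A rough sketch of the same counter sample rendered in each; the metric names and labels are illustrative, and the format strings follow each system's public documentation:

```javascript
'use strict';
// Render one counter sample in several simple line protocols.
function toStatsd(name, value) {
  return `${name}:${value}|c`;                 // statsd counter line
}

function toGraphite(name, value, tsSec) {
  return `${name} ${value} ${tsSec}`;          // graphite plaintext line
}

function toPrometheus(name, value, labels = {}) {
  const pairs = Object.entries(labels)
    .map(([k, v]) => `${k}="${v}"`)
    .join(',');
  return `${name}${pairs ? `{${pairs}}` : ''} ${value}`;
}

console.log(toStatsd('api.requests', 1));                // api.requests:1|c
console.log(toGraphite('api.requests', 1, 1700000000));  // api.requests 1 1700000000
console.log(toPrometheus('api_requests_total', 42, { method: 'GET' }));
```

The similarity of these formats is why a metric-per-line stream model fits them all.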

Member


The requirement to make a counter from a meter is to augment the created counters with instrumentation library information, so that a consumer of the data can tell who produced it.

This PR assumes that all metrics are created in the global namespace. It is a simplification, but I wouldn't call the other pattern odd.

Member Author


Not sure why that has to go through a meter for that though. I'm doing essentially the same thing with metadata folding, but I don't need an extra parent type for that. I just have metric-level metadata and report-level data. (And also per-timer data.)

Member


A metadata object is just a container; there is no definition of what data should be in it, what an attribute like api.request.duration in this PR means, or what unit (milliseconds or seconds) the duration is recorded in.

OpenTelemetry, as an open standard, not only defines the metrics API but also the semantics of the metadata attributes: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/README.md.

With the conventions defined, a structured API is an abstraction over the data model OpenTelemetry provides.

I don't see how this affects performance. All the metrics objects in OpenTelemetry are pre-created before the hot path that collects data points. They are just normal JavaScript objects holding extra information.


They also have some odd structural requirements, like the spec says you can only make a counter from a meter, so we'd basically have to contort our object model to match theirs

IIUC this is a requirement for implementing the OpenTelemetry API, not a requirement of the data model itself. You could produce OTLP metrics directly from the interface here, provided the metric that you are generating can be expressed in the OTel data model.

@Qard Qard force-pushed the metrics-provider branch from b9ba7e3 to 59851ba Compare June 28, 2025 18:39
@Qard
Member Author

Qard commented Jun 28, 2025

I'd have to poke a bit at perf_hooks to figure out how that'd work. I haven't really done anything with it yet. 🤔

I was hoping to eventually have histograms and other things at some point though, so it might make sense.

@Qard Qard force-pushed the metrics-provider branch 2 times, most recently from 5caa4fe to 8294c49 Compare June 28, 2025 19:04
Member

@legendecas legendecas left a comment


I think this module should provide a clear distinction between the producer of the metrics and the consumer of the metrics.

Consumers may use different collection models, like a push model (OpenTelemetry Collector) or a pull model (Prometheus). This PR provides specific output formats, but they still need additional work to properly set up the output.

As a maintainer of OpenTelemetry, I believe that this module should focus on the metrics producer API as a first step.

@Qard
Member Author

Qard commented Jun 28, 2025

The intent is for the delivery model to be left up to the consumer. It can aggregate in-process and do a pull model, or it can just push data out as the data comes in.

@codecov

codecov bot commented Jun 28, 2025

Codecov Report

❌ Patch coverage is 94.48783% with 120 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.74%. Comparing base (8199f9c) to head (e025506).
⚠️ Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
lib/internal/perf/metrics.js 93.58% 117 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #58874      +/-   ##
==========================================
+ Coverage   89.69%   89.74%   +0.04%     
==========================================
  Files         676      678       +2     
  Lines      206693   208870    +2177     
  Branches    39577    39914     +337     
==========================================
+ Hits       185402   187450    +2048     
- Misses      13435    13569     +134     
+ Partials     7856     7851       -5     
Files with missing lines Coverage Δ
lib/internal/util/object_identity.js 100.00% <100.00%> (ø)
lib/perf_hooks.js 100.00% <100.00%> (ø)
lib/internal/perf/metrics.js 93.58% <93.58%> (ø)

... and 49 files with indirect coverage changes


@legendecas
Member

The intent is for the delivery model to be left up to the consumer. It can aggregate in-process and do a pull model, or it can just push data out as the data comes in.

The current PR also involves specific export formats.

This takes ideas from various ecosystem metrics modules along with some ideas from the Swift metrics provider.

As mentioned in the OP of the PR, I believe we should clearly define the scope of the module and leave exporting to userland libraries. This is also what the mentioned Swift metrics provider does: https://github.com/apple/swift-metrics?tab=readme-ov-file#selecting-a-metrics-backend-implementation-applications-only.

@Qard
Member Author

Qard commented Jun 28, 2025

The intent is for the delivery model to be left up to the consumer. It can aggregate in-process and do a pull model, or it can just push data out as the data comes in.

The current PR also involves specific export formats.

It only really has those export formats as an experiment, and because it was trivial to write as those formats are very simple.

This takes ideas from various ecosystem metrics modules along with some ideas from the Swift metrics provider.

As mentioned in the OP of the PR, I believe we should clearly define the scope of the module and leave exporting to userland libraries. This is also what the mentioned Swift metrics provider does: https://github.com/apple/swift-metrics?tab=readme-ov-file#selecting-a-metrics-backend-implementation-applications-only.

Yes, that's the general intent. We may want to remove the current stream formatters. I mostly included those as I'm trying to figure out to what extent the machinery around that should be included. For example the stream factory could be included, and it's exported currently, to provide a simple interface for people to generate output streams, given that most metric formats fit nicely into that metric-per-line model.

Most likely we'll want to drop the format-specific bits, even though they're trivial, because formats can come and go. We may want to skip the stream factory too though, not sure.

I was also thinking a bit about object model versus channel subscription. In the Swift model there is a class with a particular protocol which must be conformed to. We could also do something similar rather than going the channel-based route. It's also possible, as James said, that we could try the perf_hooks base, and that may have some implications on the modelling too.

I think supporting OpenTelemetry too would be nice, though I worry a bit about imposing too many data requirements on the source. The main concern I have with OTel is that its lack of flexibility, due to its significant data requirements, makes it not so usable with data that's not specifically formed to match its expectations. This also tends to make that data inflexible for other uses, as it's so tailored to fit what OTel needs and not necessarily what anyone else needs. 🤔

In any case, this is just an experiment at this stage and inviting feedback to guide this toward being something good.

Member

@mcollina mcollina left a comment


Huge +1 on the idea, this is needed - the prom-client library introduces a bit too much overhead. I'm not sure about this implementation.

I don't really understand how this is supposed to be used.
Usually, in a Prometheus setup, the collector gets all the stats at certain intervals from a given HTTP route. How would the stream-based setup work in this case?

}
```

#### `diagnostics_channel.hasChannel(name)`
Member


This is probably worth moving to its own PR.

Member Author


Probably, yes. Just was a tiny thing I needed here so added that quickly. 🙂

@Qard
Member Author

Qard commented Jun 28, 2025

Yeah, the Prometheus model is a bit different from the other formats I've got in there. I think the pull model can work by just aggregating the data on the other end of the channels and having the pull server read from that; the Prometheus stream is probably a bad example for that. 😅

@mcollina
Member

I think we need to be generic enough to support both models correctly.

@jasnell
Member

jasnell commented Jun 29, 2025

... I'd have to poke a bit at perf_hooks to figure out how that'd work.

It really doesn't need to integrate too closely with anything else in perf_hooks. Just a perf_hooks.metrics would enable this to be semver-minor and able to be backported.

@Qard
Member Author

Qard commented Jun 29, 2025

It really doesn't need to integrate too closely with anything else in perf_hooks. Just a perf_hooks.metrics would enable this to be semver-minor and able to be backported.

Ah, gotcha. Easy enough to do. I thought you meant actually integrating with it. Which maybe actually makes sense at least in the timers case. A bit unclear if anything there makes sense for non-timer metrics though. 🤔

I'll probably poke at this some more maybe next weekend. I think I'm leaning in the direction of detaching this from diagnostics_channel, or at least making that just an optional hook layer, and instead going for an object model to see if I can make this better fit pull models and possibly able to fit better with the data requirements of OpenTelemetry. Like I said, I'm a bit concerned about the verbosity and data requirements OTel might impose, but the wide adoption should be considered. (Though from what I've seen OTel adoption is mainly in tracing with metrics still dominated by other technologies.)

@Qard Qard force-pushed the metrics-provider branch from 432c739 to 5fd8c22 Compare March 22, 2026 17:19
@Qard Qard force-pushed the metrics-provider branch from 3718dd1 to 9a8bf33 Compare March 22, 2026 18:07
- Add missing Promise primordials to diagnostics_channel.js
- Fix JSDoc @returns and @param issues in metrics.js and object_identity.js
- Wrap setImmediate/Promise.then callbacks in mustCall in tests

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@Qard Qard force-pushed the metrics-provider branch from 9a8bf33 to e025506 Compare March 22, 2026 18:10
-->

* `name` {string} The metric name. Must not be empty.
* `options` {Object}
Member


The ability to describe monotonicity is important for backends like Prometheus to correctly process the metric data.

Missing the intent of a metric's monotonicity would leave Prometheus consumers unable to use functions like rate, increase, etc.

The producer of the metric should be responsible for expressing the intent of the monotonicity.
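To illustrate the monotonicity point: the same sample sequence is ambiguous without producer intent. The values below are made up:

```javascript
'use strict';
const samples = [100, 250, 30];

// Read as a gauge, the value simply dropped to 30. Read as a monotonic
// counter, 30 implies a reset (e.g. a process restart), so the true
// increase is 150 + 30 = 180, not a decrease.
function increaseAsCounter(points) {
  let total = 0;
  for (let i = 1; i < points.length; i++) {
    const delta = points[i] - points[i - 1];
    total += delta >= 0 ? delta : points[i];  // negative delta => reset
  }
  return total;
}

console.log(increaseAsCounter(samples));  // 180
```

A consumer can only pick the correct interpretation if something (the producer, or out-of-band configuration) declares which one was intended.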

Member

@legendecas legendecas left a comment


IMO, the key concern around this design is that the metrics API is not enough to describe how the metric should be consumed: metric types (counter, gauge, histogram), monotonicity, whether the value could be an integer or a floating-point value, and advisory bucket boundaries. In addition, the API provides non-essential sugar methods like startTimer. We should distinguish which metrics APIs are necessary to describe a metric that's consumable by major backends in a performant way.

I think the current shape still amalgamates the metrics API and the metrics data consumer implementation. The current PR seems to me to replicate the OpenTelemetry SDK implementation. The general question with this approach is similar to #61907: a Node.js built-in implementation of observability should coordinate with userland libraries.

@Qard
Member Author

Qard commented Mar 22, 2026

IMO, the key concern around this design is that the metrics API is not enough to describe how the metric should be consumed: metric types (counter, gauge, histogram), monotonicity, whether the value could be an integer or a floating-point value, and advisory bucket boundaries.

It describes it at the consumer end, that's the whole point. The producer side just decides the semantics of what the numbers it is recording are about and the consumer decides how to consume that sequence of numbers. It can produce a counter or gauge if it wants, or it can get bucketed into histograms. The producer doesn't need to care about the storage or wire optimization mechanisms you might want to apply to reduce the data, it just cares about saying "here's a data point" to the consumers.
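The split described here can be sketched: the producer emits plain data points, and each consumer chooses its own interpretation. The API shape is hypothetical; only the pattern comes from the discussion:

```javascript
'use strict';
const points = [5, 3, 7];  // producer: "here's a data point", three times

// Counter consumer: treat each point as a delta and sum.
const counter = points.reduce((sum, v) => sum + v, 0);

// Gauge consumer: keep only the most recent value.
const gauge = points[points.length - 1];

// Histogram consumer: bucket values with consumer-chosen boundaries.
const buckets = { le5: 0, le10: 0 };
for (const v of points) {
  if (v <= 5) buckets.le5 += 1;
  if (v <= 10) buckets.le10 += 1;
}

console.log({ counter, gauge, buckets });
```

One input stream, three materializations, with all the interpretation cost on the consuming side.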

In addition, the API provides non-essential sugar methods like startTimer. We should distinguish which metrics APIs are necessary to describe a metric that's consumable by major backends in a performant way.

The timer is just a thin wrapper around producing a single value (a duration), it's a simple and convenient wrapper to provide, and durations are a common thing to care about, so it seems perfectly reasonable to me to have that.

As for which metrics APIs are necessary, the only thing actually necessary is just recording a number attached to a description of its semantics, which is what I have here. I'm not aware of any type of metric expression which can't be derived from this. 🤔

I think the current shape still amalgamates the metrics API and the metrics data consumer implementation.

Why would it not? What use is the producing side without aligning it to a consuming side? You need both for this to be useful.

The current PR seems to me that it replicates the OpenTelemetry SDK implementation.

It includes a few things to be able to integrate cleanly, but the overall design is very different to be able to serve both push and pull models uniformly.

The general question to this approach is similar to #61907 that a Node.js built-in implementation of observability should coordinate with userland libraries.

I am trying to coordinate. That's why this has changed a bunch, because of all the feedback from others about OpenTelemetry and Prometheus needs. If you have a specific point you feel needs to be addressed, please raise it.

The point of this design, as with diagnostics_channel, is to provide mechanisms for producing data without exerting any opinion on how that should be consumed on the production side. The consuming side can then be specialized to whatever interpretation of that data you like. You can aggregate per-metric in whichever way you choose, and collect those aggregates to be reported at whichever trigger point you want.

@legendecas
Member

It describes it at the consumer end, that's the whole point. The producer side just decides the semantics of what the numbers it is recording are about and the consumer decides how to consume that sequence of numbers. It can produce a counter or gauge if it wants, or it can get bucketed into histograms. The producer doesn't need to care about the storage or wire optimization mechanisms you might want to apply to reduce the data, it just cares about saying "here's a data point" to the consumers.

Metric value semantics are essential to describe at the producer end to accurately express intent. Inferring metric properties like monotonicity and value type at the consumer end would result in accuracy loss in the value translations.

If a producer produces both integer and floating-point numbers for a single metric, and a consumer infers it as an integer metric because the first number is an integer, this creates accuracy loss. If a producer only monotonically increments a counter, there is no way for a consumer to infer its monotonicity and export it correctly. Inferring metric semantics from the metric name is not a programmatically optimal metrics API.

These are lessons already learnt in OpenTelemetry API design and specification. The expressiveness of the metric API is critical to lossless, vendor-neutral export.

@legendecas
Member

Assuming #61907 lands the OpenTelemetry tracing support, and given that OpenTelemetry also supports metrics, integrated support for both OTel tracing and metrics would avoid discrepancy and fragmentation.

I'm not advocating including the @opentelemetry/api package as-is now. As @dyladan mentioned in #61907 (comment), the OpenTelemetry JS API @opentelemetry/api is being experimented with an event-based (or diagnostics_channel-based) design as well.

OpenTelemetry's metrics data model defines a similar data pipeline, with the API emitting metric events and the SDK aggregating them:

(data-model diagram)

With OpenTelemetry's vendor-neutral design as prior art, I think we should take advantage of it, and I would agree that we could improve the implementation, like adopting diagnostics_channel.

@Qard
Member Author

Qard commented Mar 22, 2026

Metric value semantics are essential to describe at the producer end to accurately express intent. Inferring metric properties like monotonicity and value type at the consumer end would result in accuracy loss in the value translations.

Why? The metric descriptor communicates that without needing to explicitly enforce a particular interpretation. If the consumer can't infer correctly from the information given that is a bug in the consumer.

If a producer produces both integer and floating-point numbers for a single metric, and a consumer infers it as an integer metric because the first number is an integer, this creates accuracy loss. If a producer only monotonically increments a counter, there is no way for a consumer to infer its monotonicity and export it correctly. Inferring metric semantics from the metric name is not a programmatically optimal metrics API.

I don't see how that accuracy loss would ever happen; the aggregations happen on the original source values, not some clamped output. Nothing would or should ever be inferring a number into an integer prior to aggregation.

There are no counters at the producer level. There's just "here is a number" which for a counter consumer would be interpreted as a delta. I don't understand how monotonicity would be impacted by this. 🤔

These are lessons learnt in OpenTelemetry API design and specification already. The expressiveness of the metric API is critical to a lossless vendor-neutral exportation.

There's nothing about the current design preventing lossless output. How you aggregate is up to you. If you write your consumer in a lossy way that's entirely on you. 🤷

To be clear, this design is explicitly intended to be only about the delivery of values to consumers, leaving it up to the consumers to decide how to process those values. It is intended that a consumer will provide a configuration for each metric they want to consume describing the manner in which they would like to consume it. They can provide their own custom aggregators to do whatever they like with those metrics updates.

The producing side should not dictate how the data is to be interpreted because there are many ways in which it could be interpreted. For example, a gauge and a counter are the same thing from an input perspective, the only difference is how they materialize. By leaving it up to the consumer to make that decision we can support a broader set of use cases.

It is NOT the intent to just ship specifically what OpenTelemetry has decided metrics should look like, because much of the market does not agree. It is ONLY the intent to allow OpenTelemetry to pull data in the structure it expects as one possible interpretation.

@legendecas
Member

legendecas commented Mar 22, 2026

The producing side should not dictate how the data is to be interpreted because there are many ways in which it could be interpreted. For example, a gauge and a counter are the same thing from an input perspective, the only difference is how they materialize. By leaving it up to the consumer to make that decision we can support a broader set of use cases.

The producer should be aware of how the data is consumed to make the consumption accurate. It is a contract between a producer and a consumer that a metric like request_count should only monotonically increment, with integers. A consumer could then just allocate integer storage for it.

Aggregation also involves integers and double values. JavaScript numbers are doubles. Incrementing a number above Number.MAX_SAFE_INTEGER silently loses integer precision, and this breaks the contract between the producer and consumer, creating accuracy loss.
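The precision hazard referred to here is easy to demonstrate: JavaScript numbers are IEEE-754 doubles, exact for integers only up to Number.MAX_SAFE_INTEGER (2**53 - 1), and past that an increment of a plain-number accumulator can be silently lost:

```javascript
'use strict';
const max = Number.MAX_SAFE_INTEGER;  // 9007199254740991

let counter = max + 1;  // 2**53: still representable exactly
counter += 1;           // 2**53 + 1 rounds back down to 2**53

console.log(counter === max + 1);  // true: the increment vanished
```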

Without the producer describing the intent of the metric, leaving the consumer to "guess" it, or hard-coding metric names, is not a programmatically optimal metrics API. We cannot ask every consumer to hard-code every metric from every producer.

@legendecas
Member

legendecas commented Mar 22, 2026

It is NOT the intent to just ship specifically what OpenTelemetry has decided metrics should look like, because much of the market does not agree. It is ONLY the intent to allow OpenTelemetry to pull data in the structure it expects as one possible interpretation.

Regardless of whether this API follows OpenTelemetry or not, I would like to argue that vendor neutrality is a hard requirement for it to be included in Node.js. The API should be balanced to allow adoption by various vendors with accurate, lossless data export.

OpenTelemetry is not a vendor; as an open standard it has been developed and adopted by the vast majority of vendors.

@Qard
Member Author

Qard commented Mar 23, 2026

The producer should be aware of how the data is consumed to make the consumption accurate. It is a contract between a producer and a consumer that a metric like request_count should only monotonically increment, with integers. A consumer could then just allocate integer storage for it.

Why would the producer ever need to be aware of how the data is consumed to make it accurate? It doesn't change what the inputs are in any way; the concern of how to handle the data is only ever on the consuming side. The meta information describing the semantics of the source can easily be communicated through the metric descriptor. I completely disagree that the producer needs to have any control over how the data is consumed. That would make the model less flexible.

Aggregation also involves integers and double values. JavaScript numbers are doubles. Incrementing a number above Number.MAX_SAFE_INTEGER silently loses integer precision, and this breaks the contract between the producer and consumer, creating accuracy loss.

So aggregate with a BigInt then, or do some modulo overflow looping. This is left entirely to the consumer for a reason, so it can handle the data in whichever manner is most appropriate for its constraints. If it cares more about performance than perfect accuracy, it can take plain numbers and accept the risk of drift; this is likely fine if the aggregation is not cumulative. If perfect accuracy is a requirement, it can use BigInt to protect against drift.
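The BigInt option mentioned here is straightforward: a consumer that needs exact counts past 2**53 can keep a BigInt accumulator instead of a plain number, trading some speed for accuracy:

```javascript
'use strict';
let exact = 0n;
function add(delta) {
  exact += BigInt(delta);
}

add(Number.MAX_SAFE_INTEGER);  // 9007199254740991
add(1);
add(1);

console.log(exact);  // 9007199254740993n: no precision loss
```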

Without the producer describing the intent of the metric, leaving the consumer to "guess" it, or hard-coding metric names, is not a programmatically optimal metrics API. We cannot ask every consumer to hard-code every metric from every producer.

The producer does describe intent through how it names the metric, though; no need to "guess" anything. Also, consumers should be specifically selecting the metrics they are interested in. The intent of this API, as with diagnostics_channel, is that library authors should feel empowered to liberally produce metrics data about anything, regardless of whether the end user cares about those metrics, and pay near-zero cost unless there are actual consumers of that metric. For consumers which want to potentially listen to every possible metric there are onMetricCreated and onMetricClosed hooks to decide how to act on each individually seen metric, so the consumer can inspect the metric descriptor and make a decision on how it should consume that metric going forward.
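The "inspect every metric" pattern described here can be sketched as follows. The hook name onMetricCreated comes from the comment; the registry and descriptor shapes are assumptions for illustration, not the PR's actual API:

```javascript
'use strict';
const hooks = [];
const known = [];

// Register a hook; replay metrics created before the hook was added.
function onMetricCreated(fn) {
  hooks.push(fn);
  known.forEach(fn);
}

// Producer side: announce a new metric descriptor to all hooks.
function createMetric(descriptor) {
  known.push(descriptor);
  for (const fn of hooks) fn(descriptor);
  return descriptor;
}

// A consumer inspects each descriptor and decides how to consume it.
const subscribed = [];
onMetricCreated((metric) => {
  if (metric.name.endsWith('.duration')) subscribed.push(metric.name);
});

createMetric({ name: 'api.request.duration' });
createMetric({ name: 'queue.depth' });

console.log(subscribed);  // [ 'api.request.duration' ]
```

Metrics nobody opts into never incur aggregation work, which is the near-zero-cost property described above.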

Regardless of this API to follow OpenTelemetry or not, I would like to argue that vendor-neutrality is a hard requirement for it to be included in Node.js. The API should be balanced to be able for adoption of various vendors with accurate lossless data exportation.

That's specifically why it works this way, because the design you are describing only really works well for OpenTelemetry and not pull-based systems.

OpenTelemetry is not a vendor, but as an open standard has been developed and adopted by vast majority vendors.

Most of those vendors default to different metrics systems though. They only support OTLP as one option of many. Also, OpenTelemetry itself is very much a vendor as it provides its own implementation.

@jasnell
Member

jasnell commented Mar 23, 2026

The discussion here is going a bit back-and-forth without seemingly getting anywhere. What might make it easier here: @legendecas, can you briefly summarize, (a) do you think this PR shouldn't land at all (as in, we shouldn't have a metrics api in core) or (b) is there a specific alternative API you'd like to see us land, or (c) are there specific changes to this proposed API that you'd like to see. It's just not clear from the discussion so far.

Member

@legendecas legendecas left a comment


(a) do you think this PR shouldn't land at all (as in, we shouldn't have a metrics api in core)

I'm not objecting to landing a metrics API in core. The principle is that the metrics API can be neutrally interpreted by different vendors (whoever stores the data in the backend).

(b) is there a specific alternative API you'd like to see us land

OpenTelemetry is a cross-vendor effort to make the metric API equally interpretable, including the API, semantic convention of metrics (names, units, types, etc.).

Repetitively diminishing OpenTelemetry as a standard does not by itself motivate a new standard.

(c) are there specific changes to this proposed API that you'd like to see.

I'm not asking the API to follow OpenTelemetry, but as mentioned, we should acknowledge the cross-vendor effort and make the API neutrally interpretable by different vendors, for profit or not.

The current API has the following issues:

  1. Merging all metric types into a single API that takes a big options bag is not user-friendly. It also does not establish a contract between the producer and consumer about what metric data could be losslessly consumed by different vendors.
  2. Not allowing the producer to specify metric properties like types, monotonicity, or value types would leave some backend vendors unable to consume the data correctly.
  3. The current contract between metric producer and consumer is by metric name, description, etc. This is not a programmatically optimal API for different vendors to consume data from a variety of metric producers; imagine a consumer having to hard-code hundreds of metric aggregations by metric name from redis, sqlite, mysql, etc. This could be addressed by establishing a contract through the metric API.

All of the issues I summarized above were addressed in the cross-vendor OpenTelemetry effort. So I'm still confused that this PR continues to blur the contract, creating consumption barriers.

@jasnell
Member

jasnell commented Mar 23, 2026

For 1 and 2... can you give a concrete example of the kind of API you'd accept here?

@Qard
Member Author

Qard commented Mar 23, 2026

(b) is there a specific alternative API you'd like to see us land

OpenTelemetry is a cross-vendor effort to make the metric API equally interpretable, including the API itself and the semantic conventions for metrics (names, units, types, etc.).

Repeatedly diminishing OpenTelemetry as a standard does not by itself motivate a new standard.

While OpenTelemetry has "won" in the tracing category, the majority of the market still uses other systems for metrics, and a significant portion uses pull-style metrics with Prometheus. Just copying what OpenTelemetry does here doesn't make sense when most of the market does not do that. I've done my best to enable it as one possible destination for metrics data, but alignment with OpenTelemetry specifically can't reasonably be the priority when most users want something else.

It is not my intent to ship an exact replica of the OpenTelemetry Metrics API. It is my intent that it should be able to layer on top of this to gather the data it needs. This is supposed to only provide aggregated snapshots, which the consuming code can then decide how to serialize and ship elsewhere. If they want deeper integration than what is provided out of the box they can write their own custom consumer; that is an intentional capability.

The goal here is to provide a metrics-gathering interface which can be as low-cost as possible to report metrics you may or may not care about. It thus uses a similar model to diagnostics_channel, where it is expected that most metrics will be ignored and therefore discarded without any aggregation effort. We should be doing the minimum work possible before it is determined that anything even cares about that metric.

(c) are there specific changes to this proposed API that you'd like to see.

I'm not asking the API to follow OpenTelemetry, but as mentioned, we should acknowledge the cross-vendor effort and make the API neutrally interpretable by different vendors, commercial or not.

The current API has the following issues:

  1. Merging all metric types into a single API that takes a big options bag is not user-friendly. It also does not establish a contract between producer and consumer about which metric data can be losslessly consumed by different vendors.

In what way is it not user-friendly? The data model provides a descriptive name, a descriptor object with more details, and attributes for even further detail. You can provide whatever context you need to decide what to do with it.

Also, there's nothing at all stopping us from adding use-case-specific abstractions like counter or gauge interfaces. We already do this with the timer interface, and we could easily add others if people feel strongly that they need particular producer API shapes. I just didn't want to make this more complicated than it needs to be right now.
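To illustrate, such an abstraction could be layered over a single generic reporting call without changing the underlying model. A sketch, where `record` and the `Counter` class are hypothetical stand-ins for illustration, not the actual node:metrics interface:

```javascript
'use strict';
// Hypothetical stand-in for a generic metrics-reporting call that publishes
// { name, value, descriptor, attributes } to whatever consumer is listening.
const reported = [];
function record(name, value, descriptor = {}, attributes = {}) {
  reported.push({ name, value, descriptor, attributes });
}

// A use-case-specific counter abstraction layered on the generic call. The
// descriptor carries type metadata so consumers need not guess from the name.
class Counter {
  constructor(name, descriptor = {}) {
    this.name = name;
    this.descriptor = { type: 'counter', monotonic: true, ...descriptor };
    this.value = 0;
  }
  add(n = 1, attributes = {}) {
    this.value += n;
    record(this.name, this.value, this.descriptor, attributes);
  }
}

const requests = new Counter('http.requests', { unit: 'count' });
requests.add();
requests.add(2, { route: '/users' });
```

The convenience wrapper costs nothing for producers who don't want it, since it reduces to the same generic call underneath.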

  2. Not allowing the producer to specify metric properties like type, monotonicity, or value type leaves some backend vendors unable to consume the data correctly.

There is nothing at all stopping one from sharing such metadata. If you feel something is necessary there, please clearly define the exact properties required.

  3. The current contract between metric producer and consumer is by metric name, description, etc. This is not a programmatically optimal API for different vendors consuming data from a variety of metric producers; imagine a consumer having to hard-code hundreds of metric aggregations by metric name from redis, sqlite, mysql, etc. This could be addressed by establishing the contract in the metric API itself.

You don't need to hard-code anything though. As I said before, onMetricCreated and onMetricClosed already exist for this purpose. You can inspect the descriptor to decide how each metric should be processed. If you want additional detail in the descriptors then please specify exactly what that detail is.
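For instance, a consumer could dispatch on descriptor fields rather than on names. In this sketch, `aggregate` is a hypothetical local stand-in for what a consumer's metric handler might do, not the actual node:metrics API:

```javascript
'use strict';
// Hypothetical consumer-side dispatch: handling is chosen from descriptor
// metadata (type, unit, ...), so no metric names are hard-coded anywhere.
const sums = new Map();   // running totals for counter-like metrics
const latest = new Map(); // last-seen values for gauge-like metrics

function aggregate(metric) {
  switch (metric.descriptor.type) {
    case 'counter':
      sums.set(metric.name, (sums.get(metric.name) ?? 0) + metric.value);
      break;
    case 'gauge':
      latest.set(metric.name, metric.value);
      break;
    default:
      break; // unknown types are simply ignored
  }
}

// Works for any producer that supplies a descriptor, regardless of name.
aggregate({ name: 'db.queries', value: 5, descriptor: { type: 'counter' } });
aggregate({ name: 'db.queries', value: 3, descriptor: { type: 'counter' } });
aggregate({ name: 'pool.size', value: 10, descriptor: { type: 'gauge' } });
```

Any producer, whether redis, sqlite, or mysql, is handled by the same few branches, as long as the descriptor carries the needed metadata.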

All of the issues summarized above were addressed in the cross-vendor OpenTelemetry effort. So I'm still confused that this PR continues to blur the contract, creating consumption barriers.

As already expressed previously, OpenTelemetry is only one possible implementation of the many different styles people are asking for. You seem to keep going back and forth between stating that we should be agnostic of metrics systems and implying that we should just do what OpenTelemetry does, which is very much not agnostic. It's incredibly confusing. 😕
