Conversation

Review requested. Force-pushed from 644c06d to b9ba7e3.
jasnell left a comment:
I'm generally +1 to adding these APIs, but maybe hanging it off perf_hooks would be better so that it can become a backwards-portable semver-minor. We already have Histogram there.
doc/api/metrics.md (outdated)

```js
// Process request...
timer.stop();
```
Just for clarity, how closely are these expected to be modeled against OTel metrics?
It's more aiming for compatibility with everything else. It supports output in statsd, dogstatsd, prometheus, and graphite formats, all of which have a fairly similar data model. OpenTelemetry is somewhat different though and has a bunch more data requirements, like including metric descriptions (in addition to the name) and unit names. It'd be possible to adapt metric reports, assuming the data OTel requires is present.
It's possible we could expand the required inputs to cover all the requirements of OTel, but that's going to make use of this fairly verbose.
They also have some odd structural requirements, like the spec says you can only make a counter from a meter, so we'd basically have to contort our object model to match theirs, which would have some not great performance implications. 😐
The requirement to make a counter from a meter is to augment the counters created with instrumentation library information, so that a consumer of the data can tell who produced it.
This PR assumes that all metrics are created in the global namespace. That is a simplification, but I wouldn't call the other pattern odd.
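To make the structural difference concrete, here is an illustrative sketch (plain JavaScript, not the real OpenTelemetry SDK and not this PR's API; all names are hypothetical) of the two object models being discussed: instruments created from a meter that stamps them with library information, versus metrics created directly in a global namespace with metadata attached.

```javascript
// Illustrative only; every name here is hypothetical.

// OpenTelemetry-style shape: instruments are created from a meter,
// which stamps them with instrumentation library information.
class Meter {
  constructor(libraryName) {
    this.libraryName = libraryName;
  }
  createCounter(name) {
    return {
      name,
      origin: this.libraryName, // consumers can tell who produced the data
      value: 0,
      add(n) { this.value += n; },
    };
  }
}

// Flat shape: metrics are created directly in a global namespace and
// carry their metadata themselves, with no parent meter type.
function createMetric(name, metadata = {}) {
  return { name, metadata, value: 0, record(n) { this.value += n; } };
}

const otelStyle = new Meter('my-lib').createCounter('requests');
otelStyle.add(2);

const flatStyle = createMetric('requests', { library: 'my-lib' });
flatStyle.record(2);
```

Both shapes can carry the same origin information; the disagreement is about whether a parent type is needed to attach it.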
Not sure why that has to go through a meter for that though. I'm doing essentially the same thing with metadata folding, but I don't need an extra parent type for that. I just have metric-level metadata and report-level data. (And also per-timer data.)
A metadata object is just a container: there is no definition of what data should be in it, what an attribute like api.request.duration in this PR means, or what unit (milliseconds or seconds) the duration is recorded in.
OpenTelemetry, as an open standard, not only defines the API of the metrics but also defines the semantics of the metadata attributes: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/README.md.
With the conventions defined, a structured API is an abstraction over the data model OpenTelemetry provides.
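As an illustration of the point about conventions, here is a hypothetical pair of descriptor objects. The metric name and unit in the second follow the OpenTelemetry semantic conventions; the object shape itself is made up for this sketch.

```javascript
// Hypothetical descriptor objects for illustration only.
const adHoc = {
  name: 'api.request.duration', // unit ambiguous: milliseconds? seconds?
};

const semconv = {
  name: 'http.server.request.duration', // name defined by the conventions
  unit: 's',                            // the conventions fix the unit too
  description: 'Duration of HTTP server requests.',
};
```

With the second descriptor, any vendor can interpret the number stream identically without guessing.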
I don't see how this affects performance. All the metrics objects in OpenTelemetry are pre-created before the hot path that collects data points. They are just normal JavaScript objects holding extra information.
They also have some odd structural requirements, like the spec says you can only make a counter from a meter, so we'd basically have to contort our object model to match theirs
IIUC this is a requirement for implementing the OpenTelemetry API, not a requirement of the data model itself. You could produce OTLP metrics directly from the interface here, provided the metric that you are generating can be expressed in the OTEL data model.
I'd have to poke a bit at perf_hooks to figure out how that'd work. I haven't really done anything with it yet. 🤔 I was hoping to eventually have histograms and other things too, so it might make sense.
Force-pushed from 5caa4fe to 8294c49.
legendecas left a comment:
I think this module should provide a clear distinction between producers of the metrics and consumers of the metrics.
Consumers may use different collection models, like a push model (OpenTelemetry Collector) or a pull model (Prometheus). This PR provides specific output formats, but they still need additional work to properly set up the output.
As a maintainer of OpenTelemetry, I believe that this module should focus on the metrics producer API as a first step.
The intent is for the delivery model to be left up to the consumer. It can aggregate in-process and use a pull model, or it can just push data out as it comes in.
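A rough sketch of that split, using a hypothetical publish/subscribe surface: the producer emits values the same way regardless, a push-style consumer forwards each data point as it arrives, and a pull-style consumer aggregates in-process for later scraping.

```javascript
// Hypothetical publish/subscribe surface, for illustration only.
const subscribers = [];
function publish(value) {
  for (const fn of subscribers) fn(value);
}

// Push-style consumer: ships each data point out as it arrives.
const pushed = [];
subscribers.push((v) => pushed.push(v));

// Pull-style consumer: aggregates in-process; a scrape reads the total.
let total = 0;
subscribers.push((v) => { total += v; });
const scrape = () => total;

// The producer does the same thing either way: it just reports values.
publish(3);
publish(4);
```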
Codecov Report ❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main   #58874      +/-   ##
==========================================
+ Coverage   89.69%   89.74%   +0.04%
==========================================
  Files         676      678       +2
  Lines      206693   208870    +2177
  Branches    39577    39914     +337
==========================================
+ Hits       185402   187450    +2048
- Misses      13435    13569     +134
+ Partials     7856     7851       -5
```
The current PR also includes specific exportation formats.
As mentioned in the OP of the PR, I believe we should clearly define the scope of the module and leave exportation to userland libraries. This is also how the mentioned Swift metrics provider does it: https://github.com/apple/swift-metrics?tab=readme-ov-file#selecting-a-metrics-backend-implementation-applications-only.
It only really has those export formats as an experiment, and because they were trivial to write, as those formats are very simple.
Yes, that's the general intent. We may want to remove the current stream formatters. I mostly included those because I'm trying to figure out to what extent the machinery around them should be included. For example, the stream factory could be included (it's exported currently) to provide a simple interface for people to generate output streams, given that most metric formats fit nicely into that metric-per-line model. Most likely we'll want to drop the format-specific bits, even though they're trivial, because formats come and go. We may want to skip the stream factory too, I'm not sure.

I was also thinking a bit about an object model versus channel subscription. In the Swift model there is a class with a particular protocol which must be conformed to. We could do something similar rather than going the channel-based route. It's also possible, as James said, that we could try the perf_hooks base, and that may have some implications on the modelling too.

I think supporting OpenTelemetry would be nice too, though I worry a bit about imposing too many data requirements on the source. The main concern I have with OTel is that its lack of flexibility, due to its significant data presence requirements, makes it not so usable with data that's not specifically formed to match its expectations. That also tends to make the data inflexible for serving other uses, as it's so tailored to fit what OTel needs and not necessarily what anyone else needs. 🤔

In any case, this is just an experiment at this stage, inviting feedback to guide it toward being something good.
mcollina left a comment:
Huge +1 on the idea; this is needed, as the prom-client library introduces a bit too much overhead. I'm not sure about this implementation though.
I don't really understand how this is supposed to be used.
Usually, in a Prometheus setup, the collector gets all the stats at certain intervals from a given HTTP route. How would the stream-based setup work in this case?
doc/api/diagnostics_channel.md (outdated)

```
#### `diagnostics_channel.hasChannel(name)`
```
This is probably worth moving to its own PR.
Probably, yes. It was just a tiny thing I needed here, so I added it quickly. 🙂
Yeah, the Prometheus model is a bit different from the other formats I've got in there. I think the pull model can work by just aggregating the data on the other end of the channels and having the pull server read from that; the Prometheus stream is probably a bad example for that. 😅
I think we need to be generic enough to support both models correctly.
It really doesn't need to integrate too closely with anything else in …
Ah, gotcha. Easy enough to do. I thought you meant actually integrating with it, which maybe actually makes sense, at least in the timers case. It's a bit unclear if anything there makes sense for non-timer metrics though. 🤔

I'll probably poke at this some more next weekend. I think I'm leaning in the direction of detaching this from diagnostics_channel, or at least making that just an optional hook layer, and instead going for an object model, to see if I can make this better fit pull models and possibly the data requirements of OpenTelemetry. Like I said, I'm a bit concerned about the verbosity and data requirements OTel might impose, but the wide adoption should be considered. (Though from what I've seen, OTel adoption is mainly in tracing, with metrics still dominated by other technologies.)
```
* `name` {string} The metric name. Must not be empty.
* `options` {Object}
```
The ability to describe monotonicity is important for backends like Prometheus to correctly process the metric data.
Missing the intent of a metric's monotonicity would leave Prometheus consumers unable to use functions like rate, increase, etc.
The producer of the metric should be responsible for expressing the intent of the monotonicity.
IMO, the key concern with this design is that the metrics API is not enough to describe how the metric should be consumed: metric types (counter, gauge, histogram), monotonicity, whether the value can be an integer or a floating-point value, and advisory bucket boundaries. In addition, the API provides non-essential sugar methods like startTimer. We should distinguish which metrics APIs are necessary to describe a metric that's consumable by major backends in a performant way.
I think the current shape still amalgamates the metrics API and the metrics data consumer implementation. The current PR seems to me to replicate the OpenTelemetry SDK implementation. The general question with this approach is similar to #61907: a Node.js built-in implementation of observability should coordinate with userland libraries.
It describes it at the consumer end; that's the whole point. The producer side just decides the semantics of the numbers it is recording, and the consumer decides how to consume that sequence of numbers. It can produce a counter or gauge if it wants, or it can get bucketed into histograms. The producer doesn't need to care about the storage or wire optimization mechanisms you might want to apply to reduce the data; it just cares about saying "here's a data point" to the consumers.
The timer is just a thin wrapper around producing a single value (a duration); it's a simple and convenient wrapper to provide, and durations are a common thing to care about, so it seems perfectly reasonable to me to have that. As for which metrics APIs are necessary: the only thing actually necessary is recording a number attached to a description of its semantics, which is what I have here. I'm not aware of any type of metric expression which can't be derived from this. 🤔
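To illustrate the claim that metric types can be derived on the consuming side, here is a sketch (all helper names are hypothetical) where one stream of raw numbers feeds a counter view, a gauge view, and a crude histogram view at once:

```javascript
// Hypothetical helpers: one stream of raw numbers, several derived views.
function makeViews() {
  let sum = 0;    // counter view: treat each value as a delta
  let last;       // gauge view: keep the most recent value
  const buckets = { le10: 0, le100: 0, inf: 0 }; // crude histogram view
  return {
    record(v) {
      sum += v;
      last = v;
      if (v <= 10) buckets.le10 += 1;
      else if (v <= 100) buckets.le100 += 1;
      else buckets.inf += 1;
    },
    counter: () => sum,
    gauge: () => last,
    histogram: () => ({ ...buckets }),
  };
}

const views = makeViews();
for (const v of [5, 50, 500]) views.record(v);
```

All three interpretations are computed from the same input sequence; the producer never committed to any of them.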
Why would it not? What use is the producing side without aligning it to a consuming side? You need both for this to be useful.
It includes a few things to be able to integrate cleanly, but the overall design is very different to be able to serve both push and pull models uniformly.
I am trying to coordinate. That's why this has changed a bunch, because of all the feedback from others about OpenTelemetry and Prometheus needs. If you have a specific point you feel needs to be addressed, please raise it. The point of this design, as with diagnostics_channel, is to provide mechanisms for producing data without exerting any opinion on the production side about how it should be consumed. The consuming side can then be specialized to whatever interpretation of the data you like. You can aggregate per metric in whichever way you choose, and collect those aggregates to be reported at whichever trigger point you want.
Metric value semantics are essential to describe at the producer end to accurately express intent. Inferring metric properties like monotonicity and value type at the consumer end results in accuracy loss in the value translations. If a producer produces both integer and floating-point numbers for a single metric, and a consumer infers it as an integer metric because the first number is an integer, that creates accuracy loss. If a producer only monotonically increments a counter, there is no way for a consumer to infer its monotonicity and export it correctly. Inferring the metric data from the metric name is not a programmatically optimal metric API. These are lessons already learnt in OpenTelemetry API design and specification. The expressiveness of the metric API is critical to lossless vendor-neutral exportation.
Assuming #61907 lands the OpenTelemetry tracing support, and OpenTelemetry also supports metrics, integral support for both OTel tracing and metrics would avoid discrepancy and separation. I'm not advocating including the @opentelemetry/api package as-is now. As @dyladan mentioned in #61907 (comment), the OpenTelemetry JS API @opentelemetry/api is being experimented with an event-based … The OpenTelemetry metrics data model defines a similar data pipeline, with the API emitting metric events and the SDK aggregating the data events. With OpenTelemetry's vendor-neutral design as existing art, I think we should take advantage of it, and I would agree that we could improve the implementation, like adopting …
Why? The metric descriptor communicates that without needing to explicitly enforce a particular interpretation. If the consumer can't infer correctly from the information given, that is a bug in the consumer.
I don't see how that accuracy loss would ever happen; the aggregations happen on the original source values, not some clamped output. Nothing would or should ever be coercing a number into an integer prior to aggregation. There are no counters at the producer level. There's just "here is a number", which a counter consumer would interpret as a delta. I don't understand how monotonicity would be impacted by this. 🤔
There's nothing about the current design preventing lossless output. How you aggregate is up to you. If you write your consumer in a lossy way, that's entirely on you. 🤷

To be clear, this design is explicitly intended to be only about the delivery of values to consumers, leaving it up to the consumers to decide how to process those values. It is intended that a consumer will provide a configuration for each metric they want to consume, describing the manner in which they would like to consume it. They can provide their own custom aggregators to do whatever they like with those metric updates.

The producing side should not dictate how the data is to be interpreted, because there are many ways in which it could be interpreted. For example, a gauge and a counter are the same thing from an input perspective; the only difference is how they materialize. By leaving it up to the consumer to make that decision, we can support a broader set of use cases. It is NOT the intent to just ship specifically what OpenTelemetry has decided metrics should look like, because much of the market does not agree. It is ONLY the intent to allow OpenTelemetry to pull data in the structure it expects, as one possible interpretation.
The producer should be aware of how the data is consumed to make the consumption accurate. It is a contract between a producer and a consumer that if a metric of … Aggregation also involves integers and double values. JavaScript numbers are doubles. Incrementing a number above … Without the producer describing the intent of the metric, leaving the consumer to "guess" it, or hard-coding metric names, is not a programmatically optimal metrics API. We cannot ask every consumer to hard-code every metric in every producer.
Regardless of whether this API follows OpenTelemetry or not, I would like to argue that vendor neutrality is a hard requirement for it to be included in Node.js. The API should be balanced to allow adoption by various vendors with accurate, lossless data exportation. OpenTelemetry is not a vendor, but an open standard that has been developed and adopted by the vast majority of vendors.
Why would the producer ever need to be aware of how the data is consumed for it to be accurate? It doesn't change what the inputs are in any way; the concern of how to handle the data is only ever on the consuming side. The meta information describing the semantics of the source can easily be communicated through the metric descriptor. I completely disagree that the producer needs to have any control over how the data is consumed. That would make the model less flexible.
So aggregate with a BigInt then, or do some modulo overflow looping. This is left entirely to the consumer for a reason, so it can handle the data in whichever manner it deems most appropriate for its constraints. If it cares more about performance than perfect accuracy, it can take plain numbers and accept the risk of drift; this is likely fine if the aggregation is not cumulative. If perfect accuracy is a requirement, it can use BigInt to protect against drift.
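A small sketch of the drift in question: a plain-number counter silently stops incrementing once it passes Number.MAX_SAFE_INTEGER, while a BigInt aggregate stays exact.

```javascript
// Plain-number counter: increments past 2**53 are silently lost,
// because JavaScript numbers are IEEE 754 doubles.
let fast = Number.MAX_SAFE_INTEGER;

// BigInt counter: arbitrary precision, never drifts.
let exact = BigInt(Number.MAX_SAFE_INTEGER);

for (let i = 0; i < 3; i++) {
  fast += 1;   // after the first increment, these additions round away
  exact += 1n; // every increment is preserved
}
```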
The producer does describe intent, through how you name the metric. No need to "guess" anything. Also, consumers should be specifically selecting the metrics they are interested in. The intent of this API, as with diagnostics_channel, is that library authors should feel empowered to liberally produce metrics data about anything, regardless of whether the end user cares about those metrics, and pay near-zero cost unless there are actual consumers of that metric. For consumers which want to potentially listen to every possible metric there is …
That's specifically why it works this way: the design you are describing only really works well for OpenTelemetry and not for pull-based systems.
Most of those vendors default to different metrics systems though. They only support OTLP as one option of many. Also, OpenTelemetry itself is very much a vendor, as it provides its own implementation.
The discussion here is going a bit back and forth without seemingly getting anywhere. What might make it easier: @legendecas, can you briefly summarize, (a) do you think this PR shouldn't land at all (as in, we shouldn't have a metrics API in core), or (b) is there a specific alternative API you'd like to see us land, or (c) are there specific changes to this proposed API that you'd like to see? It's just not clear from the discussion so far.
(a) do you think this PR shouldn't land at all (as in, we shouldn't have a metrics api in core)
I don't object to landing a metric API in core. The principle is that the metric API can be neutrally interpreted by different vendors (who store the data in the backend).
(b) is there a specific alternative API you'd like to see us land
OpenTelemetry is a cross-vendor effort to make the metric API equally interpretable, including the API and the semantic conventions of metrics (names, units, types, etc.).
Repeatedly diminishing OpenTelemetry as a standard does not by itself motivate a new standard.
(c) are there specific changes to this proposed API that you'd like to see.
I'm not asking the API to follow OpenTelemetry, but as mentioned, we should acknowledge the cross-vendor effort and make the API neutrally interpretable by different vendors, for-profit or not.
The current API has the following issues:
- Merging all metric types into a single API that takes a big options bag is not user-friendly. It also does not establish a contract between the producer and consumer about what metric data can be losslessly consumed by different vendors.
- Not allowing the producer to specify metric properties like types, monotonicity, or value types would leave some backend vendors unable to consume the data correctly.
- The current contract between metric producer and consumer is by metric name, description, etc. This is not a programmatically optimal API for different vendors to consume data from a variety of metric producers; imagine a consumer having to hard-code hundreds of metric aggregations by metric name from redis, sqlite, mysql, etc. This could be addressed by establishing a contract in the metric API.
All of the issues I summarized above were addressed in the cross-vendor OpenTelemetry effort. So I'm still confused that the PR continues to blur the contract, creating consumption barriers.
For 1 and 2... can you give a concrete example of the kind of API you'd accept here?
While OpenTelemetry has "won" in the tracing category, the majority of the market still uses other systems for metrics, and a significant portion uses pull-style metrics with Prometheus. Just copying what OpenTelemetry does here doesn't make sense when most of the market does not do that. I've done my best to enable it as one possible destination for metrics data, but alignment with OpenTelemetry specifically can't reasonably be the priority when most users want something else.

It is not my intent to ship an exact replica of the OpenTelemetry Metrics API. It is my intent that it should be able to layer over top of this to gather the data it needs. This is supposed to only provide aggregated snapshots which the consuming code can then decide how to serialize and ship elsewhere. If they want deeper integration than what is provided out of the box, they can write their own custom consumer; that is an intentional capability.

The goal here is to provide a metrics gathering interface which can report metrics you may or may not care about at as low a cost as possible. It is thus using a similar model to diagnostics_channel, where it is expected that most metrics will be ignored and therefore discarded without any aggregation effort. We should be doing the minimum work possible before it is determined that anything even cares about a given metric.
In what way is it not user-friendly? The data model provides a descriptive name, a descriptor object with more details, and attributes for even further detail. You can provide whatever context you need to decide what to do with it. Also, there's nothing at all stopping us from having use-case-specific abstractions like counter or gauge interfaces. We already do this with the timer interface, and we could easily add others if people feel strongly that they need particular producer API shapes. I just didn't want to make this more complicated than it needs to be right now.
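As a sketch of what such sugar could look like (all names hypothetical), counter and gauge interfaces can be thin wrappers over a single generic record primitive:

```javascript
// Hypothetical names throughout: sugar interfaces layered over one
// generic record() primitive.
const recorded = [];
function record(name, value) {
  recorded.push([name, value]);
}

// Counter sugar: each increment is just a recorded delta.
function createCounter(name) {
  return { increment(n = 1) { record(name, n); } };
}

// Gauge sugar: each set is just a recorded value.
function createGauge(name) {
  return { set(v) { record(name, v); } };
}

const requests = createCounter('requests');
requests.increment();
requests.increment(2);

const queueDepth = createGauge('queue.depth');
queueDepth.set(7);
```

The wrappers express producer intent ergonomically while the underlying delivery path stays a single uniform primitive.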
There is nothing at all stopping one from sharing such metadata. If you feel something is necessary there, then please clearly define the exact properties required.
You don't need to hard-code anything though. As I said before, …
As already expressed previously, OpenTelemetry is only one possible implementation of the many different styles people are asking for. You seem to keep going back and forth between stating we should be agnostic of metrics systems and then implying we should just do what OpenTelemetry does, which is very much not agnostic. It's incredibly confusing. 😕

cc @nodejs/diagnostics
This is an experiment to see if it would make sense to include a metrics capturing interface in Node.js core. It's only supported with the node: prefix, as it most definitely conflicts with ecosystem modules with this generic name. We could leave the prefix requirement or rename; I'm not too set on either way. This takes ideas from various ecosystem metrics modules along with some ideas from the Swift metrics provider. The idea is to provide a generic metrics interface which publishes diagnostics_channel events, allowing consumers to listen to those events and either report them directly or aggregate them in some way.