feat: produce OpenTelemetry traces with hs-opentelemetry #3140
develop7 wants to merge 1 commit into PostgREST:main

Conversation
Awesome work! 🔥 🔥
Found this Nix flake that contains an OTel GUI: https://flakestry.dev/flake/github/FriendsOfOpenTelemetry/opentelemetry-nix/1.0.1 I'll try to integrate that once the PR is ready for review.
The recent problem I'm seemingly stuck with is […]. There's a more straightforward […]. It also seems to boil down to the conceptual choice between online and offline trace delivery, i.e. a push vs. a pull model. @steve-chavez @wolfgangwalther @laurenceisla what do you think, guys?
@develop7 Would vault help? It was introduced in #1988; I recall it helped with IORef handling. It's still used in postgrest/src/PostgREST/Auth.hs (lines 160 to 165 @ d2fb67f). I'm still not that familiar with OTel, but the basic idea I had was to store these traces on AppState and export them asynchronously.
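For illustration, the buffering part of that idea could be as small as this — a minimal sketch, where `Span`, `AppState` and all field names are made up rather than PostgREST's actual types:

```haskell
-- A minimal sketch of buffering spans in AppState and exporting them
-- asynchronously. Span, AppState and the field names here are made up
-- for illustration; they are not PostgREST's actual types.
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void)
import Data.IORef (IORef, atomicModifyIORef')

newtype Span = Span { spanName :: String }   -- stand-in for a real OTel span

newtype AppState = AppState { statePendingSpans :: IORef [Span] }

-- called from anywhere in the request pipeline
recordSpan :: AppState -> Span -> IO ()
recordSpan st sp =
  atomicModifyIORef' (statePendingSpans st) (\sps -> (sp : sps, ()))

-- drains and exports the buffered spans on a background thread
startExporter :: AppState -> ([Span] -> IO ()) -> IO ()
startExporter st export = void . forkIO . forever $ do
  threadDelay 5000000  -- flush every 5 seconds
  sps <- atomicModifyIORef' (statePendingSpans st) (\sps -> ([], sps))
  export (reverse sps)
```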
Not only that, you want traces in tests too, for one. The good news is […]
Good call @steve-chavez, thank you for the suggestion. Will try that too.
Since we now have an Observation type (postgrest/src/PostgREST/App.hs lines 170 to 172 @ 229bc77; postgrest/src/PostgREST/Observation.hs lines 15 to 18 @ 229bc77), perhaps we can add some observations for the timings? Also, the Logger is now used like postgrest/src/PostgREST/Logger.hs lines 53 to 54 @ 7c6c056 and postgrest/src/PostgREST/CLI.hs line 50 @ 7c6c056. For OTel, maybe the following would make sense:

    otelState <- OTel.init
    App.run appState (\obs -> Logger.logObservation loggerState obs >> OTel.tracer otelState obs)
Agreed, server timings definitely belong there.
Okay, the PR has been cooking for long enough; let's pull the plug and start small. Let's have it reviewed while I'm fixing the remaining CI failures.
I don't think we depend on this in the current state. And we should certainly not depend on an even-less-maintained fork of the same. So to go forward here, there needs to be some effort put into the upstream package first, to make it usable for us.
A status update: […]
Hm. I looked at your fork. It depends on support for GHC 9.8 in […]. I guess for GHC 9.8 support it's just a matter of time. What about the other issues mentioned above? Were you able to make progress on those?
In my prototype I actually played with replacing the Hasql Session with an effectful-based monad (https://github.com/haskell-effectful/effectful) to make it extensible: https://github.com/mkleczek/hasql-api/blob/master/src/Hasql/Api/Eff/Session.hs#L37 Using it in PostgREST required some mixins usage in Cabal: 29b946e#diff-eb6a76805a0bd3204e7abf68dcceb024912d0200dee7e4e9b9bce3040153f1e1R140 Some work was required in the PostgREST startup/configuration code to set up appropriate effect handlers and middlewares, but the changes were quite well isolated. At the end of the day I think basing your monad stack on an effect library (effectful, cleff etc.) is the way forward, as it makes the solution highly extensible and configurable.
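For a flavor of that, a minimal effectful-style effect and handler might look like this — a sketch; the `Sql` effect and its stub handler are illustrative, not hasql-api's actual API:

```haskell
{-# LANGUAGE DataKinds, FlexibleContexts, GADTs, KindSignatures,
             TypeFamilies, TypeOperators #-}
-- A minimal sketch of the effect-library approach; the Sql effect and
-- its handler are illustrative, not hasql-api's actual API.
import Effectful (Dispatch (Dynamic), DispatchOf, Eff, Effect, (:>))
import Effectful.Dispatch.Dynamic (interpret, send)

data Sql :: Effect where
  RunQuery :: String -> Sql m [String]

type instance DispatchOf Sql = Dynamic

runQuery :: Sql :> es => String -> Eff es [String]
runQuery = send . RunQuery

-- one possible handler; a real one would run the query through hasql,
-- and an instrumented one could wrap it in an OTel span first
runSqlStub :: Eff (Sql : es) a -> Eff es a
runSqlStub = interpret $ \_ (RunQuery q) -> pure ["result of " <> q]
```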
Update: rebased the PR against the latest main.
@mkleczek I've also added […]
As mentioned before, this PR is already too long; we shouldn't be adding more to it... it's a lot to review as it is.
@steve-chavez done, extracted it to #4666
An idea to test this: could we have […]? I'm imagining the test could be similar to how we capture our schema cache snapshots (ref).
@develop7 is this something that can be done here? The manual tests are working right now as mentioned in #3140 (comment), so maybe we could implement this test later on if it's not that feasible to do here? (cc @steve-chavez)
re: tests with an actual OTel collector — I've managed to prototype collector-including tests that compile and fail, and now I'm working on making them pass and be useful.
The test is in; I chose to manage the collector binary from Haskell for a change.
(review thread on the new 22-line JSON fixture file)
Q: What does this test? 🤔
So my expectation was:
- To have a test that has a client doing some requests
- To capture the generated OTel output in a file
Is it possible to get close to that?
The test does one request, generates 5 (6, actually) spans, and makes sure that 1) all the spans of the types specified in this file are present in the output, and 2) they have spans of the specified type as their parent (here all of them are of type request). Since spans are located by their unique spanId and parentSpanId respectively, this test also makes sure spans are linked properly (the core of that check is sketched below); it doesn't test for the lack of stray spans, though, which matches the real-life behavior IMO. This file is kind of sparse, I admit, but that's because the data in traces are either volatile (timestamps, IDs, timings) or fixed through the whole test (library name, SDK name, programming language, etc.). I've attached a sample collector output file for reference, feel free to look it up and suggest more things to test: scratch_4.json
Come to think of it, I'd throw in matching against the SQL query; let me try writing it to the trace.
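The core of that parent-link check, as a sketch — the `SpanRec` record is a simplification of what the collector output actually holds:

```haskell
-- A sketch of the parent-link check: every span that claims a parent
-- must point at a span present in the collector output. SpanRec is a
-- simplified stand-in for the decoded trace data.
import qualified Data.Map.Strict as Map

data SpanRec = SpanRec
  { srSpanId   :: String
  , srParentId :: Maybe String
  }

parentsResolve :: [SpanRec] -> Bool
parentsResolve spans = all ok spans
  where
    byId = Map.fromList [(srSpanId s, s) | s <- spans]
    ok s = maybe True (`Map.member` byId) (srParentId s)
```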
> I've attached a sample collector output file for reference, feel free to look it up and suggest more things to test: scratch_4.json
Many thanks for that. By looking at the file I'm finally starting to understand OTel concepts (what a resource is, what a scope is, etc.).
How do you generate the JSON file? The way I'm thinking to test this:

- Generate the OTel JSON file with some command.
- Use jq to filter out the non-deterministic parts.
- Then compare it to a snapshotted file.

I'm not sure if Haskell is the right tool here, because so far it looks like it results in more code. If we could do it with jq it would be less to maintain, it seems.
> How do you generate the JSON file?
I start an otelcol OpenTelemetry collector, supplying it a configuration file generated by OTelHelper.collectorConfig; start a PostgREST instance wired to said collector instance; then perform requests to the PostgREST instance; shut everything down, both PostgREST and the collector (so traces get flushed for sure); and take the output file containing the traces' JSON from the temporary directory.
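Roughly, the harness amounts to something like this — a sketch only: the paths, the inlined collector config, and managing the processes via typed-process are assumptions, not the PR's actual helpers:

```haskell
-- A sketch of the harness flow above using typed-process; paths, config
-- contents and the postgrest invocation are assumptions.
import Control.Concurrent (threadDelay)
import System.Process.Typed (proc, withProcessTerm)

withCollectedTraces :: FilePath -> IO () -> IO ()
withCollectedTraces traceFile doRequests = do
  let cfgPath = "/tmp/otelcol-test.yaml"
  writeFile cfgPath (collectorConfig traceFile)
  withProcessTerm (proc "otelcol" ["--config", cfgPath]) $ \_collector ->
    withProcessTerm (proc "postgrest" ["test/otel/postgrest.conf"]) $ \_pgrst -> do
      threadDelay 1000000  -- crude wait for both processes to come up
      doRequests
  -- leaving both brackets terminates the processes, flushing traces to traceFile

-- minimal otelcol config: receive OTLP over HTTP, dump traces as JSON to a file
collectorConfig :: FilePath -> String
collectorConfig out = unlines
  [ "receivers: { otlp: { protocols: { http: {} } } }"
  , "exporters: { file: { path: " <> out <> " } }"
  , "service: { pipelines: { traces: { receivers: [otlp], exporters: [file] } } }"
  ]
```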
> If we could do it with jq it would be less to maintain, it seems.
Maybe with optics instead? We do have optics-core & optics-extra through dependencies already, so no build-time hit. Let me try that.
> Maybe with optics instead?
Nope, same amount of code, but harder to read :)
Given there are lots of non-deterministic parts in the OTel output, I don't think snapshot testing is right... it doesn't help that much in understanding.
How about just matching the deterministic parts in Haskell (sketched below)? Things like name = "hs-opentelemetry-instrumentation-wai", telemetry.sdk.name = hs-opentelemetry-sdk, etc.
> I start an otelcol OpenTelemetry collector, supplying it a configuration file generated by OTelHelper.collectorConfig; start a PostgREST instance wired to said collector instance; then perform requests to the PostgREST instance; shut everything down, both PostgREST and the collector (so traces get flushed for sure); and take the output file containing the traces' JSON from the temporary directory.
It would be great if running the test actually persisted that file somewhere so we can read it afterwards (it helps OTel newcomers understand).
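For example, pulling out just the instrumentation-scope names could look like this — a sketch using lens-aeson; the key paths follow the OTLP/JSON layout of the sample file and may need adjusting:

```haskell
{-# LANGUAGE OverloadedStrings #-}
-- A sketch of asserting only the deterministic fields of the trace JSON;
-- the key paths assume the OTLP/JSON layout seen in the sample output.
import Control.Lens ((^..))
import Data.Aeson (Value)
import Data.Aeson.Lens (key, values, _String)
import Data.Text (Text)

scopeNames :: Value -> [Text]
scopeNames doc =
  doc ^.. key "resourceSpans" . values
        . key "scopeSpans" . values
        . key "scope" . key "name" . _String

-- usage: "hs-opentelemetry-instrumentation-wai" `elem` scopeNames traceDoc
```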
> It would be great if running the test actually persisted that file somewhere so we can read it afterwards (it helps OTel newcomers understand)
Good point, it is now.
Discovered a way to relay metrics to the OpenTelemetry collector with otelcol's own https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver; added a rough config draft to the OpenTelemetry page. That should help work around the lack of native metrics in hs-otel for now. @steve-chavez @wolfgangwalther I've realized […]
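A rough sketch of what that relay config could look like — the admin-server port (3001) and the downstream OTLP endpoint are assumptions, not the draft from the docs page:

```yaml
# Sketch: otelcol scrapes PostgREST's Prometheus metrics and forwards
# them over OTLP. Port 3001 (admin server) and the downstream endpoint
# are assumptions.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: postgrest
          scrape_interval: 15s
          static_configs:
            - targets: ["localhost:3001"]   # PostgREST admin server /metrics
exporters:
  otlp:
    endpoint: localhost:4317                # wherever the metrics should go
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [otlp]
```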
Cool, that looks good.
Yes, it would fit better since […]
wolfgangwalther left a comment:
The Nix build seems to need some fixes for some of the opentelemetry dependencies. You can try to mark them unbroken first, then see what the build error is, etc.
I will have a look at the nixpkgs side of this as well.
@wolfgangwalther it's […]. Actually, there's hs-opentelemetry-api v0.3.0.1, released some time ago, that would also let me drop my homemade workaround; I'll try upgrading it first, then. UPD: all right, how do I unbreak a Nix package for us?
@wolfgangwalther I've updated […]
Yeah, I noticed that while working on the Nixpkgs side to unbreak these deps.
Yeah, that's a fine approach for now; we will want to switch back to the Nixpkgs-provided versions eventually, though.
All right, cabal builds are green as well now; I had to add a few VCS deps to the cabal.project.
wolfgangwalther left a comment:
Most things are smaller nits. I guess we need to get this over the line now.
    let tracer' = if configServerOtelEnabled conf && tracerIsEnabled tracer
                    then Just tracer
                    else Nothing
Which scenario is config-server-otel-enabled = true and tracerIsEnabled = false? Aka can it happen that one tries to enable otel via the config option, but otel is effectively still disabled? How? Do / should we log this or throw an error or so?
> Which scenario is config-server-otel-enabled = true and tracerIsEnabled = false? Aka can it happen that one tries to enable otel via the config option, but otel is effectively still disabled? How?
tracerIsEnabled == true when OTel is enabled (it is by default) but no trace collectors to post the traces to have been configured, per iand675/hs-opentelemetry#201. This results in resources being consumed for tracing while no traces actually make their way to a consumer. That's why I'm taking this into account and plugging the case with an explicit noop.
> Do / should we log this or throw an error or so?
We do not, but this is a great idea, and yes, we should, now that we have our own OTel-related config parameter. Thank you for pointing that out. IMO this warrants a warning rather than a hard error (so the user wouldn't have to configure the whole OTel stack in order to continue playing around with PostgREST). Let me add the thing in a jiffy.
There seems to be no way to log a warning and continue; there's either "here's the config value I've parsed" or fail "something bad happened". Failure it is, then. UPD: OR I could move the tracerIsEnabled value to AppState, still emit a warning (just later on, during app initialization), and let us control otel-enabled at runtime, which might come in handy for the end user given how resource-consuming OTel currently is. @wolfgangwalther @steve-chavez WDYT?
> let us control otel-enabled at runtime, which might come in handy for the end user given how resource-consuming OTel currently is
You mean, I could start with otel-enabled = false, but provide all the setup for the OTel environment variables themselves (so tracerIsEnabled = true). Then, when I have a specific use case I want to look at, I enable otel-enabled = true via reloading and the data starts flowing?
Sounds useful to me.
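Concretely, the runtime toggle could be as small as this — a sketch where the names are made up; only the flag-in-AppState idea is what's being proposed:

```haskell
-- A sketch of the runtime toggle; OtelState and the function names are
-- hypothetical, only the reloadable-flag idea reflects the proposal.
import Data.IORef (IORef, readIORef, writeIORef)

newtype OtelState = OtelState { otelEnabledRef :: IORef Bool }

-- consulted per request: wrap the action in a span only when the flag is on
withOptionalSpan :: OtelState -> (IO a -> IO a) -> IO a -> IO a
withOptionalSpan st inSpan action = do
  enabled <- readIORef (otelEnabledRef st)
  if enabled then inSpan action else action

-- called from the config-reload path (e.g. on SIGUSR2), no restart needed
setOtelEnabled :: OtelState -> Bool -> IO ()
setOtelEnabled st = writeIORef (otelEnabledRef st)
```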
PR here: NixOS/nixpkgs#512285
I rebased, resolved conflicts, and addressed most of my own feedback from above. There are still some comments to look at, and some CI results; I'm not sure whether I was able to test everything locally.
* Introduces producing OpenTelemetry traces with hs-opentelemetry.
* Adds OTel spans over the whole application loop and over each request-processing phase
* Preliminary OTel tracing support in spec tests
* Disables tracing in load and memory tests

fix deps
The following comments are still open: […]
What's the plan for getting these resolved?
@wolfgangwalther they're being addressed, will push updates on Monday
yes, precisely
This PR introduces producing OpenTelemetry traces containing, among others, the same metrics as in the Server-Timing header from before.

TODO:

- build with Nix as well (for now Stack only) — DONE for free
- make an example of exporting log messages — hs-opentelemetry doesn't support logging, per Logging roadmap iand675/hs-opentelemetry#100
- make getTracer available globally: we're interested in using as many different spans as it makes sense, so getTracer should be available everywhere, as described in hs-opentelemetry-sdk's README — seems impossible without major refactoring
- hs-opentelemetry-wai middleware
- look into failing Windows builds — hs-opentelemetry-sdk depends on unix, tracking in Windows support iand675/hs-opentelemetry#109

Running:
I sort of gave up deploying and configuring all the moving bits locally, so you'd need to create a honeycomb.io account for this one (or ask me for the invite). After that, it's quite straightforward:

- stack build, and get the binary's path with stack exec -- which postgrest
- nix-shell, then postgrest-with-postgresql-15 --fixture ./test/load/fixture.sql -- cat. Note the server URL, you'll need it when running the PostgREST server
- postgrest-jwt --exp 36000 postgrest_test_anonymous

Tests
hspec tests are also instrumented; for those to produce traces you only need to set the OTEL_* vars (e.g. OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_EXPORTER_OTLP_HEADERS).