Skip to content

Conversation

@vipvap
Copy link

@vipvap vipvap commented Apr 15, 2025

No description provided.

hlinnaka and others added 30 commits April 8, 2025 09:03
Also, move the call to the lfc_init() function. It was weird to have it
in libpagestore.c, when libpagestore.c otherwise had nothing to do with
the LFC. Move it directly into _PG_init()
…nts/rust/tokio-postgres in the cargo group across 1 directory (#11478)

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
## Problem
`test_scrubber_tenant_snapshot` is flaky with `request was dropped`
errors. More details are in the issue.
- Closes: #11278

## Summary of changes
- Disable shard scheduling during pageservers restart
- Add `reconcile_until_idle` in the end of the test
- Work on neondatabase/neon-archive-cloud#24896
- Cplane part neondatabase/neon-archive-cloud#26808

Instead of reconfiguring rsyslog via an API endpoint [we have
agreed](https://neondb.slack.com/archives/C04DGM6SMTM/p1743513810964509?thread_ts=1743170369.865859&cid=C04DGM6SMTM)
to have a new `logs_export_host` field as part of the compute spec.

---------

Co-authored-by: Tristan Partin <tristan@neon.tech>
Service targeted for storing and retrieving LFC prewarm data.
Can be used for proxying S3 access for Postgres extensions like
pg_mooncake as well.

Requests must include a Bearer JWT token.
Token is validated using a pemfile (should be passed in infra/).

Note: app is not tolerant to extra trailing slashes, see app.rs
`delete_prefix` test for comments.

Resolves: https://github.com/neondatabase/cloud/issues/26342
Unrelated changes: gate a `rename_noreplace` feature and disable it in
`remote_storage` so as `object_storage` can be built with musl
## Problem

For future gc-compaction tests when we support
#10395

## Summary of changes

Add a new type of neon test WAL record that is conditionally applied
(i.e., only when image == the specified value). We can use this to mock
the situation where we lose some records in the middle, firing an error,
and see how gc-compaction reacts to it.

Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem

This is generated e.g. by `test_historic_storage_formats`, and causes
VSCode to list all the contained files as new.

## Summary of changes

Add `/artifact_cache` to `.gitignore`.
…11181)

## Problem

`tenant_import`, used to import an existing tenant from remote storage
into a storage controller for support and debugging, assumed
`DEFAULT_STRIPE_SIZE` since this can't be recovered from remote storage.
In #11168, we are changing the stripe size, which will break
`tenant_import`.

Resolves #11175.

## Summary of changes

* Add `stripe_size` to the tenant manifest.
* Add `TenantScanRemoteStorageShard::stripe_size` and return from
`tenant_scan_remote` if present.
* Recover the stripe size during`tenant_import`, or fall back to 32768
(the original default stripe size).
* Add tenant manifest compatibility snapshot:
`2025-04-08-pgv17-tenant-manifest-v1.tar.zst`

There are no cross-version concerns here, since unknown fields are
ignored during deserialization where relevant.
## Problem

Support of unlogged build in DEBUG_COMPARE_LOCAL.
Neon SMGR treats present of local file as indicator of unlogged
relations.
But it doesn't work in  DEBUG_COMPARE_LOCAL mode.

## Summary of changes

Use INIT_FORKNUM as indicator of unlogged file and create this file
while unlogged index build.

Co-authored-by: Konstantin Knizhnik <knizhnik@neon.tech>
## Problem

With the recent improvements to L0 compaction responsiveness,
`test_create_snapshot` now ends up generating 10,000 layer files
(compared to 1,000 in previous snapshots). This increases the snapshot
size by 4x, and significantly slows down tests.

## Summary of changes

Increase the target layer size from 128 KB to 256 KB, and the L0
compaction threshold from 1 to 5. This reduces the layer count from
about 10,000 to 1,000.
## Problem
Storcon will not start up if `use_https` is on and there are some
pageservers or safekeepers without https port in the database. Metrics
"how many nodes with https we have in DB" will help us to make sure that
`use_https` may be turned on safely.
- Part of neondatabase/neon-archive-cloud#25526

## Summary of changes
- Add `storage_controller_https_pageserver_nodes`,
`storage_controller_safekeeper_nodes` and
`storage_controller_https_safekeeper_nodes` Prometheus metrics.
## Problem

The current stripe size of 256 MB is a bit large, and can cause load
imbalances across shards. A stripe size of 16 MB appears more reasonable
to avoid hotspots, although we don't see evidence of this in benchmarks.

Resolves neondatabase/neon-archive-cloud#25634.
Touches neondatabase/neon-archive-cloud#21870.

## Summary of changes

* Change the default stripe size to 16 MB.
* Remove `ShardParameters::DEFAULT_STRIPE_SIZE`, and only use
`pageserver_api::shard::DEFAULT_STRIPE_SIZE`.
* Update a bunch of tests that assumed a certain stripe size.
The timeline stopping state is set much earlier than the cancellation
token is fired, so by checking for the stopping state, we can prevent
races with timeline shutdown where we issue a cancellation error but the
cancellation token hasn't been fired yet.

Fix #11427.
…1459)

pagestore_smgr.c had grown pretty large. Split into two parts, such
that the smgr routines that PostgreSQL code calls stays in
pagestore_smgr.c, and all the prefetching logic and other lower-level
routines related to communicating with the pageserver are moved to a
new source file, "communicator.c".

There are plans to replace communicator parts with a new
implementation. See #10799.
This commit doesn't implement any of the new things yet, but it is
good preparation for it. I'm imagining that the new implementation
will approximately replace the current "communicator.c" code, exposing
roughly the same functions to pagestore_smgr.c.

This commit doesn't change any functionality or behavior, or make any
other changes to the existing code: It just moves existing code
around.
…timestamp is not found (#11491)

## Problem
Now `get_timestamp_of_lsn` returns `404 NotFound` if there is no clog
pages for given LSN, and it's difficult to distinguish from other 404
errors. A separate status code for this error will allow the control
plane to handle this case.
- Closes: #11439
- Corresponding PR in control plane:
neondatabase/neon-archive-cloud#27125

## Summary of changes
- Return `412 PreconditionFailed` instead of `404 NotFound` if no
timestamp is fond for given LSN.

I looked briefly through the current error handling code in cloud.git
and the status code change should not affect anything for the existing
code. Change from the corresponding PR also looks fine and should work
with the current PS status code. Additionally, here is OK to merge it
from control plane team:
#11439 (comment)

---------

Co-authored-by: John Spray <john@neon.tech>
## Problem

We wish to improve pageserver batching such that one batch can contain
requests for
pages at different LSNs. The current shape of the code doesn't lend
itself to the change.

## Summary of changes

Refactor the read path such that the fringe gets initialized upfront.
This is where the multi LSN
change will plug in. A couple other small changes fell out of this.

There should be NO behaviour change here. If you smell one, shout!

I recommend reviewing commits individually (intentionally made them as
small as possible).

Related: #10765
I like to run nightly clippy every so often to make our future rust
upgrades easier. Some notable changes:

* Prefer `next_back()` over `last()`. Generic iterators will implement
`last()` to run forward through the iterator until the end.

* Prefer `io::Error::other()`.

* Use implicit returns

One case where I haven't dealt with the issues is the now
[more-sensitive "large enum variant"
lint](rust-lang/rust-clippy#13833). I chose not
to take any decisions around it here, and simply marked them as allow
for now.
We'd like to run benchmarks starting from a steady state. To this end,
do a reconciliation round before proceeding with the benchmark.

This is useful for benchmarks that use tenant dir snapshots since a
non-standard tenant configuration is used to generate the snapshot. The
storage controller is not aware of the non default tenant configuration
and will reconcile while the bench is running.
Instead of encoding a certain structure for claims, let's allow the
caller to specify what claims be encoded.

Signed-off-by: Tristan Partin <tristan@neon.tech>
This is preparatory work for teaching neon_local to pass the
Authorization header to compute_ctl.

Signed-off-by: Tristan Partin <tristan@neon.tech>
…ly) (#11498)

## Problem

our large oltp benchmark runs very long - we want to remove the duration
of the reindex step.
we don't run concurrent workload anyhow but added "concurrently" only to
have a "prod-like" approach. But if it just doubles the time we report
because it requires two instead of one full table scan we can remove it

## Summary of changes

remove keyword concurrently from the reindex step
#11497)

## Problem

Not a complete fix for #11492
but should work for a short term.

Our current retry strategy for walredo is to retry every request exactly
once. This retry doesn't make sense because it retries all requests
exactly once and each error is expected to cause process restart and
cause future requests to fail. I'll explain it with a scenario of two
threads requesting redos: one with an invalid history (that will cause
walredo to panic) and another that has a correct redo sequence.

First let's look at how we handle retries right now in
do_with_walredo_process. At the beginning of the function it will spawn
a new process if there's no existing one. Then it will continue to redo.
If the process fails, the first process that encounters the error will
remove the walredo process object from the OnceCell, so that the next
time it gets accessed, a new process will be spawned; if it is the last
one that uses the old walredo process, it will kill and wait the process
in `drop(proc)`. I'm skeptical whether this works under races but I
think this is not the root cause of the problem. In this retry handler,
if there are N requests attached to a walredo process and the i-th
request fails (panics the walredo), all other N-i requests will fail and
they need to retry so that they can access a new walredo process.

```
time       ---->
proc        A                 None   B
request 1   ^-----------------^ fail
            uses A for redo   replace with None
request 2      ^-------------------- fail
               uses A for redo
request 3             ^----------------^ fail
                      uses A for redo  last ref, wait for A to be killed
request 4                            ^---------------
                                     None, spawn new process B
```

The problem is with our retry strategy. Normally, for a system that we
want to retry on, the probability of errors for each of the requests are
uncorrelated. However, in walredo, a prior request that panics the
walredo process will cause all future walredo on that process to fail
(that's correlated).

So, back to the situation where we have 2 requests where one will
definitely fail and the other will succeed and we get the following
sequence, where retry attempts = 1,

* new walredo process A starts.
* request 1 (invalid) being processed on A and panics A, waiting for
retry, remove process A from the process object.
* request 2 (valid) being processed on A and receives pipe broken /
poisoned process error, waiting for retry, wait for A to be killed --
this very likely takes a while and cannot finish before request 1 gets
processed again
* new walredo process B starts.
* request 1 (invalid) being processed again on B and panics B, the whole
request fail.
* request 2 (valid) being processed again on B, and get a poisoned error
again.

```
time       ---->
proc        A                 None           B                    None
request 1   ^-----------------^--------------^--------------------^
            spawn A for redo  fail          spawn B for redo     fail
request 2      ^--------------------^-------------------------^------------^
               use A for redo       fail, wait to kill A      B for redo   fail again
```

In such cases, no matter how we set n_attempts, as long as the retry
count applies to all requests, this sequence is bound to fail both
requests because of how they get sequenced; while we could potentially
make request 2 successful.

There are many solutions to this -- like having a separate walredo
manager for compactions, or define which errors are retryable (i.e.,
broken pipe can be retried, while real walredo error won't be retried),
or having a exclusive big lock over the whole redo process (the current
one is very fine-grained). In this patch, we go with a simple approach:
use different retry attempts for different types of requests.

For gc-compaction, the attempt count is set to 0, so that it never
retries and consequently stops the compaction process -- no more redo
will be issued from gc-compaction. Once the walredo process gets
restarted, the normal read requests will proceed normally.

## Summary of changes

Add redo_attempt for each reconstruct value request to set different
retry policies.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Erik Grinaker <erik@neon.tech>
## Problem

In some cases gc-compaction doesn't respond to the L0 compaction yield
notifier. I suspect it's stuck on getting the first item, and if so, we
probably need to let L0 yield notifier preempt `next_with_trace`.

## Summary of changes

- Add `time_to_first_kv_pair` to gc-compaction statistics.
- Inverse the ratio so that smaller ratio -> better compaction ratio.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem

`test_location_conf_churn` often fails with `neither image nor delta
layer`, but doesn't say what the file actually is. However, past local
failures have indicated that it might be `.___temp` files.

Touches #11348.

## Summary of changes

Ignore `.___temp` files when evicting local layers, and include the file
name in the error message.
## Problem

Part of #9114 

## Summary of changes

Gc-compaction flag was not correctly set, causing it not getting
preempted by L0.

Signed-off-by: Alex Chi Z <chi@neon.tech>
## Problem

Part of #9114 

## Summary of changes

Gc-compaction flag was not correctly set, causing it not getting
preempted by L0.

Signed-off-by: Alex Chi Z <chi@neon.tech>
It isn't used by the production control plane or neon_local. The removal
simplifies compute spec logic just a little bit more since we can remove
any notion of whether we should allow live reconfigurations.

Signed-off-by: Tristan Partin <tristan@neon.tech>
## Problem
Page service doesn't use TLS for incoming requests.
- Closes: neondatabase/neon-archive-cloud#27236

## Summary of changes
- Add option `enable_tls_page_service_api` to pageserver config
- Propagate `tls_server_config` to `page_service` if the option is
enabled

No integration tests for now because I didn't find out how to call page
service API from python and AFAIK computes don't support TLS yet
## Problem

Walproposer should get elected and commit WAL on safekeepers specified
by the membership configuration.

## Summary of changes

- Add to wp `members_safekeepers` and `new_members_safekeepers` arrays
mapping configuration members to connection slots. Establish this
mapping (by node id) when safekeeper sends greeting, giving its id and
when mconf becomes known / changes.
- Add to TermsCollected, VotesCollected,
GetAcknowledgedByQuorumWALPosition membership aware logic. Currently it
partially duplicates existing one, but we'll drop the latter eventually.
- In python, rename Configuration to MembershipConfiguration for
clarity.
- Add test_quorum_sanity testing new logic.

ref #10851
## Problem

`test_location_conf_churn` performs random location updates on
Pageservers. While doing this, it could instruct the compute to connect
to a stale generation and execute queries. This is invalid, and will
fail if a newer generation has removed layer files used by the stale
generation.

Resolves #11348.

## Summary of changes

Only connect to the latest generation when executing queries.
@vipvap vipvap requested review from a team as code owners April 15, 2025 06:01
@vipvap vipvap requested review from NanoBjorn, arpad-m, conradludgate, knizhnik, piercypixel and tristan957 and removed request for a team April 15, 2025 06:01
@github-actions
Copy link

If this PR added a GUC in the Postgres fork or neon extension,
please regenerate the Postgres settings in the cloud repo:

make NEON_WORKDIR=path/to/neon/checkout \
  -C goapp/internal/shareddomain/postgres generate

If you're an external contributor, a Neon employee will assist in
making sure this step is done.

@github-actions
Copy link

github-actions bot commented Apr 15, 2025

8232 tests run: 7793 passed, 0 failed, 439 skipped (full report)


Flaky tests (9)

Postgres 17

Postgres 16

Postgres 14

Code coverage* (full report)

  • functions: 33.1% (8992 of 27156 functions)
  • lines: 49.0% (77893 of 158822 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
51ecd1b at 2025-04-15T08:44:04.877Z :recycle:

@cloneable
Copy link
Contributor

cloneable commented Apr 15, 2025

@cloneable cloneable requested review from cloneable and removed request for NanoBjorn, arpad-m, conradludgate, knizhnik, piercypixel and tristan957 April 15, 2025 08:12
@cloneable cloneable added this pull request to the merge queue Apr 15, 2025
@vipvap vipvap added the fast-forward Trigger a release PR `--ff-only` merge label Apr 15, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 15, 2025
@vipvap vipvap removed the fast-forward Trigger a release PR `--ff-only` merge label Apr 15, 2025
@vipvap vipvap merged commit 51ecd1b into release-proxy Apr 15, 2025
80 checks passed
@vipvap vipvap deleted the rc/release-proxy/2025-04-15 branch April 15, 2025 09:52
@danieltprice
Copy link
Contributor

Reviewed for changelog

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.