p2p: batch publish data column sidecars #16183
A relatively small change to optimize network send order. Without this, network writes tend to prioritize sending data for one column to all peers before sending data for later columns (e.g. for two columns and 4 peers per column it would send A,A,A,A,B,B,B,B). With batch publishing we can change the write order to round-robin across columns (e.g. A,B,A,B,A,B,A,B). In cases where the process is sending at a rate over the network limit, this approach allows at least some copies of each column to propagate through the network. In early simulations with a 50 Mbps bandwidth limit for the publisher, this improved dissemination by ~20-30%.
@kasey Kurtosis testing looks good: Here is the Kurtosis config file: Here are the logs showing batch publishing of data columns (I added these log lines locally, haven't committed them):
if seen[topic] == 0 {
	t.Errorf("Expected topic %s to be sent at least once", topic)
}
}
The above test checks the new behavior by verifying the following invariants:
- all expected topics are seen at least once
- no topic is published to twice until all expected topics have had one publish
I think we should also have a test case making sure that each peer is given each expected message. An example bug that the current test assertions would miss is "only publish each message to 1 peer".
beacon-chain/p2p/pubsub.go (Outdated)
case <-ctx.Done():
	return errors.Wrapf(ctx.Err(), "unable to find requisite number of peers for topic %s, 0 peers found to publish to", topic)
default:
	time.Sleep(100 * time.Millisecond)
select will enter the default case immediately if ctx.Done() is not ready, at which point the goroutine sleeps without checking for context cancellation, then re-enters the loop, where we may call AddToBatch even though the context was canceled during the sleep. This is why we should generally prefer a concurrent context check until the deadline hits, using time.After, like this:
for {
	if len(topicHandle.ListPeers()) > 0 || flags.Get().MinimumSyncPeers == 0 {
		return topicHandle.AddToBatch(ctx, batch, data, opts...)
	}
	select {
	case <-ctx.Done():
		return errors.Wrapf(ctx.Err(), "unable to find requisite number of peers for topic %s, 0 peers found to publish to", topic)
	case <-time.After(100 * time.Millisecond):
		// after 100ms, re-enter the for loop
	}
}
beacon-chain/p2p/pubsub.go (Outdated)
func (s *Service) addToBatch(ctx context.Context, batch *pubsub.MessageBatch, topic string, data []byte, opts ...pubsub.PubOpt) error {
	topicHandle, err := s.JoinTopic(topic)
	if err != nil {
		return err
Please wrap errors.
beacon-chain/p2p/broadcaster.go (Outdated)
// method to broadcast or batch messages to other peers in our gossip mesh. If
// batch is non-nil the message is added to the batch WITHOUT publishing. The
// caller MUST publish the batch after all messages have been added to the batch.
func (s *Service) broadcastOrBatchObject(ctx context.Context, batch *pubsub.MessageBatch, obj ssz.Marshaler, topic string) error {
I understand you created this broadcastOrBatchObject to avoid some code duplication between broadcastObject and batchObject.
However, in this precise case, I think it would be better to:
- remove broadcastOrBatchObject
- create a subfunction running the common code
- call this subfunction in both broadcastObject and batchObject.
(Also, the span p2p.broadcastObject is now wrong.)
beacon-chain/p2p/broadcaster.go (Outdated)
// Broadcast the data column sidecar to the network.
-	if err := s.broadcastObject(ctx, sidecar, topic); err != nil {
+	if err := s.batchObject(ctx, &messageBatch, sidecar, topic); err != nil {
Before your PR:
If, for a given subnet, no peers were found, the sidecars corresponding to the other subnets were still broadcast.
After your PR:
No sidecar will be broadcast at all until the required number of peers (1) is found in every subnet for which sidecars must be broadcast.
@nalepae I agree. I have two thoughts on this:
- With pro-active peer discovery and connectivity for subnets in https://github.com/OffchainLabs/prysm/pull/16036/files, not having peers for a given subnet will be less of a problem.
- We could periodically batch publish here and empty the batch if timeouts are hit, i.e. if it's been, say, 1s since the last publish, publish the batch right away and continue batching.
What about first splitting into two categories: sidecars with enough peers and sidecars without enough peers?
For sidecars with enough peers, let's keep the new mechanism (batch publishing).
For sidecars without enough peers, let's start looking for new peers (one goroutine per sidecar) and publish as soon as there are enough peers (in the same goroutine)?
Done. Testing on Kurtosis now.
@nalepae Can you please do one more review?
Co-authored-by: Preston Van Loon <pvanloon@offchainlabs.com>
.gitignore (Outdated)
execution/

# AI assistant files
CLAUDE.md
Actually, we currently don't have a CLAUDE.md file for Prysm. I think:
- we should have one
- we should NOT ignore it.
If you want a personal CLAUDE.md file, it's better to use CLAUDE.local.md (that one is ignored by git).
NIT: No newline at the end of the file.
Removed the CLAUDE.md file for now.
Generally prefer folks to use .git/info/exclude rather than edit the repo gitignore.
@nalepae Kurtosis testing looks good:
Looks okay so far on Hoodi, nothing fishy:
Co-authored-by: Manu NALEPA <enalepa@offchainlabs.com>




What type of PR is this?
Feature
What does this PR do? Why is it needed?
This PR takes @MarcoPolo's PR at #16130 to completion with tests.
The description on his PR:
"""
A relatively small change to optimize network send order.
Without this, network writes tend to prioritize sending data for one column to all peers before sending data for later columns (e.g. for two columns and 4 peers per column it would send A,A,A,A,B,B,B,B). With batch publishing we can change the write order to round-robin across columns (e.g. A,B,A,B,A,B,A,B).
In cases where the process is sending at a rate over the network limit, this approach allows at least some copies of each column to propagate through the network. In early simulations with a 50 Mbps bandwidth limit for the publisher, this improved dissemination by ~20-30%.
"""
See the issue for some more context.
Which issues(s) does this PR fix?
Fixes #16129
Other notes for review
Acknowledgements