
Conversation

@aarshkshah1992
Contributor

@aarshkshah1992 commented Dec 22, 2025

What type of PR is this?

Feature

What does this PR do? Why is it needed?

This PR takes @MarcoPolo's PR #16130 to completion with tests.

The description on his PR:

"""
a relatively small change to optimize network send order.

Without this, network writes tend to prioritize sending data for one column to all peers before sending data for later columns (e.g., for two columns and 4 peers per column it would send A,A,A,A,B,B,B,B). With batch publishing we can change the write order to round-robin across columns (e.g., A,B,A,B,A,B,A,B).

In cases where the process is sending at a rate over the network limit, this approach allows at least some copies of each column to propagate through the network. In early simulations with a bandwidth limit of 50 Mbps for the publisher, this improved dissemination by ~20-30%.
"""
See the issue for some more context.
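
For intuition, here is a minimal, self-contained sketch of the interleaved send order; the column/peer labels are toy stand-ins, not the gossipsub batch API:

package main

import "fmt"

// interleave flattens per-column send queues in round-robin order
// (A,B,A,B,...) instead of column-major order (A,A,...,B,B,...).
func interleave(columns [][]string) []string {
	var out []string
	for i := 0; ; i++ {
		progressed := false
		for _, sends := range columns {
			if i < len(sends) {
				out = append(out, sends[i])
				progressed = true
			}
		}
		if !progressed {
			return out
		}
	}
}

func main() {
	a := []string{"A->p1", "A->p2", "A->p3", "A->p4"}
	b := []string{"B->p1", "B->p2", "B->p3", "B->p4"}
	// Prints [A->p1 B->p1 A->p2 B->p2 A->p3 B->p3 A->p4 B->p4]
	fmt.Println(interleave([][]string{a, b}))
}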

Which issue(s) does this PR fix?

Fixes #16129

Other notes for review

Acknowledgements

  • I have read CONTRIBUTING.md.
  • I have included a uniquely named changelog fragment file.
  • I have added a description with sufficient context for reviewers to understand this PR.
  • I have tested that my changes work as expected and I added a testing plan to the PR description (if applicable).

MarcoPolo and others added 3 commits December 10, 2025 09:29
a relatively small change to optimize network send order.

Without this, network writes tend to prioritize sending data for one column
to all peers before sending data for later columns (e.g for two columns
and 4 peers per column it would send A,A,A,A,B,B,B,B). With batch
publishing we can change the write order to round robin across columns
(e.g. A,B,A,B,A,B,A,B).

In cases where the process is sending at a rate over the network limit,
this approach allows at least some copies of the column to propagate
through the network. In early simulations with bandwidth limits of
50mbps for the publisher, this improved dissemination by ~20-30%.
@aarshkshah1992
Contributor Author

@kasey Kurtosis testing looks good:

Here is the Kurtosis config file:

participants:
  # Super-nodes
  - el_type: geth
    el_image: ethpandaops/geth:master
    cl_type: prysm
    vc_image: gcr.io/offchainlabs/prysm/validator:latest
    cl_image: prysm-bn-custom-image:latest
    count: 2
    cl_extra_params:
      - --supernode
      - --subscribe-all-subnets
    vc_extra_params:
      - --verbosity=info
  # Full-nodes
  - el_type: geth
    el_image: ethpandaops/geth:master
    cl_type: prysm
    vc_image: gcr.io/offchainlabs/prysm/validator:latest
    cl_image: prysm-bn-custom-image:latest
    count: 2
    validator_count: 4
    cl_extra_params:
      - --semi-supernode
      - --subscribe-all-subnets
    vc_extra_params:
      - --verbosity=info
  - el_type: geth
    el_image: ethpandaops/geth:master
    cl_type: lighthouse
    cl_image: ethpandaops/lighthouse:unstable
    count: 2
    cl_extra_params:
      - --subscribe-all-subnets


additional_services:
  - dora
  - spamoor

spamoor_params:
  image: ethpandaops/spamoor:master
  max_mem: 4000
  spammers:
    - scenario: eoatx
      config:
        throughput: 200
    - scenario: blobs
      config:
        throughput: 20

network_params:
  fulu_fork_epoch: 0
  withdrawal_type: "0x02"
  preset: mainnet
  seconds_per_slot: 6

global_log_level: info

Here is the DORA screenshot:
[screenshot omitted]

Here are the logs showing batch publishing of data columns (I added these log lines locally, haven't committed them):

[screenshot omitted]

@aarshkshah1992 requested a review from kasey December 23, 2025 04:15
@terencechain previously approved these changes Dec 23, 2025
	if seen[topic] == 0 {
		t.Errorf("Expected topic %s to be sent at least once", topic)
	}
}
Collaborator

The above test checks the new behavior by verifying the following invariants:

  • all expected topics are seen at least once
  • no topic is published a second time until every expected topic has been published once

I think we should also have a test case making sure that each peer is given each expected message. An example bug that the current test assertions would miss is "only publish each message to 1 peer".
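
A minimal sketch of that extra assertion, assuming the test records (peer, topic) publish events in the same _test.go file; the publishRecord type and helper below are hypothetical, not part of the existing fixture:

// publishRecord captures one observed publish: which topic was sent to which peer.
type publishRecord struct {
	peer  string
	topic string
}

// assertEveryPeerGotEveryTopic fails if any (peer, topic) pair is missing from the
// recorded publishes, catching bugs like "only publish each message to 1 peer".
func assertEveryPeerGotEveryTopic(t *testing.T, records []publishRecord, peers, topics []string) {
	t.Helper()
	seen := make(map[publishRecord]bool, len(records))
	for _, r := range records {
		seen[r] = true
	}
	for _, p := range peers {
		for _, tp := range topics {
			if !seen[publishRecord{peer: p, topic: tp}] {
				t.Errorf("peer %s never received a publish on topic %s", p, tp)
			}
		}
	}
}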

Collaborator

case <-ctx.Done():
	return errors.Wrapf(ctx.Err(), "unable to find requisite number of peers for topic %s, 0 peers found to publish to", topic)
default:
	time.Sleep(100 * time.Millisecond)
Collaborator

select will enter the default case immediately when ctx.Done() is not ready, at which point the goroutine sleeps without checking for context cancellation and then re-enters the loop, where we may call AddToBatch even though the context was canceled during the sleep. This is why we should generally prefer a blocking select that keeps checking the context until the deadline hits, using time.After, like this:

	for {
		if len(topicHandle.ListPeers()) > 0 || flags.Get().MinimumSyncPeers == 0 {
			return topicHandle.AddToBatch(ctx, batch, data, opts...)
		}
		select {
		case <-ctx.Done():
			return errors.Wrapf(ctx.Err(), "unable to find requisite number of peers for topic %s, 0 peers found to publish to", topic)
		case <-time.After(100 * time.Millisecond):
			// after 100ms, reenter the for loop
		}
	}

Collaborator

func (s *Service) addToBatch(ctx context.Context, batch *pubsub.MessageBatch, topic string, data []byte, opts ...pubsub.PubOpt) error {
	topicHandle, err := s.JoinTopic(topic)
	if err != nil {
		return err
Contributor

Please wrap errors.
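
For example, a hedged sketch of the wrapped version, using the same errors.Wrapf style already used elsewhere in this diff (the message text is only a suggestion):

topicHandle, err := s.JoinTopic(topic)
if err != nil {
	return errors.Wrapf(err, "could not join topic %s", topic)
}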

// method to broadcast or batch messages to other peers in our gossip mesh. If
// batch is non-nil the message is added to the batch WITHOUT publishing. The
// caller MUST publish the batch after all messages have been added to the batch
func (s *Service) broadcastOrBatchObject(ctx context.Context, batch *pubsub.MessageBatch, obj ssz.Marshaler, topic string) error {
Contributor

I understand you created broadcastOrBatchObject to avoid some code duplication between broadcastObject and batchObject.
However, in this particular case, I think it would be better to:

  • remove this broadcastOrBatchObject
  • create a subfunction running the common code
  • call this subfunction both in broadcastObject and batchObject.

(Also, the span p2p.broadcastObject is now wrong.)

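A rough sketch of that split; encodeObject is a hypothetical name for the shared helper, and the single-message path is assumed to go through something like PublishToTopic (the real marshalling and tracing-span code in the p2p package will differ):

func (s *Service) broadcastObject(ctx context.Context, obj ssz.Marshaler, topic string) error {
	data, err := s.encodeObject(obj) // hypothetical helper holding the common encode logic
	if err != nil {
		return errors.Wrap(err, "encode object")
	}
	return s.PublishToTopic(ctx, topic, data)
}

func (s *Service) batchObject(ctx context.Context, batch *pubsub.MessageBatch, obj ssz.Marshaler, topic string) error {
	data, err := s.encodeObject(obj)
	if err != nil {
		return errors.Wrap(err, "encode object")
	}
	return s.addToBatch(ctx, batch, topic, data)
}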

// Broadcast the data column sidecar to the network.
if err := s.broadcastObject(ctx, sidecar, topic); err != nil {
if err := s.batchObject(ctx, &messageBatch, sidecar, topic); err != nil {
Contributor

Before your PR:
If no peers were found for a given subnet, the sidecars corresponding to the other subnets were still broadcast.

After your PR:
No sidecar will be broadcast at all until the required number of peers (1) is found in every subnet for which sidecars must be broadcast.

Contributor Author

@nalepae I agree. I have two thoughts on this:

  1. With pro-active peer discovery and connectivity for subnets in https://github.com/OffchainLabs/prysm/pull/16036/files, not having peers for a given subnet will be less of a problem.

  2. We could periodically batch-publish here and flush the batch when a timeout is hit, i.e. if it has been, say, 1s since the last publish, publish the batch right away and continue batching.

Contributor

What about first splitting the sidecars into two categories: those with enough peers and those without enough peers?
For sidecars with enough peers, let's keep the new mechanism (batch publishing).
For sidecars without enough peers, let's start looking for new peers (one goroutine per sidecar) and publish as soon as there are enough peers (in the same goroutine)?
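
A rough, self-contained sketch of that flow; the sidecar type, hasPeers, and publish functions below are toy stand-ins for the real broadcaster code:

package main

import (
	"context"
	"fmt"
	"time"
)

type sidecar struct{ subnet uint64 }

// Stand-ins for the real peer-count check and gossipsub publish;
// in this toy example only even subnets ever have peers.
func hasPeers(sc sidecar) bool { return sc.subnet%2 == 0 }
func publish(sc sidecar)       { fmt.Println("published sidecar for subnet", sc.subnet) }

func broadcastSidecars(ctx context.Context, sidecars []sidecar) {
	// 1. Split into sidecars whose subnets already have peers and those that do not.
	var ready, waiting []sidecar
	for _, sc := range sidecars {
		if hasPeers(sc) {
			ready = append(ready, sc)
		} else {
			waiting = append(waiting, sc)
		}
	}

	// 2. Batch-publish the ready ones right away (stand-in for the new batch path).
	for _, sc := range ready {
		publish(sc)
	}

	// 3. For the rest, wait for peers in one goroutine per sidecar and publish
	//    individually as soon as peers show up, or give up when ctx is done.
	for _, sc := range waiting {
		go func(sc sidecar) {
			for {
				select {
				case <-ctx.Done():
					return
				case <-time.After(100 * time.Millisecond):
					if hasPeers(sc) {
						publish(sc)
						return
					}
				}
			}
		}(sc)
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	broadcastSidecars(ctx, []sidecar{{0}, {1}, {2}, {3}})
	<-ctx.Done() // let the background goroutines run in this toy example
}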

Contributor Author

Done. Testing on Kurtosis now.

Contributor Author

@nalepae Can you please do one more review?

aarshkshah1992 and others added 5 commits January 5, 2026 11:27
Co-authored-by: Preston Van Loon <pvanloon@offchainlabs.com>
Co-authored-by: Preston Van Loon <pvanloon@offchainlabs.com>
.gitignore Outdated
execution/

# AI assistant files
CLAUDE.md
Contributor

Actually, we currently don't have a CLAUDE.md file for Prysm. I think:

  • we should have one
  • we should NOT ignore it.

If you want to have a personal CLAUDE.md file, it's better to use CLAUDE.local.md (that one is ignored by git).

Contributor

Nit: no newline at the end of the file.

Contributor Author

Removed the CLAUDE.md file for now.

Collaborator

Generally we prefer folks to use .git/info/exclude rather than editing the repo .gitignore.

@aarshkshah1992
Contributor Author

@nalepae Kurtosis testing looks good:

[screenshots omitted]

@aarshkshah1992
Contributor Author

Looks okay so far on Hoodi, nothing fishy:

d9851cc652fe7888402577879e5f2db602ef33bbc7d06be5b7c99f slot=2117192
[2026-01-05 18:58:25.46] INFO blockchain: Called fork choice updated with optimistic block finalizedPayloadBlockHash=0xba19beeb3797 headPayloadBlockHash=0xdeb732634435 headSlot=2117192
[2026-01-05 18:58:25.47] INFO blockchain: Synced new block block=0x87ebf2f6... epoch=66162 finalizedEpoch=66160 finalizedRoot=0x9e2bc186... slot=2117192
[2026-01-05 18:58:25.47] INFO blockchain: Finished applying state transition attestations=8 kzgCommitmentCount=21 payloadHash=0xdeb732634435 slot=2117192 syncBitsCount=456 txCount=122
[2026-01-05 18:58:36.49] INFO blockchain: Called new payload with optimistic block parentRoot=0x87ebf2f610d9851cc652fe7888402577879e5f2db602ef33bbc7d06be5b7c99f payloadBlockHash=0xcd5f8eec9289 root=0x77615f0591fb354540ddea9695a5871e946a6607c3126394b88c7783b7a3273c slot=2117193
[2026-01-05 18:58:36.75] INFO blockchain: Called fork choice updated with optimistic block finalizedPayloadBlockHash=0xba19beeb3797 headPayloadBlockHash=0xcd5f8eec9289 headSlot=2117193
[2026-01-05 18:58:36.76] INFO blockchain: Synced new block block=0x77615f05... epoch=66162 finalizedEpoch=66160 finalizedRoot=0x9e2bc186... slot=2117193
[2026-01-05 18:58:36.76] INFO blockchain: Finished applying state transition attestations=8 kzgCommitmentCount=5 payloadHash=0xcd5f8eec9289 slot=2117193 syncBitsCount=444 txCount=53
[2026-01-05 18:58:49.19] INFO blockchain: Called new payload with optimistic block parentRoot=0x77615f0591fb354540ddea9695a5871e946a6607c3126394b88c7783b7a3273c payloadBlockHash=0x517e232ba353 root=0x55a5cad42fe18b398ae9be9cee4f171fc29e2367af3f510dfca728d61601f1e7 slot=2117194
[2026-01-05 18:58:49.80] INFO blockchain: Called fork choice updated with optimistic block finalizedPayloadBlockHash=0xba19beeb3797 headPayloadBlockHash=0x517e232ba353 headSlot=2117194
[2026-01-05 18:58:49.81] INFO blockchain: Synced new block block=0x55a5cad4... epoch=66162 finalizedEpoch=66160 finalizedRoot=0x9e2bc186... slot=2117194
[2026-01-05 18:58:49.81] INFO blockchain: Finished applying state transition attestations=8 kzgCommitmentCount=21 payloadHash=0x517e232ba353 slot=2117194 syncBitsCount=446 txCount=90

Co-authored-by: Manu NALEPA <enalepa@offchainlabs.com>
@kasey enabled auto-merge January 5, 2026 21:39
@kasey added this pull request to the merge queue Jan 5, 2026
Merged via the queue into develop with commit cc4510b Jan 5, 2026
19 checks passed
@kasey deleted the batch-publish-data-columns branch January 5, 2026 22:09

Development

Successfully merging this pull request may close these issues.

Use of Gossipsub .PublishBatch

7 participants