Conversation

Collaborator

@ellemouton ellemouton commented Dec 4, 2025

Fixes a race found in this build.

See the commit messages for more details.

The PubKey method had a race condition where concurrent calls could
all pass the nil check and race to write to the cached pubKey field.
This is a classic check-then-act race.

Remove the caching entirely to fix the race. The overhead of parsing
a public key is minimal and doesn't justify the added complexity and
race risk of caching.
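As a rough sketch of both shapes (a minimal model; the field and method
names follow the commit message, not necessarily the exact definitions
in the codebase):

package graphsketch

import "github.com/btcsuite/btcd/btcec/v2"

// Node mirrors only the fields relevant to the race.
type Node struct {
	PubKeyBytes [33]byte
	pubKey      *btcec.PublicKey // the cache removed by the fix
}

// pubKeyRacy is the old shape: two goroutines can both observe
// pubKey == nil (the "check") and then both write the field (the
// "act") with no synchronization in between.
func (n *Node) pubKeyRacy() (*btcec.PublicKey, error) {
	if n.pubKey != nil {
		return n.pubKey, nil
	}
	key, err := btcec.ParsePubKey(n.PubKeyBytes[:])
	if err != nil {
		return nil, err
	}
	n.pubKey = key // unsynchronized write: the race
	return key, nil
}

// PubKey is the fixed shape: parse on every call, leaving no shared
// mutable state to race on.
func (n *Node) PubKey() (*btcec.PublicKey, error) {
	return btcec.ParsePubKey(n.PubKeyBytes[:])
}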
Both the NodeKey1 and NodeKey2 methods had the same race condition as
the Node.PubKey method: concurrent calls could race to write to the
cached fields.

Remove the caching for the same reasons: parsing overhead is minimal
and doesn't justify the complexity and race risk.
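The same shape applies to the edge keys. Assuming the NodeKey1Bytes
field quoted later in this review, the fixed accessor reduces to a
plain parse (NodeKey2 is identical with the second key's bytes):

// NodeKey1 parses the first node's key directly from its raw bytes.
func (c *ChannelEdgeInfo) NodeKey1() (*btcec.PublicKey, error) {
	return btcec.ParsePubKey(c.NodeKey1Bytes[:])
}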
The DisconnectBlockAtHeight method was modifying the rejectCache and
chanCache without holding the cacheMu lock. This caused races with
other operations that properly held the lock, such as AddChannelEdge
which modifies the caches in its OnCommit callback while the batch
scheduler holds cacheMu.

Fix by acquiring cacheMu before removing channels from the caches.
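A self-contained model of this bug (not lnd code) that the race
detector flags under go run -race: one goroutine mutates a shared
cache under the mutex, as the OnCommit callback does, while the other
mutates it with no lock, as the old DisconnectBlockAtHeight did:

package main

import "sync"

func main() {
	var mu sync.Mutex
	cache := make(map[uint64]struct{})

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // locked writer, like the OnCommit callback
		defer wg.Done()
		mu.Lock()
		cache[1] = struct{}{}
		mu.Unlock()
	}()
	go func() { // unlocked writer, like the old code: the race
		defer wg.Done()
		delete(cache, 1)
	}()
	wg.Wait()
}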
@gemini-code-assist

Summary of Changes

Hello @ellemouton, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves several race conditions identified in the channel graph database. The primary focus is on ensuring data consistency by removing problematic caching in the public key parsing methods and adding mutex-based synchronization around critical cache modifications. These changes improve the stability and reliability of the graph database under concurrent access.

Highlights

  • Race Condition Fixes in Public Key Parsing: Eliminated check-then-act race conditions in Node.PubKey() and ChannelEdgeInfo.NodeKey1/NodeKey2() methods by removing the caching of parsed public keys. These methods now parse the public key directly from its byte representation each time they are called.
  • Mutex Protection for Cache Operations: Introduced proper mutex locking (s.cacheMu) to DisconnectBlockAtHeight and updateEdgeCache methods in the SQL store. This ensures thread-safe access to the rejectCache and chanCache, preventing concurrency issues when modifying these shared resources.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes several race conditions in the channel graph database. The main changes are removing racy in-memory caching of public keys in Node and ChannelEdgeInfo and instead parsing them on each access, and adding mutex protection for concurrent access to rejectCache and chanCache in updateEdgeCache and DisconnectBlockAtHeight. The changes correctly address the identified race conditions. However, I've found a critical issue where the lock added in updateEdgeCache may cause a deadlock due to the locking strategy of the batch package. I've also added a medium-severity suggestion for improving code style and safety in DisconnectBlockAtHeight and a note on a placeholder in the release notes. Overall, the direction is good but the potential deadlock needs to be addressed.
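For context on the deadlock concern (a generic illustration, not the
actual lnd call path): Go's sync.Mutex is not reentrant, so if the
batch scheduler already holds cacheMu when updateEdgeCache runs, a
second Lock of the same mutex inside updateEdgeCache blocks forever:

package main

import "sync"

var mu sync.Mutex

func update() {
	mu.Lock() // blocks forever if the caller already holds mu
	defer mu.Unlock()
}

func main() {
	mu.Lock()
	update() // deadlock: sync.Mutex is not reentrant
	mu.Unlock()
}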

Comment on lines +2917 to +2919
s.cacheMu.Lock()
for _, channel := range removedChans {
	s.rejectCache.remove(channel.ChannelID)
	s.chanCache.remove(channel.ChannelID)
}
s.cacheMu.Unlock()


medium

While this is correct, using defer s.cacheMu.Unlock() right after the Lock() call is more idiomatic in Go and safer against future modifications that might add an early return path within the locked section. This would also make it consistent with the change in updateEdgeCache.
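The suggested variant, using the same identifiers as the diff above,
would look like this:

s.cacheMu.Lock()
defer s.cacheMu.Unlock()

for _, channel := range removedChans {
	s.rejectCache.remove(channel.ChannelID)
	s.chanCache.remove(channel.ChannelID)
}

Note that defer releases the lock only when the enclosing function
returns, so this trades a slightly wider critical section for safety
against early returns.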

@yyforyongyu yyforyongyu added this to the v0.20.1 milestone Dec 4, 2025
Collaborator

@bitromortac bitromortac left a comment


I think the changes look good, pending a pathfinding benchmark to see if the caching is relevant.


// NodeKey1Bytes is the raw public key of the first node.
NodeKey1Bytes [33]byte
nodeKey1 *btcec.PublicKey
Collaborator


Caching came from #706; I think it would be good to do a pathfinding benchmark and/or a profile.

Collaborator Author


Cool, yeah. The other option is just to store only the *btcec.PublicKey here and not the bytes.
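A hypothetical sketch of that alternative (names invented for
illustration): store only the parsed key and derive the 33-byte wire
form on demand:

type ChannelEdgeInfo struct {
	nodeKey1 *btcec.PublicKey
	// ...
}

// NodeKey1Bytes derives the raw 33-byte form from the parsed key.
func (c *ChannelEdgeInfo) NodeKey1Bytes() [33]byte {
	var b [33]byte
	copy(b[:], c.nodeKey1.SerializeCompressed())
	return b
}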

Collaborator Author


Another option is to keep the caching but use a sync.Once or something similar.
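A sketch of that variant, which keeps the cache but makes the one-time
initialization race-free:

type Node struct {
	PubKeyBytes [33]byte

	pubKeyOnce sync.Once
	pubKey     *btcec.PublicKey
	pubKeyErr  error
}

func (n *Node) PubKey() (*btcec.PublicKey, error) {
	n.pubKeyOnce.Do(func() {
		n.pubKey, n.pubKeyErr = btcec.ParsePubKey(n.PubKeyBytes[:])
	})
	return n.pubKey, n.pubKeyErr
}

One trade-off: embedding a sync.Once makes the struct unsafe to copy
by value, which matters if Node values are ever copied.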

Collaborator


I didn't see any pathfinding or graph loading performance degradation. I'm not sure about allocations, but haven't spotted anything suspicious and the call sites seem not to be relevant for pathfinding at least.

Collaborator


I also checked allocations and couldn't see a difference

@saubyk saubyk added this to lnd v0.20 Dec 4, 2025
@saubyk saubyk moved this to In review in lnd v0.20 Dec 4, 2025
@saubyk saubyk requested a review from yyforyongyu December 4, 2025 15:21
Collaborator

@bitromortac bitromortac left a comment


LGTM 🎉
