graph: fix various races #10420
Conversation
The PubKey method had a race condition where concurrent calls could all pass the nil check and race to write to the cached pubKey field. This is a classic check-then-act race. Remove the caching entirely to fix the race. The overhead of parsing a public key is minimal and doesn't justify the added complexity and race risk of caching.
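As a hedged illustration (field and method names follow the commit message, not necessarily the exact lnd code), the racy pattern looks like this:

```go
package graphdb

import "github.com/btcsuite/btcd/btcec/v2"

// Node is a trimmed-down stand-in for the graph node type; only the
// fields relevant to the race are shown.
type Node struct {
	PubKeyBytes [33]byte
	pubKey      *btcec.PublicKey // lazily populated cache (racy)
}

// PubKey demonstrates the check-then-act race: two goroutines can
// both observe pubKey == nil, both parse, and both write the field
// with no synchronization, which the race detector flags.
func (n *Node) PubKey() (*btcec.PublicKey, error) {
	if n.pubKey != nil { // check
		return n.pubKey, nil
	}

	key, err := btcec.ParsePubKey(n.PubKeyBytes[:])
	if err != nil {
		return nil, err
	}

	n.pubKey = key // act: unsynchronized write
	return key, nil
}
```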
Both NodeKey1 and NodeKey2 methods had the same race condition as the Node.PubKey method, where concurrent calls could race to write to the cached fields. Remove the caching for the same reasons: parsing overhead is minimal and doesn't justify the complexity and race risk.
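The fix, per the commit messages, is simply to drop the cached field and parse on every call (same assumed types as the sketch above):

```go
// PubKey is now safe for concurrent use: it reads only the immutable
// PubKeyBytes array and writes no shared state.
func (n *Node) PubKey() (*btcec.PublicKey, error) {
	return btcec.ParsePubKey(n.PubKeyBytes[:])
}
```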
The DisconnectBlockAtHeight method was modifying the rejectCache and chanCache without holding the cacheMu lock. This caused races with other operations that properly held the lock, such as AddChannelEdge which modifies the caches in its OnCommit callback while the batch scheduler holds cacheMu. Fix by acquiring cacheMu before removing channels from the caches.
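A minimal sketch of the locking fix, with stand-in cache types and assumed names (cacheMu, rejectCache, and chanCache match the commit message; the rest is illustrative):

```go
package graphdb

import "sync"

// channelCache is a stand-in for lnd's reject/channel caches.
type channelCache struct{ m map[uint64]struct{} }

func (c *channelCache) remove(chanID uint64) { delete(c.m, chanID) }

// ChannelGraph holds the caches guarded by cacheMu.
type ChannelGraph struct {
	cacheMu     sync.Mutex
	rejectCache channelCache
	chanCache   channelCache
}

// removeChannels holds cacheMu across the whole removal loop so it
// cannot interleave with OnCommit callbacks that also take cacheMu.
func (c *ChannelGraph) removeChannels(removedChanIDs []uint64) {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()

	for _, chanID := range removedChanIDs {
		c.rejectCache.remove(chanID)
		c.chanCache.remove(chanID)
	}
}
```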
Summary of Changes (Gemini Code Assist): This pull request addresses several race conditions in the channel graph database. It removes problematic caching patterns in the public key parsing functions and adds mutex-based synchronization around critical cache modifications, improving stability under concurrent access.
Force-pushed from 49ccdbe to 33b1727.
Code Review
This pull request fixes several race conditions in the channel graph database. The main changes are removing racy in-memory caching of public keys in Node and ChannelEdgeInfo and instead parsing them on each access, and adding mutex protection for concurrent access to rejectCache and chanCache in updateEdgeCache and DisconnectBlockAtHeight. The changes correctly address the identified race conditions. However, I've found a critical issue where the lock added in updateEdgeCache may cause a deadlock due to the locking strategy of the batch package. I've also added a medium-severity suggestion for improving code style and safety in DisconnectBlockAtHeight and a note on a placeholder in the release notes. Overall, the direction is good but the potential deadlock needs to be addressed.
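To make the reviewer's deadlock concern concrete: Go's sync.Mutex is not reentrant, so if the batch scheduler already holds cacheMu when it runs a callback that also takes cacheMu, the goroutine blocks on itself. A hedged sketch, reusing the ChannelGraph stand-in above; the call graph here is an assumption, not lnd's exact one:

```go
// onCommit mimics a batch-scheduler callback that runs while the
// scheduler holds cacheMu.
func (c *ChannelGraph) onCommit() {
	c.cacheMu.Lock()
	defer c.cacheMu.Unlock()

	c.updateEdgeCache() // calls back into the cache update...
}

// updateEdgeCache acquires cacheMu again on the same goroutine:
// sync.Mutex is not reentrant, so this second Lock never returns.
func (c *ChannelGraph) updateEdgeCache() {
	c.cacheMu.Lock() // deadlock: lock already held by this goroutine
	defer c.cacheMu.Unlock()
	// ... cache mutation ...
}
```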
The relevant hunk in DisconnectBlockAtHeight:

```go
s.cacheMu.Lock()
for _, channel := range removedChans {
	s.rejectCache.remove(channel.ChannelID)
	s.chanCache.remove(channel.ChannelID)
}
s.cacheMu.Unlock()
```
Force-pushed from 33b1727 to 1acf309.
bitromortac left a comment:
I think the changes look good, pending a pathfinding benchmark to see if the caching is relevant.
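For reference, a minimal benchmark sketch for the parsing cost in question (the key is an arbitrary valid compressed secp256k1 point, used purely for illustration):

```go
package graphdb

import (
	"encoding/hex"
	"testing"

	"github.com/btcsuite/btcd/btcec/v2"
)

// BenchmarkParsePubKey measures the raw parse the removed cache used
// to amortize; run with `go test -bench=ParsePubKey -benchmem`.
func BenchmarkParsePubKey(b *testing.B) {
	keyBytes, err := hex.DecodeString(
		"02c6047f9441ed7d6d3045406e95c07cd85c778e4b8ce" +
			"f3ca7abac09b95c709ee5",
	)
	if err != nil {
		b.Fatal(err)
	}

	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if _, err := btcec.ParsePubKey(keyBytes); err != nil {
			b.Fatal(err)
		}
	}
}
```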
The hunk under discussion, where the cached key sits next to the raw bytes:

```go
// NodeKey1Bytes is the raw public key of the first node.
NodeKey1Bytes [33]byte
nodeKey1      *btcec.PublicKey
```
caching came from #706, I think it would be good to do a pathfinding benchmark and/or profile
cool yeah the other option is just to only store the *btcec.PublicKey and not the bytes here
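A sketch of that alternative (names are assumptions): store only the parsed key and serialize to bytes on demand, so there is a single source of truth and nothing to race on.

```go
package graphdb

import "github.com/btcsuite/btcd/btcec/v2"

// ChannelEdgeInfo here keeps only the parsed key, not the raw bytes.
type ChannelEdgeInfo struct {
	nodeKey1 *btcec.PublicKey
}

// NodeKey1Bytes derives the compressed bytes from the stored key.
func (c *ChannelEdgeInfo) NodeKey1Bytes() [33]byte {
	var b [33]byte
	copy(b[:], c.nodeKey1.SerializeCompressed())
	return b
}
```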
other option is just to leave the caching but use a sync.Once or something
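A sketch of the sync.Once variant (field names are assumptions): the cache stays, but the one-time parse becomes race-free.

```go
package graphdb

import (
	"sync"

	"github.com/btcsuite/btcd/btcec/v2"
)

// Node caches the parsed key behind a sync.Once so concurrent
// callers parse at most once and never race on the write.
type Node struct {
	PubKeyBytes [33]byte

	pubKeyOnce sync.Once
	pubKey     *btcec.PublicKey
	pubKeyErr  error
}

func (n *Node) PubKey() (*btcec.PublicKey, error) {
	n.pubKeyOnce.Do(func() {
		n.pubKey, n.pubKeyErr = btcec.ParsePubKey(n.PubKeyBytes[:])
	})
	return n.pubKey, n.pubKeyErr
}
```

One caveat with this approach: embedding a sync.Once makes the struct unsafe to copy (go vet's copylocks check will flag it), which matters if the node type is ever passed by value.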
I didn't see any pathfinding or graph loading performance degradation. I'm not sure about allocations, but haven't spotted anything suspicious and the call sites seem not to be relevant for pathfinding at least.
I also checked allocations and couldn't see a difference
bitromortac left a comment:
LGTM 🎉
Fixes races found in this build. See the commit messages for more details.