htlcswitch: add htlc state machine fuzz tests#10773
htlcswitch: add htlc state machine fuzz tests#10773NishantBansal2003 wants to merge 1 commit intolightningnetwork:masterfrom
Conversation
Signed-off-by: Nishant Bansal <nishant.bansal.282003@gmail.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a robust fuzz testing framework for the HTLC switch state machine in LND. By simulating various network conditions, message sequences, and node lifecycle events, the new tests aim to ensure the stability and correctness of the HTLC state machine under adversarial or malformed input scenarios. Additionally, minor improvements and safety fixes were applied to the channel database and test utilities to support these new testing capabilities. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
🔴 PR Severity: CRITICAL
🔴 Critical (3 files)
🟢 Low / Excluded (2 files — test files, not counted)
AnalysisThis PR touches two CRITICAL packages: channeldb and htlcswitch. Both are core to Lightning Network safety — The bulk of the line additions (1,632) come from No severity bump was triggered: non-test file count (3) is well under 20, and non-test lines changed (~126) are well under 500. To override, add a |
There was a problem hiding this comment.
Code Review
This pull request implements a comprehensive fuzz testing framework for the HTLC state machine, including a new fuzz_test.go file and updates to related mocks and test utilities. It also refactors AdvanceCommitChainTail in channeldb to handle missing unsigned updates more gracefully. The review feedback highlights that the fuzzer should handle signature parsing errors by skipping them instead of using require.NoError, which causes the test to fail prematurely. Additionally, a function comment requires a minor grammatical fix to adhere to the repository's style guide.
| sig, err := lnwire.NewSigFromWireECDSA(b) | ||
| require.NoError(fn.t, err) | ||
| out.CommitSig = sig |
There was a problem hiding this comment.
Using require.NoError here will cause the fuzzer to stop and report a failure whenever the random bytes provided by the fuzzer do not form a valid signature (e.g., due to scalar overflow). Since the fuzzer is intended to explore the state space with random mutations, it should handle such parsing errors gracefully by skipping the mutation instead of failing the test.
sig, err := lnwire.NewSigFromWireECDSA(b)
if err != nil {
return
}
out.CommitSig = sig| sig, err := lnwire.NewSigFromWireECDSA(fn.getBytes(64)) | ||
| require.NoError(fn.t, err) | ||
| mutatedHtlcSigs = append(mutatedHtlcSigs, sig) |
There was a problem hiding this comment.
Similar to the CommitSig mutation, using require.NoError when parsing random bytes into an HTLC signature will cause the fuzzer to crash on validly generated random data that happens to be an invalid scalar. This should be handled by skipping the addition of the signature if parsing fails.
sig, err := lnwire.NewSigFromWireECDSA(fn.getBytes(64))
if err != nil {
continue
}
mutatedHtlcSigs = append(mutatedHtlcSigs, sig)| } | ||
| } | ||
|
|
||
| // maybeReorderMessages conditionally reorder outbound messages from the remote |
There was a problem hiding this comment.
The function comment for maybeReorderMessages contains a grammatical error. It should use the third-person singular form "reorders" to be a complete sentence, as required by the style guide.
| // maybeReorderMessages conditionally reorder outbound messages from the remote | |
| // maybeReorderMessages conditionally reorders outbound messages from the remote |
References
- Function comments should be complete sentences and begin with the function name. (link)
There was a problem hiding this comment.
Hello @NishantBansal2003, as a first step in the review I ran the fuzz test on a dedicated machine (24 cores, 64GB RAM), and as you can see below, the fuzzing process eventually hung.
For context, I’m also working on a similar problem (link), and I ran into the same issue. I was able to resolve it by mocking signature operations (sign/verify), key derivation, and running the database in memory. This increased the execution rate from a few executions per second (similar to what you’re seeing) to hundreds per second.
go test -v -run=^$ -fuzz=FuzzHTLCStates -parallel=16
fuzz: elapsed: 41m6s, execs: 74208 (3/sec), new interesting: 541 (total: 544)
fuzz: elapsed: 41m9s, execs: 74211 (1/sec), new interesting: 542 (total: 545)
fuzz: elapsed: 41m12s, execs: 74214 (1/sec), new interesting: 544 (total: 547)
fuzz: elapsed: 41m15s, execs: 74220 (2/sec), new interesting: 544 (total: 547)
fuzz: elapsed: 41m18s, execs: 74227 (2/sec), new interesting: 544 (total: 547)
fuzz: elapsed: 41m21s, execs: 74235 (3/sec), new interesting: 544 (total: 547)
fuzz: elapsed: 41m24s, execs: 74240 (2/sec), new interesting: 544 (total: 547)
fuzz: elapsed: 41m27s, execs: 74248 (3/sec), new interesting: 546 (total: 549)
fuzz: elapsed: 41m30s, execs: 74254 (2/sec), new interesting: 548 (total: 551)
fuzz: elapsed: 41m33s, execs: 74262 (3/sec), new interesting: 549 (total: 552)
fuzz: elapsed: 41m36s, execs: 74271 (3/sec), new interesting: 550 (total: 553)
fuzz: elapsed: 41m39s, execs: 74292 (7/sec), new interesting: 550 (total: 553)
fuzz: elapsed: 41m40s, execs: 74292 (0/sec), new interesting: 550 (total: 553)
--- FAIL: FuzzHTLCStates (2500.07s)
fuzzing process hung or terminated unexpectedly while minimizing: EOF
Failing input written to testdata/fuzz/FuzzHTLCStates/72e690e1fb2396a9
To re-run:
go test -run=FuzzHTLCStates/72e690e1fb2396a9
=== NAME
FAIL
exit status 1
FAIL github.com/lightningnetwork/lnd/htlcswitch 2500.205s
root@m24314:~/lnd/htlcswitch# go test -run=FuzzHTLCStates/72e690e1fb2396a9
PASS
ok github.com/lightningnetwork/lnd/htlcswitch 0.447s
This commit adds fuzz tests for the normal channel operations state machine. The fuzzer provides inputs, and based on those inputs, we simulate various random HTLC switch states. These include sending add_htlc, commit_sig, update_fee, revoke_ack, fulfill_htlc, fail_htlc, malformed_htlc, updating the block height, and restarting the node. Both valid and malformed messages, along with invalid HTLC ordering, are used to help uncover any unexpected behavior caused by malicious nodes.
In these fuzz tests, there are local and remote channel links. The local node is honest, while the remote node can behave either normally or as an attacker and may act arbitrarily. The main property we are testing is that no state should cause a panic in the victim node. Any link failures on the victim node are expected (we do not check link failures on the attacker node, since arbitrary messages sent to the victim may result in unexpected responses being sent back). Additionally, after each state transition, channel invariants are verified, such as total balance consistency, no loss of funds, and ensuring that the commitment height delta between peers is at most 1.
A lot of thorough review has been done in NishantBansal2003#7, and I think it is now a good time to open this upstream officially. After running it for a long time, I haven’t found any major bugs in the htlcswitch codebase. Also, this PR depends on #10622, so for now I have included the changes from that PR, but will rebase once it is merged.
Since these fuzz tests cover extensive state combinations, I believe it would be beneficial to run them continuously. This would serve as a regression check for the existing code and could help uncover bugs introduced by future changes.
cc: @morehouse