Skip to content

Conversation

@MFormenti
Copy link

@MFormenti MFormenti commented Nov 19, 2025

Adds infrastructure to support partial results by providing scan implementations access to their ScanJobDescription context during execution

Added:

  • JobStatus.RUNNING enum value for in-progress scans
  • getCurrentJobDescription() method in BulkScanWorker to access current job UUID via ThreadLocal
  • ThreadLocal context management in handle() method

mattiaformenti added 3 commits November 14, 2025 10:39
- Add RUNNING status to JobStatus enum
- Add ThreadLocal in BulkScanWorker to pass ScanJobDescription to scan implementations
- Add upsertPartialScanResult() method to IPersistenceProvider interface
- Implement upsertPartialScanResult() in MongoPersistenceProvider
- Update DummyPersistenceProvider test implementation
- Update BulkScanWorkerManager to pass ScanJobDescription to worker.handle()

This enables real-time persistence of intermediate scan results to MongoDB as probes complete during TLS scans.
@MFormenti MFormenti changed the title Add support for scan partial results tracking Add support for partial results tracking on a scan Nov 19, 2025
… input.

scanTarget can be accessed from scanJobDescription, and the UUID in scanJobDescription is needed to write partial results to MongoDB during a scan
@juaninf
Copy link
Contributor

juaninf commented Nov 20, 2025

Add one more test using DummyPersistenceProvider to check the partial results.

Copy link
Contributor

@juaninf juaninf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check my comments

mattiaformenti added 3 commits November 26, 2025 11:29
@ic0ns
Copy link
Contributor

ic0ns commented Nov 26, 2025

looks good to me @XoMEX are you ok with this?

@XoMEX
Copy link
Member

XoMEX commented Dec 1, 2025

I don't get what getCurrentJobDescription is used for. Can you explain who calls this function and to do what.
Right now it feels weird to put a variable that is bound to the stack into a threadlocal. I feel like the current threading model is a bit hacky and do not feel comfortable making assumptions about it.

@MFormenti
Copy link
Author

getCurrentJobDescription is used by a function ( TlsScanWorker.scan() ) that I modified in TLS-Crawler-Development (there is currently an open draft PR in TLS-Crawler-Development).
The reason why I modified this function is because we want to track partial results during a scan to updated our webapp's UI as the scan progresses, without needing to wait for the whole scan to finish. Partial results are stored in a MongoDB database, so to write partial results to MongoDB while the scan is running, the scan function needs access to the UUID of a scan.

@XoMEX
Copy link
Member

XoMEX commented Dec 1, 2025

Do I get it correctly, that you modify the mongodb also from the TLS-Crawler then?
To me this feels like a bad idea to do this in two places. I'd propose to modify the Future returned by handle to support returning partial results which are then persisted into the db by the crawler core.

…al results persistence.

Rework Worker, BulkScanWorker and BulkScanWorkerManager handle method to use ScheduledScan instead of Future<Document> and removed use of ThreadLocal.
@MFormenti
Copy link
Author

yes, that is correct, mongoDB was also modified in TLS-Crawler. I changed it so now mongoDB is only modified in Crawler Core.
I added a new class, called ScheduleScan, that acts as a wrapper for Future<Document> and implements partial results persistence.
Additionally, ThreadLocal is no longer needed in Crawler Core so I removed it.

Copy link
Member

@XoMEX XoMEX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool that you got it working that fast.
I am still a bit unhappy about how the partial results are passed back to the scanner. In my head I envision that the report is observable, and as soon as it is updated, the new data is persisted into the DB. But that comes with other issues (like, do we want to write all changes?) and a redesign of the scanner result class. I am not sure if it is worth the effort.

Other than that, I left some minor comments

Comment on lines +26 to +29
public class ScheduledScan {

private volatile Document currentResult;
private final CompletableFuture<Document> finalResult = new CompletableFuture<>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reasoning for wrapping the future instead of extending it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also find the name to not capture the functionality.
If this would extend the future, I'd call it something like ProgressableFuture<Document>.

Comment on lines +196 to +198
public void setPersistenceProvider(IPersistenceProvider persistenceProvider) {
this.persistenceProvider = persistenceProvider;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead pass this in the constructor? Is there a use-case to having this as a variable instead of a constant?

* @param partialResult The partial result document to persist
*/
protected void persistPartialResult(ScanJobDescription jobDescription, Document partialResult) {
if (persistenceProvider != null) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this check?

Comment on lines -101 to +127
public abstract Document scan(ScanTarget scanTarget);
public abstract Document scan(ScanJobDescription jobDescription, ScheduledScan scheduledScan);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dislike having an "out parameter". The only alternative I see would be a functional interface to feed the partial results into. But I guess that isn't that much of a difference.

* @param jobDescription The job description for the scan
* @param partialResult The partial result document to persist
*/
protected void persistPartialResult(ScanJobDescription jobDescription, Document partialResult) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this not somehow linked to ScheduledScan.updateResult?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see two approaches:
Either this function updates the results in the ScheduledScan object
Or this function is registered as an observer for updates on the ScheduledScan object; i.e. if updateResult is called, it fires an even that causes persistPratialResult to be called.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or are there cases where one function is called and not the other?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants