@SoggyRhino
Contributor

@SoggyRhino SoggyRhino commented Sep 12, 2025

I got carried away and basically rewrote the whole thing, but I think it's a lot cleaner now.

Resume

  • Treats the outDir as the source of truth
  • If the resume flag is set, we find the last prefix dir that has all the {id}.html files specified by the coursebook and make that the starting prefix.
  • If a startPrefix is specified, we start from there regardless of resume.
  • Finally, if resume is passed, we do not re-scrape any sections that are already in the outDir; if it is not, any sections we already have are overwritten.
  • The scraper should behave exactly as it did before if resume is not passed.

Validate

  • Goes over the outDir and checks that all prefixes are complete.

Logging

  • I added more logging, specifically around timings, so that it is easier to figure out what is going on when the program is hanging.

Rate Limiting

  • Reduced the sleep between sections from 3s to 400ms; I saw no stability issues on my end.
  • Added a 10s timeout to the HTTP client to avoid hanging. The only times I got close to 10s were when I think it was rate limiting us, and it was more efficient to just try again.
  • Removed fetching a new token for every prefix, as there really isn't a point.
  • Should now finish several times faster

@mikehquan19
Contributor

I will review them soon. Thanks!
I think refactoring the code to be cleaner is really part of what I want to improve for the codebase.

@mikehquan19 mikehquan19 self-requested a review September 23, 2025 16:54
Contributor

@mikehquan19 mikehquan19 left a comment

This looks good. I think if the -resume flag functions consistently moving forward, we might not need -startPrefix anymore.

I will do some clean-up with the logging, throttle, and token refresh to bring back the status quo, making sure it can work on every end.

And btw, next time try to implement new features without rewriting too much.

@mikehquan19 mikehquan19 merged commit cc04bdd into UTDNebula:develop Sep 24, 2025
2 checks passed