
feat: Implement thread-safe rate limiting for network session classes#306

Open
itsmeknt wants to merge 13 commits into speedyapply:main from itsmeknt:feat/rate_limiter

Conversation

@itsmeknt

Summary
This pull request introduces a new, thread-safe rate-limiting capability to the RequestsRotating and TLSRotating session classes. Users can now specify a minimum and maximum delay between requests to avoid overwhelming servers or triggering anti-bot mechanisms; before each request, a delay interval is chosen uniformly at random between the minimum and maximum.

Motivation
When performing automated requests at a high frequency, it's common to encounter rate limits imposed by the target server (e.g., HTTP 429 "Too Many Requests"). This can result in temporary or permanent IP blocks.

Implementing a client-side rate limiter provides a robust mechanism to control request frequency, increasing the reliability and success rate of long-running tasks and promoting more responsible scraping/automation practices.

Implementation Details
New RateLimiter Class:
A dedicated, reusable RateLimiter class has been created to encapsulate all rate-limiting logic.
It uses a threading.Lock to ensure atomicity, making it safe for use across multiple threads sharing the same session object.
It utilizes time.monotonic() for accurate time interval measurement, which is not affected by system time changes.
It supports both fixed delays (by providing rate_delay_min only) and randomized delays within a range (by providing both rate_delay_min and rate_delay_max).
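The points above can be sketched roughly as follows. This is a hypothetical reconstruction based on the description in this PR (class and parameter names are taken from the PR text; the internals may differ from the actual implementation):

```python
import random
import threading
import time


class RateLimiter:
    """Sketch of a thread-safe limiter enforcing a delay between requests."""

    def __init__(self, rate_delay_min=None, rate_delay_max=None):
        self.rate_delay_min = rate_delay_min
        self.rate_delay_max = rate_delay_max
        self._lock = threading.Lock()  # makes enforce_delay atomic across threads
        self._last_request = None      # monotonic timestamp of the last request

    def enforce_delay(self):
        if self.rate_delay_min is None:
            return  # rate limiting disabled
        with self._lock:
            # Fixed delay if only a minimum is given; otherwise a delay
            # drawn uniformly at random from [min, max].
            if self.rate_delay_max is None:
                delay = self.rate_delay_min
            else:
                delay = random.uniform(self.rate_delay_min, self.rate_delay_max)
            # time.monotonic() is unaffected by system clock changes.
            now = time.monotonic()
            if self._last_request is not None:
                remaining = self._last_request + delay - now
                if remaining > 0:
                    time.sleep(remaining)
            self._last_request = time.monotonic()
```

Because the lock is held while sleeping, concurrent threads sharing one session are serialized through the limiter, which is what spaces their requests apart.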

Integration with Session Classes:
The `__init__` methods of RequestsRotating and TLSRotating have been updated to accept rate_delay_min and rate_delay_max arguments.
An instance of RateLimiter is created within each session.
A call to self.rate_limiter.enforce_delay() has been added at the very beginning of the request() and execute_request() methods to ensure the delay is enforced before any network activity occurs.
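The wiring into the session classes might look roughly like this. Again a hypothetical sketch (the class and method names follow the PR text; the minimal limiter stand-in and the placeholder request body are illustrative only):

```python
import threading
import time


class RateLimiter:
    """Minimal stand-in for the limiter; enforces a fixed minimum delay."""

    def __init__(self, rate_delay_min=None, rate_delay_max=None):
        self.delay = rate_delay_min or 0
        self._lock = threading.Lock()
        self._last = None

    def enforce_delay(self):
        with self._lock:
            now = time.monotonic()
            if self._last is not None and now - self._last < self.delay:
                time.sleep(self._last + self.delay - now)
            self._last = time.monotonic()


class RequestsRotating:
    """Sketch of the session integration described in the PR."""

    def __init__(self, rate_delay_min=None, rate_delay_max=None):
        # One limiter per session, shared by every thread using this session.
        self.rate_limiter = RateLimiter(rate_delay_min, rate_delay_max)

    def request(self, method, url):
        # The delay is enforced before any network activity occurs.
        self.rate_limiter.enforce_delay()
        return f"{method} {url}"  # placeholder for the real HTTP call
```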

How to Use
Users can now instantiate the session classes with the new rate-limiting parameters:
```python
all_job_posts = scrape_jobs(
    site_name=["indeed"],  # , "linkedin", "zip_recruiter", "google"], # "glassdoor", "bayt", "naukri", "bdjobs"
    search_term="software",
    google_search_term="software engineer jobs near San Francisco, CA since yesterday",
    location="San Francisco, CA",
    results_wanted=num_job_posts,
    hours_old=30*24,
    country_indeed='USA',
    rate_delay_min=1,  # in seconds
    rate_delay_max=2,  # in seconds
)
```

@zachramsey

Bump @cullenwatson
This PR looks good to me -- clean and well-documented. I think this would be a very beneficial enhancement to the scraper. A local rate-limiter offers a safeguard for users who have not set up a proxy, and it is just good practice in general.

