A threaded GitHub scraper that fetches repository trees and file content, cycling through multiple API tokens to stay within rate limits.
- Multi-threaded scraping of GitHub repositories.
- Handles GitHub API rate limits by cycling through provided tokens.
- Fetches repository trees and blobs.
- Saves structured tree data and content to disk.
- Customizable query, output directory, and threading options.
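The token-cycling behavior listed above can be sketched as a rotating token pool (a minimal illustration only; `TokenPool` and its methods are hypothetical, not part of the package API):

```python
from itertools import cycle

class TokenPool:
    """Hypothetical sketch of rate-limit handling: when one token
    hits the GitHub rate limit (HTTP 403), move on to the next."""

    def __init__(self, tokens):
        if not tokens:
            raise ValueError("at least one token is required")
        self._cycle = cycle(tokens)
        self.current = next(self._cycle)

    def rotate(self):
        # Switch to the next token after a rate-limit response;
        # itertools.cycle wraps around when the list is exhausted.
        self.current = next(self._cycle)
        return self.current

pool = TokenPool(["ghp_XXXX", "ghp_YYYY"])
first = pool.current    # "ghp_XXXX"
second = pool.rotate()  # "ghp_YYYY" after a 403
third = pool.rotate()   # wraps back to "ghp_XXXX"
```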
```python
from ghtoolscraper_rchotacode import ThreadedScraper

tokens = ["ghp_XXXX", "ghp_YYYY"]  # List of GitHub tokens

scraper = ThreadedScraper(
    query="language:python",
    target="README.md",
    tokens=tokens,
    per_page=10,
    max_threads=5,
    output_dir="./repos"
)

scraper.scrape(page_start=1)
```

- Python 3.7+
- `ghtoolscraper_rchotacode` package installed through the `pyproject.toml`
- `query`: GitHub search query.
- `target`: File or directory to target in the repo tree.
- `tokens`: List of GitHub API tokens.
- `per_page`: Results per page (default: 10).
- `max_threads`: Number of concurrent threads (default: 5).
- `output_dir`: Directory to save results (default: `./repos`).
- `timeout`: Delay between requests (default: 1 second).
- Repository trees and content are saved in the specified output directory as JSON files.
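The saved results can be inspected with the standard library; a minimal sketch, assuming one JSON file per repository under the output directory (the file naming and schema here are assumptions, not documented by the package):

```python
import json
from pathlib import Path

# Iterate over whatever JSON files the scraper wrote to output_dir.
output_dir = Path("./repos")
for tree_file in sorted(output_dir.glob("*.json")):
    with tree_file.open() as f:
        tree = json.load(f)
    # Print the file name and a rough size indicator for its contents.
    size = len(tree) if isinstance(tree, (list, dict)) else 1
    print(tree_file.name, size)
```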
MIT