A headless TypeScript pipeline that discovers GitHub developers, scores their open-source contributions, and stores ranked profiles in PostgreSQL — built for leaderboards and talent discovery.
Discovery → Scrape → Score → Analyze, in four stages:
- Discover — Finds GitHub users by location and follower range via the Search API
- Scrape — Fetches repos, languages, topics, and merged PRs via GraphQL
- Score — Computes
score = stars × (userPRs / totalPRs)per repo, then sums to a total - Analyze — Categorizes each user across five skill areas: AI, Backend, Frontend, DevOps, Data
All GraphQL responses are cached in PostgreSQL (SHA-256 keyed, 30-day TTL). Multiple GitHub tokens are rotated automatically based on remaining quota.
Prerequisites: Node.js 20+, pnpm, PostgreSQL
git clone https://github.com/chemicoholic21/github-data-pipeline.git
cd github-data-pipeline
pnpm install
cp .env.example .env # add DATABASE_URL and GitHub tokens
pnpm db:pushRun the pipeline:
pnpm bulk-discover "Chennai"
pnpm bulk-discover "San Francisco" 0 1 # with start index and pageThis pipeline:
- Discovers GitHub users by location and follower range using the GitHub Search API (
@octokit/rest) - Stage 1 (SCRAPE): Fetches deep profile data (repos, languages, topics, merged PRs) via the GitHub GraphQL API →
github_users,github_repos,github_pull_requests - Stage 2 (COMPUTE): Calculates per-repository scores using
score = stars × (userPRs / totalPRs)→user_repo_scores - Stage 3 (AGGREGATE): Sums all repo scores →
user_scores, syncs toleaderboard - Stage 4 (ANALYZE): Categorizes repos by topics/languages, computes skill scores across five categories (AI, Backend, Frontend, DevOps, Data) →
analyses - Caches all GitHub API responses in
api_cache(SHA-256 keyed) to avoid redundant requests - Manages a pool of multiple GitHub tokens, automatically rotating to the token with the highest remaining quota via
token_rate_limit
⚠️ Scraping GitHub data takes a long time — hours for large regions. Always run inside a tmux session so it survives disconnects.tmux new -s pipeline # start a named session # ... run your command ... # Ctrl+B, D to detach # safely disconnect tmux attach -t pipeline # reattach later
Discovers developers by location, fetches their repos and PRs, scores them, and writes everything to the database.
# Single region
pnpm bulk-discover "Bengaluru"
# Multiple regions in one run
pnpm bulk-discover "Bengaluru, San Francisco, London, Berlin, Mumbai, Beijing"
# Resume from a specific range index and page (useful after a crash or rate limit)
pnpm bulk-discover "Bengaluru, San Francisco" 0 5 # start at range 0, page 5
pnpm bulk-discover "Bengaluru, San Francisco" 2 1 # start at range 2, page 1If you've already scraped data and just need to (re)populate the leaderboard — use this. Reads entirely from api_cache, no GitHub calls made.
# Populate everything from cache
npx tsx src/scripts/populate-leaderboard-from-cache.ts
# Only process users not yet in the leaderboard (safest for large caches)
npx tsx src/scripts/populate-leaderboard-from-cache.ts --only-missing
# Preview what would run without writing anything
npx tsx src/scripts/populate-leaderboard-from-cache.ts --dry-run --limit=10
# Process a single user
npx tsx src/scripts/populate-leaderboard-from-cache.ts --username=torvalds
# Resume from a specific offset
npx tsx src/scripts/populate-leaderboard-from-cache.ts --offset=1000 --limit=500Use these to recompute scores or refresh the leaderboard after schema changes or bulk imports. Much faster than the TypeScript equivalents.
pnpm sql:populate-analyses # recompute skill scores from repos + PRs (~2 min for 72K users)
pnpm sql:populate-leaderboard # sync scored users → leaderboard (~30s for 72K users)Run populate-analyses before populate-leaderboard if recomputing from scratch.
repo_score = stars × (user_merged_prs / total_merged_prs)
- Repos with fewer than 10 stars are excluded
- Score is capped at 10,000 per repo
- Total score is the sum across all qualifying repos
Experience levels:
| Score | Label |
|---|---|
| < 10 | Newcomer |
| 10–99 | Contributor |
| 100–499 | Active Contributor |
| 500–1,999 | Core Contributor |
| ≥ 2,000 | Open Source Leader |
Full schema is in schema.sql — readable as-is. Tables are grouped into three layers:
| Layer | Tables | Purpose |
|---|---|---|
| Pipeline | github_users, github_repos, github_pull_requests, user_repo_scores, user_scores, analyses, leaderboard |
Stores scraped data, computed scores, skill breakdowns, and final rankings |
| Infrastructure | api_cache, token_rate_limit |
Caches GraphQL responses (30-day TTL) and tracks rate limit state per token |
| Legacy | users, repos |
Kept for backward compatibility; auto-synced during Stage 3 |