Weekly-updated library of viral social media hook patterns. Crawls 1,000+ posts/week, extracts hooks, classifies patterns, and surfaces emerging trends.
The Hook Mining Engine is an automated pipeline that:
- Crawls viral social media posts from Reddit, HackerNews, LinkedIn, and Twitter
- Extracts attention-grabbing opening hooks using NLP heuristics
- Classifies hooks into 12 canonical pattern types via GPT-4o-mini
- Embeds hooks as vectors for semantic similarity search (pgvector)
- Detects trending and emerging hook patterns week-over-week
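To make the storage side concrete, here is a minimal sketch of the kind of pgvector similarity lookup the semantic search could run. The `hooks` table, `embedding` column, embedding model, and DSN are illustrative assumptions, not the project's actual schema:

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# Embed the query with the same model used at ingest time (model choice assumed)
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("how to grow a newsletter")

with psycopg.connect("postgresql://localhost/hooks") as conn:
    register_vector(conn)  # register the pgvector type with psycopg
    rows = conn.execute(
        # <=> is pgvector's cosine-distance operator (smaller = more similar)
        """
        SELECT text, hook_type, 1 - (embedding <=> %s) AS similarity
        FROM hooks
        ORDER BY embedding <=> %s
        LIMIT 10
        """,
        (query_vec, query_vec),
    ).fetchall()

for text, hook_type, similarity in rows:
    print(f"{similarity:.3f}  [{hook_type}]  {text}")
```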
| Pattern | Description | Example |
|---|---|---|
| `curiosity_gap` | Creates information gaps | "Nobody tells you this about startups..." |
| `transformation` | Shows before → after | "I went from $0 to $1M in 12 months" |
| `contrarian` | Challenges conventional wisdom | "Everything you know about SEO is wrong" |
| `social_proof` | Leverages numbers/authority | "After 10 years building SaaS companies..." |
| `how_to_promise` | Promises specific knowledge | "Here's exactly how to grow your newsletter" |
| `pain_point` | Targets specific struggles | "Stop making this copywriting mistake" |
| `authority` | Establishes credibility first | "I've sold $50M in online courses. Here's..." |
| `list_format` | Uses numbered lists | "7 frameworks that 10x your conversion rate" |
| `question_hook` | Opens with engaging questions | "What if you could double your traffic?" |
| `urgency` | Creates time pressure | "This strategy works right now in 2025" |
| `story_opener` | Starts with narrative | "In 2019, I was broke and sleeping on a couch" |
| `bold_claim` | Makes provocative assertions | "Email marketing is dead. Here's what works." |
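As a rough illustration of how these labels could drive the GPT-4o-mini classifier, here is a hedged sketch. The prompt wording and function are illustrative; the project's actual code lives in `src/classifier/` and may differ:

```python
from openai import OpenAI

# The 12 canonical pattern labels (mirrors the table above)
PATTERNS = [
    "curiosity_gap", "transformation", "contrarian", "social_proof",
    "how_to_promise", "pain_point", "authority", "list_format",
    "question_hook", "urgency", "story_opener", "bold_claim",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_hook(hook: str) -> str:
    """Ask GPT-4o-mini to pick the single best-fitting pattern label."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Classify the social media hook into exactly one of "
                           "these pattern types and reply with the label only: "
                           + ", ".join(PATTERNS),
            },
            {"role": "user", "content": hook},
        ],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in PATTERNS else "unclassified"

print(classify_hook("I went from $0 to $1M in 12 months"))  # transformation
```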
```
┌──────────────────────────────────────────────────────────────┐
│                      Hook Mining Engine                      │
├─────────────────┬─────────────────┬──────────────────────────┤
│  Data Sources   │ Pipeline Tasks  │      Serving Layer       │
│                 │                 │                          │
│  Reddit (PRAW)  │  crawl_task     │  FastAPI REST API        │
│  HackerNews     │  extract_task   │  Streamlit Dashboard     │
│  LinkedIn       │  classify_task  │  Prometheus Metrics      │
│  Twitter/X      │  trend_task     │  Grafana Dashboards      │
├─────────────────┴─────────────────┴──────────────────────────┤
│       PostgreSQL + pgvector  │  Redis (Broker + Cache)       │
└──────────────────────────────────────────────────────────────┘
```
- API: FastAPI + Uvicorn (async, OpenAPI docs)
- Tasks: Celery + Redis (distributed pipeline)
- Database: PostgreSQL 16 + pgvector (vector similarity search)
- Cache/Broker: Redis 7 (task broker + bloom filter + caching)
- NLP: spaCy (sentence splitting), sentence-transformers (embeddings)
- LLM: OpenAI GPT-4o-mini (hook classification)
- Monitoring: Prometheus + Grafana
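To show how the NLP pieces fit together, here is a minimal sketch of hook extraction plus embedding with the listed libraries. The first-sentence heuristic and the embedding model are assumptions for illustration, not the project's actual extractor:

```python
import spacy
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")
model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice assumed

def extract_hook(post_text: str) -> str:
    """Heuristic: treat the first sentence of the post as its hook."""
    sentences = [s.text.strip() for s in nlp(post_text).sents]
    return sentences[0] if sentences else post_text.strip()

def embed_hook(hook: str) -> list[float]:
    """Embed a hook for pgvector similarity search."""
    return model.encode(hook, normalize_embeddings=True).tolist()

hook = extract_hook("Nobody tells you this about startups. It took me years.")
vector = embed_hook(hook)  # 384 dimensions for all-MiniLM-L6-v2
```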
- Python 3.11+
- Docker & Docker Compose
- OpenAI API key
- Reddit API credentials
```bash
git clone https://github.com/your-org/hook-mining-engine.git
cd hook-mining-engine
cp .env.example .env
# Edit .env with your API keys
```

```bash
# Start all services (PostgreSQL, Redis, API, Worker, Dashboard, Monitoring)
docker compose up -d

# Or start infrastructure only for local development
docker compose up -d postgres redis
```

```bash
# Apply all migrations
alembic upgrade head

# Seed the pattern taxonomy
python scripts/seed_pattern_taxonomy.py

# Crawl 50 posts from Reddit
python scripts/run_single_crawl.py --platform reddit --limit 50
```

| Service | URL | Description |
|---|---|---|
| API | http://localhost:8000 | REST API |
| API Docs | http://localhost:8000/docs | OpenAPI / Swagger |
| Dashboard | http://localhost:8501 | Streamlit analytics |
| Flower | http://localhost:5555 | Celery task monitor |
| Grafana | http://localhost:3000 | Metrics dashboards |
| Prometheus | http://localhost:9090 | Metrics collector |
```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install dependencies
pip install -e ".[dev,test]"

# Download spaCy model
python -m spacy download en_core_web_sm
```

```bash
# Start the API server
uvicorn src.library.api.main:app --reload --port 8000

# Start a Celery worker
celery -A src.pipeline.celery_app.celery_app worker --loglevel=info

# Start Celery Beat (scheduler)
celery -A src.pipeline.celery_app.celery_app beat --loglevel=info
```

```bash
# Run tests (optionally with coverage)
pytest
pytest --cov=src --cov-report=html

# Lint and type-check
ruff check src/
mypy src/
```
```
hook-engine/
├── config/
│   ├── settings.py            # Central configuration (Pydantic)
│   └── logging.yaml           # Logging configuration
├── src/
│   ├── classifier/            # GPT-4o-mini hook classification
│   │   ├── pattern_taxonomy.py
│   │   ├── cluster_analyzer.py
│   │   └── prompts/
│   ├── crawler/               # URL deduplication + Scrapy spiders
│   │   ├── dedup_filter.py
│   │   └── scrapy_project/
│   ├── extractor/             # Hook extraction pipeline
│   │   ├── hook_extractor.py
│   │   ├── sentence_splitter.py
│   │   ├── embedding_generator.py
│   │   └── virality_scorer.py
│   ├── library/               # API, models, repositories
│   │   ├── api/               # FastAPI routers + schemas
│   │   ├── models/            # SQLAlchemy ORM models
│   │   ├── repositories/      # Data access layer
│   │   └── dashboard/         # Streamlit analytics
│   ├── pipeline/              # Celery task orchestration
│   │   ├── celery_app.py
│   │   ├── workflow.py
│   │   └── tasks/
│   ├── sources/               # Platform API integrations
│   │   ├── base_source.py
│   │   ├── reddit_source.py
│   │   ├── hackernews_source.py
│   │   ├── linkedin_source.py
│   │   └── twitter_source.py
│   └── utils/                 # Shared utilities
│       ├── db.py
│       ├── redis_client.py
│       ├── http_client.py
│       └── observability.py
├── scripts/                   # CLI utilities
├── k8s/                       # Kubernetes manifests
├── monitoring/                # Prometheus + Grafana configs
├── alembic/                   # Database migrations
├── docker-compose.yml
├── Dockerfile
└── pyproject.toml
```
- `GET /v1/hooks`: List hooks with filters (platform, type, score)
- `GET /v1/hooks/{id}`: Get hook details
- `POST /v1/hooks/semantic-search`: Semantic similarity search
- `GET /v1/hooks/export/csv`: Export hooks as CSV
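For example, a semantic search request might look like the following. The request body shape here is an assumption; check the live OpenAPI docs at `/docs` for the actual schema:

```python
import httpx

resp = httpx.post(
    "http://localhost:8000/v1/hooks/semantic-search",
    json={"query": "grow a newsletter from scratch", "limit": 5},
)
resp.raise_for_status()
for hook in resp.json():
    print(hook)
```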
- `GET /v1/patterns`: List all patterns with weekly stats
- `GET /v1/patterns/{hook_type}`: Get pattern details + trend history
- `GET /v1/crawl-runs`: List crawl run history
- `GET /v1/crawl-runs/{id}`: Get crawl run details
- `GET /health`: Liveness probe
- `GET /health/ready`: Readiness probe (checks DB + Redis)
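A readiness probe of this shape typically pings both backing stores before reporting healthy. A minimal FastAPI sketch, purely illustrative of the idea rather than the project's actual handler:

```python
import redis.asyncio as aioredis
from fastapi import FastAPI, Response
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

app = FastAPI()
engine = create_async_engine("postgresql+asyncpg://localhost/hooks")  # DSN assumed
cache = aioredis.Redis()

@app.get("/health/ready")
async def ready(response: Response):
    try:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))  # database reachable?
        await cache.ping()                        # Redis reachable?
        return {"status": "ready"}
    except Exception:
        response.status_code = 503
        return {"status": "not ready"}
```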
The pipeline runs automatically every Sunday at 2:00 AM UTC via Celery Beat:
```
crawl_task → extract_task → classify_task → trend_task
```
Each stage processes data in batches and hands off to the next via Celery task chaining.
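In Celery terms, the weekly run could be wired up roughly as below. This is a sketch assuming the task names above; the project's actual `workflow.py` and beat configuration may differ:

```python
from celery import Celery, chain
from celery.schedules import crontab

app = Celery("hook_engine", broker="redis://localhost:6379/0")

@app.task
def crawl_task():
    """Crawl all platforms; return IDs of new posts."""
    ...

@app.task
def extract_task(post_ids):
    """Extract hooks from crawled posts; return hook IDs."""
    ...

@app.task
def classify_task(hook_ids):
    """Classify hooks into the 12 pattern types."""
    ...

@app.task
def trend_task(hook_ids):
    """Recompute week-over-week pattern trends."""
    ...

@app.task
def run_weekly_pipeline():
    # chain() hands each task's return value to the next stage
    chain(crawl_task.s(), extract_task.s(), classify_task.s(), trend_task.s())()

# Sundays at 02:00 (Celery's default timezone is UTC)
app.conf.beat_schedule = {
    "weekly-hook-mining": {
        "task": run_weekly_pipeline.name,
        "schedule": crontab(hour=2, minute=0, day_of_week="sunday"),
    },
}
```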
See .env.example for all available configuration options.
Required variables:
- `DATABASE_URL`: PostgreSQL connection string
- `REDIS_URL`: Redis connection string
- `LLM_PROVIDER`: The LLM provider to use (`openai`, `groq`, or `gemini`)
- `OPENAI_API_KEY`, `GROQ_API_KEY`, or `GEMINI_API_KEY`: API key for the chosen provider
- `REDDIT_CLIENT_ID` / `REDDIT_CLIENT_SECRET`: Reddit API credentials
Optional variables:
- `TWITTER_BEARER_TOKEN`: Twitter API access
- `LINKEDIN_RAPIDAPI_KEY`: LinkedIn data API access
- `SENTRY_DSN`: Error tracking
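These variables map naturally onto a Pydantic settings class; a hedged sketch of what `config/settings.py` might look like (field defaults and exact names are assumptions):

```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    """Loaded from the environment / .env; names match the variables above."""
    database_url: str
    redis_url: str
    llm_provider: str = "openai"  # openai | groq | gemini
    openai_api_key: str | None = None
    groq_api_key: str | None = None
    gemini_api_key: str | None = None
    reddit_client_id: str
    reddit_client_secret: str
    twitter_bearer_token: str | None = None
    linkedin_rapidapi_key: str | None = None
    sentry_dsn: str | None = None

    model_config = {"env_file": ".env"}

settings = Settings()
```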
MIT License. See LICENSE for details.