Weekly-updated library of viral social media hook patterns. Crawls 1,000+ posts/week, extracts hooks, classifies patterns, and surfaces emerging trends.
The Hook Mining Engine is an automated pipeline that:
- Crawls viral social media posts from Reddit, HackerNews, LinkedIn, and Twitter
- Extracts attention-grabbing opening hooks using NLP heuristics
- Classifies hooks into 12 canonical pattern types via GPT-4o-mini
- Embeds hooks as vectors for semantic similarity search (pgvector)
- Detects trending and emerging hook patterns week-over-week
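To make the storage side concrete, here is a minimal sketch of the kind of pgvector similarity lookup the semantic search could run. The `hooks` table, `embedding` column, embedding model, and DSN are illustrative assumptions, not the project's actual schema:

```python
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

# Embed the query with the same model used at ingest time (model choice assumed)
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("how to grow a newsletter")

with psycopg.connect("postgresql://localhost/hooks") as conn:
    register_vector(conn)  # register the pgvector type with psycopg
    rows = conn.execute(
        # <=> is pgvector's cosine-distance operator (smaller = more similar)
        """
        SELECT text, hook_type, 1 - (embedding <=> %s) AS similarity
        FROM hooks
        ORDER BY embedding <=> %s
        LIMIT 10
        """,
        (query_vec, query_vec),
    ).fetchall()

for text, hook_type, similarity in rows:
    print(f"{similarity:.3f}  [{hook_type}]  {text}")
```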
| Pattern | Description | Example |
|---|---|---|
| `curiosity_gap` | Creates information gaps | "Nobody tells you this about startups..." |
| `transformation` | Shows before → after | "I went from $0 to $1M in 12 months" |
| `contrarian` | Challenges conventional wisdom | "Everything you know about SEO is wrong" |
| `social_proof` | Leverages numbers/authority | "After 10 years building SaaS companies..." |
| `how_to_promise` | Promises specific knowledge | "Here's exactly how to grow your newsletter" |
| `pain_point` | Targets specific struggles | "Stop making this copywriting mistake" |
| `authority` | Establishes credibility first | "I've sold $50M in online courses. Here's..." |
| `list_format` | Uses numbered lists | "7 frameworks that 10x your conversion rate" |
| `question_hook` | Opens with engaging questions | "What if you could double your traffic?" |
| `urgency` | Creates time pressure | "This strategy works right now in 2025" |
| `story_opener` | Starts with narrative | "In 2019, I was broke and sleeping on a couch" |
| `bold_claim` | Makes provocative assertions | "Email marketing is dead. Here's what works." |
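As a rough illustration of how these labels could drive the GPT-4o-mini classifier, here is a hedged sketch. The prompt wording and function are illustrative; the project's actual code lives in `src/classifier/` and may differ:

```python
from openai import OpenAI

# The 12 canonical pattern labels (mirrors the table above)
PATTERNS = [
    "curiosity_gap", "transformation", "contrarian", "social_proof",
    "how_to_promise", "pain_point", "authority", "list_format",
    "question_hook", "urgency", "story_opener", "bold_claim",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_hook(hook: str) -> str:
    """Ask GPT-4o-mini to pick the single best-fitting pattern label."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Classify the social media hook into exactly one of "
                           "these pattern types and reply with the label only: "
                           + ", ".join(PATTERNS),
            },
            {"role": "user", "content": hook},
        ],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in PATTERNS else "unclassified"

print(classify_hook("I went from $0 to $1M in 12 months"))  # transformation
```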
```
┌──────────────────────────────────────────────────────────────┐
│                      Hook Mining Engine                      │
├─────────────────┬─────────────────┬──────────────────────────┤
│  Data Sources   │ Pipeline Tasks  │      Serving Layer       │
│                 │                 │                          │
│  Reddit (PRAW)  │  crawl_task     │  FastAPI REST API        │
│  HackerNews     │  extract_task   │  Streamlit Dashboard     │
│  LinkedIn       │  classify_task  │  Prometheus Metrics      │
│  Twitter/X      │  trend_task     │  Grafana Dashboards      │
├─────────────────┴─────────────────┴──────────────────────────┤
│       PostgreSQL + pgvector  │  Redis (Broker + Cache)       │
└──────────────────────────────────────────────────────────────┘
```
- API: FastAPI + Uvicorn (async, OpenAPI docs)
- Tasks: Celery + Redis (distributed pipeline)
- Database: PostgreSQL 16 + pgvector (vector similarity search)
- Cache/Broker: Redis 7 (task broker + bloom filter + caching)
- NLP: spaCy (sentence splitting), sentence-transformers (embeddings)
- LLM: OpenAI GPT-4o-mini (hook classification)
- Monitoring: Prometheus + Grafana
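To show how the NLP pieces fit together, here is a minimal sketch of hook extraction plus embedding with the listed libraries. The first-sentence heuristic and the embedding model are assumptions for illustration, not the project's actual extractor:

```python
import spacy
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")
model = SentenceTransformer("all-MiniLM-L6-v2")  # model choice assumed

def extract_hook(post_text: str) -> str:
    """Heuristic: treat the first sentence of the post as its hook."""
    sentences = [s.text.strip() for s in nlp(post_text).sents]
    return sentences[0] if sentences else post_text.strip()

def embed_hook(hook: str) -> list[float]:
    """Embed a hook for pgvector similarity search."""
    return model.encode(hook, normalize_embeddings=True).tolist()

hook = extract_hook("Nobody tells you this about startups. It took me years.")
vector = embed_hook(hook)  # 384 dimensions for all-MiniLM-L6-v2
```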
- Python 3.11+
- Docker & Docker Compose
- OpenAI API key
- Reddit API credentials
```bash
git clone https://github.com/your-org/hook-mining-engine.git
cd hook-mining-engine
cp .env.example .env
# Edit .env with your API keys
```

```bash
# Start all services (PostgreSQL, Redis, API, Worker, Dashboard, Monitoring)
docker compose up -d

# Or start infrastructure only for local development
docker compose up -d postgres redis
```

```bash
# Apply all migrations
alembic upgrade head

# Seed the pattern taxonomy
python scripts/seed_pattern_taxonomy.py

# Crawl 50 posts from Reddit
python scripts/run_single_crawl.py --platform reddit --limit 50
```

| Service | URL | Description |
|---|---|---|
| API | http://localhost:8000 | REST API |
| API Docs | http://localhost:8000/docs | OpenAPI / Swagger |
| Dashboard | http://localhost:8501 | Streamlit analytics |
| Flower | http://localhost:5555 | Celery task monitor |
| Grafana | http://localhost:3000 | Metrics dashboards |
| Prometheus | http://localhost:9090 | Metrics collector |
```bash
# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install dependencies
pip install -e ".[dev,test]"

# Download spaCy model
python -m spacy download en_core_web_sm
```

```bash
# Start the API server
uvicorn src.library.api.main:app --reload --port 8000

# Start a Celery worker
celery -A src.pipeline.celery_app.celery_app worker --loglevel=info

# Start Celery Beat (scheduler)
celery -A src.pipeline.celery_app.celery_app beat --loglevel=info
```

```bash
# Run tests (optionally with coverage)
pytest
pytest --cov=src --cov-report=html

# Lint and type-check
ruff check src/
mypy src/
```
```
hook-engine/
├── config/
│   ├── settings.py            # Central configuration (Pydantic)
│   └── logging.yaml           # Logging configuration
├── src/
│   ├── classifier/            # GPT-4o-mini hook classification
│   │   ├── pattern_taxonomy.py
│   │   ├── cluster_analyzer.py
│   │   └── prompts/
│   ├── crawler/               # URL deduplication + Scrapy spiders
│   │   ├── dedup_filter.py
│   │   └── scrapy_project/
│   ├── extractor/             # Hook extraction pipeline
│   │   ├── hook_extractor.py
│   │   ├── sentence_splitter.py
│   │   ├── embedding_generator.py
│   │   └── virality_scorer.py
│   ├── library/               # API, models, repositories
│   │   ├── api/               # FastAPI routers + schemas
│   │   ├── models/            # SQLAlchemy ORM models
│   │   ├── repositories/      # Data access layer
│   │   └── dashboard/         # Streamlit analytics
│   ├── pipeline/              # Celery task orchestration
│   │   ├── celery_app.py
│   │   ├── workflow.py
│   │   └── tasks/
│   ├── sources/               # Platform API integrations
│   │   ├── base_source.py
│   │   ├── reddit_source.py
│   │   ├── hackernews_source.py
│   │   ├── linkedin_source.py
│   │   └── twitter_source.py
│   └── utils/                 # Shared utilities
│       ├── db.py
│       ├── redis_client.py
│       ├── http_client.py
│       └── observability.py
├── scripts/                   # CLI utilities
├── k8s/                       # Kubernetes manifests
├── monitoring/                # Prometheus + Grafana configs
├── alembic/                   # Database migrations
├── docker-compose.yml
├── Dockerfile
└── pyproject.toml
```
- `GET /v1/hooks`: List hooks with filters (platform, type, score)
- `GET /v1/hooks/{id}`: Get hook details
- `POST /v1/hooks/semantic-search`: Semantic similarity search
- `GET /v1/hooks/export/csv`: Export hooks as CSV
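For example, a semantic search request might look like the following. The request body shape here is an assumption; check the live OpenAPI docs at `/docs` for the actual schema:

```python
import httpx

resp = httpx.post(
    "http://localhost:8000/v1/hooks/semantic-search",
    json={"query": "grow a newsletter from scratch", "limit": 5},
)
resp.raise_for_status()
for hook in resp.json():
    print(hook)
```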
- `GET /v1/patterns`: List all patterns with weekly stats
- `GET /v1/patterns/{hook_type}`: Get pattern details + trend history
- `GET /v1/crawl-runs`: List crawl run history
- `GET /v1/crawl-runs/{id}`: Get crawl run details
- `GET /health`: Liveness probe
- `GET /health/ready`: Readiness probe (checks DB + Redis)
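A readiness probe of this shape typically pings both backing stores before reporting healthy. A minimal FastAPI sketch, purely illustrative of the idea rather than the project's actual handler:

```python
import redis.asyncio as aioredis
from fastapi import FastAPI, Response
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

app = FastAPI()
engine = create_async_engine("postgresql+asyncpg://localhost/hooks")  # DSN assumed
cache = aioredis.Redis()

@app.get("/health/ready")
async def ready(response: Response):
    try:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))  # database reachable?
        await cache.ping()                        # Redis reachable?
        return {"status": "ready"}
    except Exception:
        response.status_code = 503
        return {"status": "not ready"}
```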
The pipeline runs automatically every Sunday at 2:00 AM UTC via Celery Beat:
```
crawl_task → extract_task → classify_task → trend_task
```
Each stage processes data in batches and hands off to the next via Celery task chaining.
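In Celery terms, the weekly run could be wired up roughly as below. This is a sketch assuming the task names above; the project's actual `workflow.py` and beat configuration may differ:

```python
from celery import Celery, chain
from celery.schedules import crontab

app = Celery("hook_engine", broker="redis://localhost:6379/0")

@app.task
def crawl_task():
    """Crawl all platforms; return IDs of new posts."""
    ...

@app.task
def extract_task(post_ids):
    """Extract hooks from crawled posts; return hook IDs."""
    ...

@app.task
def classify_task(hook_ids):
    """Classify hooks into the 12 pattern types."""
    ...

@app.task
def trend_task(hook_ids):
    """Recompute week-over-week pattern trends."""
    ...

@app.task
def run_weekly_pipeline():
    # chain() hands each task's return value to the next stage
    chain(crawl_task.s(), extract_task.s(), classify_task.s(), trend_task.s())()

# Sundays at 02:00 (Celery's default timezone is UTC)
app.conf.beat_schedule = {
    "weekly-hook-mining": {
        "task": run_weekly_pipeline.name,
        "schedule": crontab(hour=2, minute=0, day_of_week="sunday"),
    },
}
```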
See .env.example for all available configuration options.
Required variables:
- `DATABASE_URL`: PostgreSQL connection string
- `REDIS_URL`: Redis connection string
- `LLM_PROVIDER`: The LLM provider to use (`openai`, `groq`, or `gemini`)
- `OPENAI_API_KEY`, `GROQ_API_KEY`, or `GEMINI_API_KEY`: API key for the chosen provider
- `REDDIT_CLIENT_ID` / `REDDIT_CLIENT_SECRET`: Reddit API credentials
Optional variables:
- `TWITTER_BEARER_TOKEN`: Twitter API access
- `LINKEDIN_RAPIDAPI_KEY`: LinkedIn data API access
- `SENTRY_DSN`: Error tracking
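These variables map naturally onto a Pydantic settings class; a hedged sketch of what `config/settings.py` might look like (field defaults and exact names are assumptions):

```python
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    """Loaded from the environment / .env; names match the variables above."""
    database_url: str
    redis_url: str
    llm_provider: str = "openai"  # openai | groq | gemini
    openai_api_key: str | None = None
    groq_api_key: str | None = None
    gemini_api_key: str | None = None
    reddit_client_id: str
    reddit_client_secret: str
    twitter_bearer_token: str | None = None
    linkedin_rapidapi_key: str | None = None
    sentry_dsn: str | None = None

    model_config = {"env_file": ".env"}

settings = Settings()
```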
MIT License. See LICENSE for details.