
πŸͺ Hook Mining Engine

Weekly-updated library of viral social media hook patterns: crawls 1,000+ posts/week, extracts hooks, classifies patterns, and surfaces emerging trends.

Python 3.11+ FastAPI Celery


🎯 What It Does

The Hook Mining Engine is an automated pipeline that:

  1. Crawls viral social media posts from Reddit, HackerNews, LinkedIn, and Twitter
  2. Extracts attention-grabbing opening hooks using NLP heuristics (see the extraction sketch after this list)
  3. Classifies hooks into 12 canonical pattern types via GPT-4o-mini (a classification sketch follows the pattern table)
  4. Embeds hooks as vectors for semantic similarity search (pgvector)
  5. Detects trending and emerging hook patterns week-over-week
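
To make step 2 concrete, here is a minimal sketch of a first-sentence extraction heuristic built on spaCy (the library named under Key Technologies). The function name, length threshold, and fallback are illustrative assumptions, not the repo's actual src/extractor/hook_extractor.py API:

```python
# A minimal, hypothetical hook-extraction heuristic: treat a post's opening
# sentence (or two, for very short openers) as the candidate hook.
import spacy

nlp = spacy.load("en_core_web_sm")  # model download shown under Development

def extract_hook(post_text: str, max_chars: int = 280) -> str:
    """Return the opening sentence(s) of a post as the candidate hook."""
    doc = nlp(post_text.strip())
    sentences = [sent.text.strip() for sent in doc.sents]
    if not sentences:
        return ""
    hook = sentences[0]
    # Very short openers often need the second sentence for the payoff.
    if len(hook) < 40 and len(sentences) > 1:
        hook = f"{hook} {sentences[1]}"
    return hook[:max_chars]

print(extract_hook("Nobody tells you this about startups. Most fail at distribution."))
```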

12 Hook Patterns

| Pattern | Description | Example |
|---------|-------------|---------|
| curiosity_gap | Creates information gaps | "Nobody tells you this about startups..." |
| transformation | Shows before → after | "I went from $0 to $1M in 12 months" |
| contrarian | Challenges conventional wisdom | "Everything you know about SEO is wrong" |
| social_proof | Leverages numbers/authority | "After 10 years building SaaS companies..." |
| how_to_promise | Promises specific knowledge | "Here's exactly how to grow your newsletter" |
| pain_point | Targets specific struggles | "Stop making this copywriting mistake" |
| authority | Establishes credibility first | "I've sold $50M in online courses. Here's..." |
| list_format | Uses numbered lists | "7 frameworks that 10x your conversion rate" |
| question_hook | Opens with engaging questions | "What if you could double your traffic?" |
| urgency | Creates time pressure | "This strategy works right now in 2025" |
| story_opener | Starts with narrative | "In 2019, I was broke and sleeping on a couch" |
| bold_claim | Makes provocative assertions | "Email marketing is dead. Here's what works." |
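
For step 3, here is a hedged sketch of how a GPT-4o-mini call can be constrained to these 12 labels. The prompt wording, helper name, and fallback are assumptions for illustration; the repo's real prompts live under src/classifier/prompts/:

```python
# Hypothetical classification helper: ask GPT-4o-mini for exactly one of the
# 12 canonical labels from the table above.
from openai import OpenAI

PATTERNS = [
    "curiosity_gap", "transformation", "contrarian", "social_proof",
    "how_to_promise", "pain_point", "authority", "list_format",
    "question_hook", "urgency", "story_opener", "bold_claim",
]

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_hook(hook: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Classify the social media hook into exactly one of "
                           "these labels and reply with the label only: "
                           + ", ".join(PATTERNS),
            },
            {"role": "user", "content": hook},
        ],
    )
    label = response.choices[0].message.content.strip()
    return label if label in PATTERNS else "unknown"  # guard against drift

print(classify_hook("I went from $0 to $1M in 12 months"))  # → transformation
```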

πŸ—οΈ Architecture

┌──────────────────────────────────────────────────────────────────┐
│                     Hook Mining Engine                           │
├──────────────────┬──────────────────┬────────────────────────────┤
│   Data Sources   │  Pipeline Tasks  │      Serving Layer         │
│                  │                  │                            │
│  Reddit (PRAW)   │  crawl_task ─────│──→ FastAPI REST API        │
│  HackerNews      │  extract_task    │──→ Streamlit Dashboard     │
│  LinkedIn        │  classify_task   │──→ Prometheus Metrics      │
│  Twitter/X       │  trend_task      │──→ Grafana Dashboards      │
├──────────────────┼──────────────────┼────────────────────────────┤
│               PostgreSQL + pgvector │  Redis (Broker + Cache)    │
└──────────────────┴──────────────────┴────────────────────────────┘

Key Technologies

  • API: FastAPI + Uvicorn (async, OpenAPI docs)
  • Tasks: Celery + Redis (distributed pipeline)
  • Database: PostgreSQL 16 + pgvector (vector similarity search)
  • Cache/Broker: Redis 7 (task broker + bloom filter + caching)
  • NLP: spaCy (sentence splitting), sentence-transformers (embeddings)
  • LLM: OpenAI GPT-4o-mini (hook classification)
  • Monitoring: Prometheus + Grafana
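
As a taste of the embedding stage (step 4), a minimal sentence-transformers sketch; the model name all-MiniLM-L6-v2 is an assumption (a common default), and the repo's actual choice lives in src/extractor/embedding_generator.py:

```python
# Illustrative embedding generation: turn hooks into vectors suitable for a
# pgvector column. Normalizing makes cosine similarity a plain dot product.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, 384-dim output
hooks = [
    "Nobody tells you this about startups...",
    "7 frameworks that 10x your conversion rate",
]
vectors = model.encode(hooks, normalize_embeddings=True)
print(vectors.shape)  # (2, 384); each row is stored alongside its hook
```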

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Docker & Docker Compose
  • OpenAI API key
  • Reddit API credentials

1. Clone & Configure

git clone https://github.com/your-org/hook-mining-engine.git
cd hook-mining-engine
cp .env.example .env
# Edit .env with your API keys

2. Start Infrastructure

# Start all services (PostgreSQL, Redis, API, Worker, Dashboard, Monitoring)
docker compose up -d

# Or start infrastructure only for local development
docker compose up -d postgres redis

3. Run Database Migrations

# Apply all migrations
alembic upgrade head

# Seed the pattern taxonomy
python scripts/seed_pattern_taxonomy.py

4. Run a Test Crawl

# Crawl 50 posts from Reddit
python scripts/run_single_crawl.py --platform reddit --limit 50

5. Access Services

| Service | URL | Description |
|---------|-----|-------------|
| API | http://localhost:8000 | REST API |
| API Docs | http://localhost:8000/docs | OpenAPI / Swagger |
| Dashboard | http://localhost:8501 | Streamlit analytics |
| Flower | http://localhost:5555 | Celery task monitor |
| Grafana | http://localhost:3000 | Metrics dashboards |
| Prometheus | http://localhost:9090 | Metrics collector |

🛠️ Development

Local Setup (without Docker)

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # or .venv\Scripts\activate on Windows

# Install dependencies
pip install -e ".[dev,test]"

# Download spaCy model
python -m spacy download en_core_web_sm

# Start the API server
uvicorn src.library.api.main:app --reload --port 8000

# Start a Celery worker
celery -A src.pipeline.celery_app.celery_app worker --loglevel=info

# Start Celery Beat (scheduler)
celery -A src.pipeline.celery_app.celery_app beat --loglevel=info

Run Tests

pytest
pytest --cov=src --cov-report=html

Code Quality

ruff check src/
mypy src/

πŸ“ Project Structure

hook-engine/
├── config/
│   ├── settings.py          # Central configuration (Pydantic)
│   └── logging.yaml         # Logging configuration
├── src/
│   ├── classifier/          # GPT-4o-mini hook classification
│   │   ├── pattern_taxonomy.py
│   │   ├── cluster_analyzer.py
│   │   └── prompts/
│   ├── crawler/             # URL deduplication + Scrapy spiders
│   │   ├── dedup_filter.py
│   │   └── scrapy_project/
│   ├── extractor/           # Hook extraction pipeline
│   │   ├── hook_extractor.py
│   │   ├── sentence_splitter.py
│   │   ├── embedding_generator.py
│   │   └── virality_scorer.py
│   ├── library/             # API, models, repositories
│   │   ├── api/             # FastAPI routers + schemas
│   │   ├── models/          # SQLAlchemy ORM models
│   │   ├── repositories/    # Data access layer
│   │   └── dashboard/       # Streamlit analytics
│   ├── pipeline/            # Celery task orchestration
│   │   ├── celery_app.py
│   │   ├── workflow.py
│   │   └── tasks/
│   ├── sources/             # Platform API integrations
│   │   ├── base_source.py
│   │   ├── reddit_source.py
│   │   ├── hackernews_source.py
│   │   ├── linkedin_source.py
│   │   └── twitter_source.py
│   └── utils/               # Shared utilities
│       ├── db.py
│       ├── redis_client.py
│       ├── http_client.py
│       └── observability.py
├── scripts/                 # CLI utilities
├── k8s/                     # Kubernetes manifests
├── monitoring/              # Prometheus + Grafana configs
├── alembic/                 # Database migrations
├── docker-compose.yml
├── Dockerfile
└── pyproject.toml

📊 API Endpoints

Hooks

  • GET /v1/hooks – List hooks with filters (platform, type, score)
  • GET /v1/hooks/{id} – Get hook details
  • POST /v1/hooks/semantic-search – Semantic similarity search (see the sketch after this list)
  • GET /v1/hooks/export/csv – Export hooks as CSV
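
A hedged sketch of the kind of pgvector query that can back POST /v1/hooks/semantic-search. The table and column names (hooks, hook_text, embedding) and the connection string are assumptions; the repo's real queries live in src/library/repositories/:

```python
# Hypothetical semantic search: embed the query, then order stored hooks by
# cosine distance (pgvector's <=> operator). Requires: pip install pgvector psycopg
import psycopg
from pgvector.psycopg import register_vector
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = model.encode("how to grow a newsletter", normalize_embeddings=True)

with psycopg.connect("postgresql://localhost/hook_engine") as conn:
    register_vector(conn)  # lets psycopg pass numpy arrays as vector values
    rows = conn.execute(
        "SELECT hook_text, hook_type, embedding <=> %s AS distance "
        "FROM hooks ORDER BY distance LIMIT 10",
        (query_vec,),
    ).fetchall()

for hook_text, hook_type, distance in rows:
    print(f"{distance:.3f}  [{hook_type}]  {hook_text}")
```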

Patterns

  • GET /v1/patterns – List all patterns with weekly stats
  • GET /v1/patterns/{hook_type} – Get pattern details + trend history

Crawl Runs

  • GET /v1/crawl-runs – List crawl run history
  • GET /v1/crawl-runs/{id} – Get crawl run details

Health

  • GET /health – Liveness probe
  • GET /health/ready – Readiness probe (checks DB + Redis; sketched below)
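
A minimal sketch of what the readiness probe can look like in FastAPI, assuming direct SQLAlchemy and Redis clients; the wiring here is illustrative, not the repo's actual src/library/api/ code:

```python
# Hypothetical liveness/readiness endpoints: ready only if Postgres and Redis
# both answer, returning 503 otherwise so the orchestrator holds traffic.
from fastapi import FastAPI, Response
from redis import Redis
from sqlalchemy import create_engine, text

app = FastAPI()
engine = create_engine("postgresql+psycopg://localhost/hook_engine")
redis_client = Redis.from_url("redis://localhost:6379/0")

@app.get("/health")
def liveness() -> dict:
    return {"status": "ok"}

@app.get("/health/ready")
def readiness(response: Response) -> dict:
    checks = {}
    try:
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        checks["postgres"] = "ok"
    except Exception:
        checks["postgres"] = "down"
    try:
        redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "down"
    if "down" in checks.values():
        response.status_code = 503
    return checks
```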

📈 Pipeline Schedule

The pipeline runs automatically every Sunday at 2:00 AM UTC via Celery Beat:

crawl_task → extract_task → classify_task → trend_task

Each stage processes data in batches and hands off to the next via Celery task chaining.
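
A hedged sketch of that schedule and chain with Celery Beat. Task bodies, signatures, and the schedule entry name are assumptions; the real orchestration lives in src/pipeline/workflow.py and src/pipeline/celery_app.py:

```python
# Hypothetical weekly pipeline: four chained tasks, kicked off by Celery Beat
# every Sunday at 02:00 UTC (Celery's default timezone is UTC).
from celery import Celery, chain
from celery.schedules import crontab

app = Celery("hook_engine", broker="redis://localhost:6379/0")

@app.task
def crawl_task():
    ...  # fetch ~1,000 posts; return a crawl_run_id

@app.task
def extract_task(crawl_run_id):
    ...  # split posts into sentences, extract hooks; return hook batch ids

@app.task
def classify_task(batch_ids):
    ...  # label each hook with one of the 12 patterns

@app.task
def trend_task(_):
    ...  # compare week-over-week pattern frequencies

app.conf.beat_schedule = {
    "weekly-hook-mining": {
        "task": "run_weekly_pipeline",
        "schedule": crontab(day_of_week="sunday", hour=2, minute=0),
    },
}

@app.task(name="run_weekly_pipeline")
def run_weekly_pipeline():
    # Each task's return value becomes the next task's first argument.
    chain(crawl_task.s(), extract_task.s(), classify_task.s(), trend_task.s())()
```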


🔧 Environment Variables

See .env.example for all available configuration options.

Required variables:

  • DATABASE_URL – PostgreSQL connection string
  • REDIS_URL – Redis connection string
  • LLM_PROVIDER – The LLM provider to use (openai, groq, or gemini)
  • OPENAI_API_KEY / GROQ_API_KEY / GEMINI_API_KEY – API key for the chosen provider
  • REDDIT_CLIENT_ID / REDDIT_CLIENT_SECRET – Reddit API credentials

Optional variables:

  • TWITTER_BEARER_TOKEN – Twitter API access
  • LINKEDIN_RAPIDAPI_KEY – LinkedIn data API access
  • SENTRY_DSN – Error tracking
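
Since config/settings.py centralizes configuration with Pydantic, loading these variables might look like the sketch below; field names mirror the lists above, and which fields get defaults is an assumption:

```python
# Hypothetical central settings object: reads .env, fails fast at startup if a
# required variable is missing.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str                        # DATABASE_URL
    redis_url: str                           # REDIS_URL
    llm_provider: str                        # "openai", "groq", or "gemini"
    openai_api_key: str | None = None        # key for whichever provider is set
    groq_api_key: str | None = None
    gemini_api_key: str | None = None
    reddit_client_id: str
    reddit_client_secret: str
    twitter_bearer_token: str | None = None
    linkedin_rapidapi_key: str | None = None
    sentry_dsn: str | None = None

settings = Settings()  # raises a ValidationError if required vars are absent
```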

πŸ“ License

MIT License – see LICENSE for details.
