
SeekSpider

Smart Job Scraper for SEEK
An AI-augmented web scraping tool built with Scrapy that extracts, processes, and analyzes job listings from seek.com.au. SeekSpider delivers real-time job market intelligence: tech stack trends, salary insights, and clean PostgreSQL integration.



📚 Overview

SeekSpider is a modular scraping system for job market analysis. It collects IT-related job postings from SEEK using Scrapy and Selenium, enriches the data with AI-powered salary and tech stack analysis, and stores everything in a PostgreSQL database with JSONB fields for flexibility and speed.


βš™οΈ Features

🕸 Data Collection

  • Scrapy crawler with category + pagination traversal
  • Selenium-based authentication (see the login sketch after this list)
  • BeautifulSoup integration for fine-grained parsing
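
A minimal sketch of what the Selenium login step can look like. The login URL and form field names here are assumptions, not SEEK's actual markup:

# Hypothetical sketch of a Selenium-based login (URL and selectors assumed).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def seek_login(username: str, password: str) -> dict:
    """Log in with a real browser and return session cookies for Scrapy."""
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.seek.com.au/oauth/login")  # assumed login URL
        wait = WebDriverWait(driver, 20)
        wait.until(EC.presence_of_element_located((By.NAME, "emailAddress"))).send_keys(username)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.XPATH, "//button[@type='submit']").click()
        wait.until(EC.url_contains("seek.com.au"))
        # Hand the authenticated session back to the crawler.
        return {c["name"]: c["value"] for c in driver.get_cookies()}
    finally:
        driver.quit()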

🧠 AI Integration

  • Extracts and analyzes technology stacks (sketched below)
  • Normalizes salary info
  • Generates demand statistics on tech usage
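
For illustration, tech-stack extraction might look like the following sketch. The prompt and response handling are assumptions, built on the OpenAI-style settings shown in the Configuration section:

# Hedged sketch: ask an OpenAI-compatible chat endpoint for a tech stack.
import json
import os

import requests

def extract_tech_stack(job_description: str) -> list:
    """Return the technologies mentioned in a job ad (prompt is an assumption)."""
    response = requests.post(
        os.environ["AI_API_URL"],  # an OpenAI-style /chat/completions endpoint
        headers={"Authorization": f"Bearer {os.environ['AI_API_KEY']}"},
        json={
            "model": os.environ.get("AI_MODEL", "gpt-4"),
            "messages": [{
                "role": "user",
                "content": "List every technology mentioned in this job ad as "
                           "a JSON array of strings:\n\n" + job_description,
            }],
        },
        timeout=60,
    )
    response.raise_for_status()
    # Assumes the model replies with a bare JSON array.
    return json.loads(response.json()["choices"][0]["message"]["content"])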

💾 Database & Storage

  • PostgreSQL with JSONB for flexible schema
  • Transaction-safe pipeline with smart upserts (see the sketch below)
  • Automatic job status tracking
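
A minimal sketch of the upsert idea, using psycopg2 and the Jobs schema shown later in this README. The conflict target and column subset are assumptions:

# Sketch of a transaction-safe upsert with psycopg2 (conflict target assumed).
import psycopg2

UPSERT_SQL = """
INSERT INTO "Jobs" ("Id", "JobTitle", "BusinessName", "Url", "UpdatedAt")
VALUES (%(Id)s, %(JobTitle)s, %(BusinessName)s, %(Url)s, CURRENT_TIMESTAMP)
ON CONFLICT ("Id") DO UPDATE SET
    "JobTitle"     = EXCLUDED."JobTitle",
    "BusinessName" = EXCLUDED."BusinessName",
    "Url"          = EXCLUDED."Url",
    "UpdatedAt"    = CURRENT_TIMESTAMP;
"""

def upsert_job(conn, item: dict) -> None:
    # `with conn` commits on success and rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute(UPSERT_SQL, item)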

🧰 Architecture

  • Modular class structure (DatabaseManager, AIClient, Logger, Utils)
  • Environment-configured settings
  • Batch-safe crawling and retry mechanisms (sketched below)
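
As an illustration of the retry idea, a helper might look like this generic sketch (not the project's exact implementation):

# Generic retry decorator sketch with linear backoff (illustrative only).
import functools
import time

def retry(times: int = 3, delay: float = 2.0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise  # exhausted retries: surface the error
                    time.sleep(delay * attempt)  # back off a little more each time
        return wrapper
    return decorator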

🚀 Getting Started

Prerequisites

  • Python 3.9+
  • PostgreSQL (with an active database)
  • Google Chrome + ChromeDriver
  • Git

Installation

git clone https://github.com/qinscode/SeekSpider.git
cd SeekSpider
pip install -r requirements.txt

Configuration

Create a .env file in the root directory:

POSTGRESQL_HOST=localhost
POSTGRESQL_PORT=5432
POSTGRESQL_USER=postgres
POSTGRESQL_PASSWORD=secret
POSTGRESQL_DATABASE=seek_data
POSTGRESQL_TABLE=Jobs

SEEK_USERNAME=your_email
SEEK_PASSWORD=your_password

AI_API_KEY=your_api_key
AI_API_URL=https://api.openai.com/v1/...
AI_MODEL=gpt-4

Make sure PostgreSQL is running and your credentials are correct.
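
A quick way to verify the connection with the values above:

psql -h localhost -p 5432 -U postgres -d seek_data -c '\dt'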


πŸƒ Run the Spider

Option 1: With main script

python main.py

Option 2: With Scrapy

scrapy crawl seek

This will log in to SEEK, collect job data, and store it in PostgreSQL.


πŸ” API Query Parameters

The spider uses SEEK's internal search API. Here's an example of the search parameters:

search_params = {
    'where': 'All Perth WA',
    'classification': '6281',  # IT category
    'seekSelectAllPages': 'true',
    'locale': 'en-AU',
}
  • Supports subclassification traversal
  • Automatically paginated (see the sketch below)
  • SEO metadata enabled
  • Auth tokens handled automatically
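
A hedged sketch of the pagination loop. The endpoint URL and response field names are assumptions, since SEEK's internal API is undocumented:

# Illustrative pagination loop against SEEK's internal search API.
# SEARCH_URL and the response shape are assumptions.
import requests

SEARCH_URL = "https://www.seek.com.au/api/chalice-search/v4/search"  # assumed

def iter_jobs(search_params: dict):
    page = 1
    while True:
        resp = requests.get(SEARCH_URL, params={**search_params, "page": page}, timeout=30)
        resp.raise_for_status()
        jobs = resp.json().get("data", [])  # assumed key for the result list
        if not jobs:
            break  # an empty page means we've walked past the last result
        yield from jobs
        page += 1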

🧱 Project Structure

SeekSpider/
β”œβ”€β”€ spiders/seek_spider.py      # Main spider
β”œβ”€β”€ pipelines.py                # Data insertion logic
β”œβ”€β”€ items.py                    # Data model
β”œβ”€β”€ settings.py                 # Scrapy settings
β”œβ”€β”€ main.py                     # Entry point
β”œβ”€β”€ db/                         # Database utilities
β”œβ”€β”€ ai/                         # AI analysis components
└── utils/                      # Parsing, token, salary analyzers

🧩 Key Modules

  • DatabaseManager: Context-managed PostgreSQL operations with retries
  • Logger: Colored logging with levels + per-component logs
  • AIClient: Handles external API requests and formatting
  • TechStackAnalyzer: NLP-based tech term extraction
  • SalaryNormalizer: Converts pay ranges to numeric bounds (sketched below)
  • Config: Loads and validates .env settings
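
For instance, the pay-range conversion could work roughly like this sketch; the regex heuristic is an assumption, and the real module is AI-assisted and more robust:

# Illustrative sketch of pay-range normalization (heuristic is an assumption).
import re
from typing import Optional, Tuple

def normalize_salary(pay_range: str) -> Optional[Tuple[int, int]]:
    """Convert text like '$80,000 - $100,000 per annum' to (80000, 100000)."""
    nums = [int(n.replace(",", "")) for n in re.findall(r"\$?([\d,]{4,})", pay_range)]
    if not nums:
        return None  # no parseable figures, e.g. 'Competitive salary'
    if len(nums) == 1:
        return nums[0], nums[0]  # single figure: use it as both bounds
    return min(nums), max(nums)

print(normalize_salary("$80,000 - $100,000 per annum"))  # (80000, 100000)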

🗃 Database Schema

-- ----------------------------
-- Table structure for Jobs
-- ----------------------------
DROP TABLE IF EXISTS "public"."Jobs";
CREATE TABLE "public"."Jobs" (
  "Id" int4 NOT NULL GENERATED BY DEFAULT AS IDENTITY (
    INCREMENT 1
    MINVALUE 1
    MAXVALUE 2147483647
    START 1
    CACHE 1
  ),
  "JobTitle" text COLLATE "pg_catalog"."default",
  "BusinessName" text COLLATE "pg_catalog"."default",
  "WorkType" text COLLATE "pg_catalog"."default",
  "JobType" text COLLATE "pg_catalog"."default",
  "PayRange" text COLLATE "pg_catalog"."default",
  "Suburb" text COLLATE "pg_catalog"."default",
  "Area" text COLLATE "pg_catalog"."default",
  "Url" text COLLATE "pg_catalog"."default",
  "PostedDate" timestamp(6),
  "JobDescription" text COLLATE "pg_catalog"."default",
  "AdvertiserId" int4,
  "CreatedAt" timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
  "UpdatedAt" timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
  "IsNew" bool,
  "IsActive" bool DEFAULT true,
  "ExpiryDate" timestamp(6),
  "MaxSalary" int4,
  "MinSalary" int4,
  "LocationType" text COLLATE "pg_catalog"."default",
  "TechStack" text COLLATE "pg_catalog"."default",
  "IsUserCreated" bool
)
;
ALTER TABLE "public"."Jobs" OWNER TO "postgres";

-- ----------------------------
-- Primary Key structure for table Jobs
-- ----------------------------
ALTER TABLE "public"."Jobs" ADD CONSTRAINT "PK_Jobs" PRIMARY KEY ("Id");

Recommended indexes (note: a GIN index cannot be built directly on the text "TechStack" column; it requires the pg_trgm extension and its gin_trgm_ops operator class):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_active ON "Jobs" ("IsActive");
CREATE INDEX idx_salary ON "Jobs" ("MinSalary", "MaxSalary");
CREATE INDEX idx_techstack ON "Jobs" USING GIN ("TechStack" gin_trgm_ops);

🤝 Contributing

Pull requests are welcome!
Please open an issue to discuss major changes.

git checkout -b feature/my-new-feature
git commit -m "feat: add new parser"
git push origin feature/my-new-feature

📄 License

Licensed under the Apache License 2.0.


πŸ™ Acknowledgments
