
SeekSpider

Smart Job Scraper for SEEK
An AI-augmented web scraping tool built with Scrapy that extracts, processes, and analyzes job listings from seek.com.au. SeekSpider delivers real-time job market intelligence: tech stack trends, salary insights, and clean PostgreSQL integration.



📚 Overview

SeekSpider is a modular scraping system for job market analysis. It collects IT-related job postings from SEEK using Scrapy and Selenium, enriches the data with AI-powered salary and tech stack analysis, and stores everything in a PostgreSQL database with JSONB fields for flexibility and speed.


βš™οΈ Features

🕸 Data Collection

  • Scrapy crawler with category + pagination traversal
  • Selenium-based authentication (see the login sketch after this list)
  • BeautifulSoup integration for fine-grained parsing
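
A minimal sketch of what the Selenium login step can look like. The login URL and form field names here are assumptions, not SEEK's actual markup:

# Hypothetical sketch of a Selenium-based login (URL and selectors assumed).
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def seek_login(username: str, password: str) -> dict:
    """Log in with a real browser and return session cookies for Scrapy."""
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.seek.com.au/oauth/login")  # assumed login URL
        wait = WebDriverWait(driver, 20)
        wait.until(EC.presence_of_element_located((By.NAME, "emailAddress"))).send_keys(username)
        driver.find_element(By.NAME, "password").send_keys(password)
        driver.find_element(By.XPATH, "//button[@type='submit']").click()
        wait.until(EC.url_contains("seek.com.au"))
        # Hand the authenticated session back to the crawler.
        return {c["name"]: c["value"] for c in driver.get_cookies()}
    finally:
        driver.quit()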

🧠 AI Integration

  • Extracts and analyzes technology stacks (sketched below)
  • Normalizes salary info
  • Generates demand statistics on tech usage
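
For illustration, tech-stack extraction might look like the following sketch. The prompt and response handling are assumptions, built on the OpenAI-style settings shown in the Configuration section:

# Hedged sketch: ask an OpenAI-compatible chat endpoint for a tech stack.
import json
import os

import requests

def extract_tech_stack(job_description: str) -> list:
    """Return the technologies mentioned in a job ad (prompt is an assumption)."""
    response = requests.post(
        os.environ["AI_API_URL"],  # an OpenAI-style /chat/completions endpoint
        headers={"Authorization": f"Bearer {os.environ['AI_API_KEY']}"},
        json={
            "model": os.environ.get("AI_MODEL", "gpt-4"),
            "messages": [{
                "role": "user",
                "content": "List every technology mentioned in this job ad as "
                           "a JSON array of strings:\n\n" + job_description,
            }],
        },
        timeout=60,
    )
    response.raise_for_status()
    # Assumes the model replies with a bare JSON array.
    return json.loads(response.json()["choices"][0]["message"]["content"])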

💾 Database & Storage

  • PostgreSQL with JSONB for flexible schema
  • Transaction-safe pipeline with smart upserts (see the sketch below)
  • Automatic job status tracking
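
A minimal sketch of the upsert idea, using psycopg2 and the Jobs schema shown later in this README. The conflict target and column subset are assumptions:

# Sketch of a transaction-safe upsert with psycopg2 (conflict target assumed).
import psycopg2

UPSERT_SQL = """
INSERT INTO "Jobs" ("Id", "JobTitle", "BusinessName", "Url", "UpdatedAt")
VALUES (%(Id)s, %(JobTitle)s, %(BusinessName)s, %(Url)s, CURRENT_TIMESTAMP)
ON CONFLICT ("Id") DO UPDATE SET
    "JobTitle"     = EXCLUDED."JobTitle",
    "BusinessName" = EXCLUDED."BusinessName",
    "Url"          = EXCLUDED."Url",
    "UpdatedAt"    = CURRENT_TIMESTAMP;
"""

def upsert_job(conn, item: dict) -> None:
    # `with conn` commits on success and rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute(UPSERT_SQL, item)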

🧰 Architecture

  • Modular class structure (DatabaseManager, AIClient, Logger, Utils)
  • Environment-configured settings
  • Batch-safe crawling and retry mechanisms (sketched below)
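
As an illustration of the retry idea, a helper might look like this generic sketch (not the project's exact implementation):

# Generic retry decorator sketch with linear backoff (illustrative only).
import functools
import time

def retry(times: int = 3, delay: float = 2.0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, times + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times:
                        raise  # exhausted retries: surface the error
                    time.sleep(delay * attempt)  # back off a little more each time
        return wrapper
    return decorator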

🚀 Getting Started

Prerequisites

  • Python 3.9+
  • PostgreSQL (with an active database)
  • Google Chrome + ChromeDriver
  • Git

Installation

git clone https://github.com/qinscode/SeekSpider.git
cd SeekSpider
pip install -r requirements.txt

Configuration

Create a .env file in the root directory:

POSTGRESQL_HOST=localhost
POSTGRESQL_PORT=5432
POSTGRESQL_USER=postgres
POSTGRESQL_PASSWORD=secret
POSTGRESQL_DATABASE=seek_data
POSTGRESQL_TABLE=Jobs

SEEK_USERNAME=your_email
SEEK_PASSWORD=your_password

AI_API_KEY=your_api_key
AI_API_URL=https://api.openai.com/v1/...
AI_MODEL=gpt-4

Make sure PostgreSQL is running and your credentials are correct.
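
A quick way to verify the connection with the values above:

psql -h localhost -p 5432 -U postgres -d seek_data -c '\dt'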


πŸƒ Run the Spider

Option 1: With main script

python main.py

Option 2: With Scrapy

scrapy crawl seek

This will log in to SEEK, collect job data, and store it in PostgreSQL.


πŸ” API Query Parameters

The spider uses SEEK's internal search API. Here's an example of the search parameters:

search_params = {
    'where': 'All Perth WA',
    'classification': '6281',  # IT category
    'seekSelectAllPages': 'true',
    'locale': 'en-AU',
}
  • Supports subclassification traversal
  • Automatically paginated (see the sketch below)
  • SEO metadata enabled
  • Auth tokens handled automatically
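
A hedged sketch of the pagination loop. The endpoint URL and response field names are assumptions, since SEEK's internal API is undocumented:

# Illustrative pagination loop against SEEK's internal search API.
# SEARCH_URL and the response shape are assumptions.
import requests

SEARCH_URL = "https://www.seek.com.au/api/chalice-search/v4/search"  # assumed

def iter_jobs(search_params: dict):
    page = 1
    while True:
        resp = requests.get(SEARCH_URL, params={**search_params, "page": page}, timeout=30)
        resp.raise_for_status()
        jobs = resp.json().get("data", [])  # assumed key for the result list
        if not jobs:
            break  # an empty page means we've walked past the last result
        yield from jobs
        page += 1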

🧱 Project Structure

SeekSpider/
β”œβ”€β”€ spiders/seek_spider.py      # Main spider
β”œβ”€β”€ pipelines.py                # Data insertion logic
β”œβ”€β”€ items.py                    # Data model
β”œβ”€β”€ settings.py                 # Scrapy settings
β”œβ”€β”€ main.py                     # Entry point
β”œβ”€β”€ db/                         # Database utilities
β”œβ”€β”€ ai/                         # AI analysis components
└── utils/                      # Parsing, token, salary analyzers

🧩 Key Modules

  • DatabaseManager: Context-managed PostgreSQL operations with retries
  • Logger: Colored logging with levels + per-component logs
  • AIClient: Handles external API requests and formatting
  • TechStackAnalyzer: NLP-based tech term extraction
  • SalaryNormalizer: Converts pay ranges to numeric bounds (sketched below)
  • Config: Loads and validates .env settings
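
For instance, the pay-range conversion could work roughly like this sketch; the regex heuristic is an assumption, and the real module is AI-assisted and more robust:

# Illustrative sketch of pay-range normalization (heuristic is an assumption).
import re
from typing import Optional, Tuple

def normalize_salary(pay_range: str) -> Optional[Tuple[int, int]]:
    """Convert text like '$80,000 - $100,000 per annum' to (80000, 100000)."""
    nums = [int(n.replace(",", "")) for n in re.findall(r"\$?([\d,]{4,})", pay_range)]
    if not nums:
        return None  # no parseable figures, e.g. 'Competitive salary'
    if len(nums) == 1:
        return nums[0], nums[0]  # single figure: use it as both bounds
    return min(nums), max(nums)

print(normalize_salary("$80,000 - $100,000 per annum"))  # (80000, 100000)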

🗃 Database Schema

-- ----------------------------
-- Table structure for Jobs
-- ----------------------------
DROP TABLE IF EXISTS "public"."Jobs";
CREATE TABLE "public"."Jobs" (
  "Id" int4 NOT NULL GENERATED BY DEFAULT AS IDENTITY (
    INCREMENT 1
    MINVALUE 1
    MAXVALUE 2147483647
    START 1
    CACHE 1
  ),
  "JobTitle" text COLLATE "pg_catalog"."default",
  "BusinessName" text COLLATE "pg_catalog"."default",
  "WorkType" text COLLATE "pg_catalog"."default",
  "JobType" text COLLATE "pg_catalog"."default",
  "PayRange" text COLLATE "pg_catalog"."default",
  "Suburb" text COLLATE "pg_catalog"."default",
  "Area" text COLLATE "pg_catalog"."default",
  "Url" text COLLATE "pg_catalog"."default",
  "PostedDate" timestamp(6),
  "JobDescription" text COLLATE "pg_catalog"."default",
  "AdvertiserId" int4,
  "CreatedAt" timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
  "UpdatedAt" timestamp(6) NOT NULL DEFAULT CURRENT_TIMESTAMP,
  "IsNew" bool,
  "IsActive" bool DEFAULT true,
  "ExpiryDate" timestamp(6),
  "MaxSalary" int4,
  "MinSalary" int4,
  "LocationType" text COLLATE "pg_catalog"."default",
  "TechStack" text COLLATE "pg_catalog"."default",
  "IsUserCreated" bool
)
;
ALTER TABLE "public"."Jobs" OWNER TO "postgres";

-- ----------------------------
-- Primary Key structure for table Jobs
-- ----------------------------
ALTER TABLE "public"."Jobs" ADD CONSTRAINT "PK_Jobs" PRIMARY KEY ("Id");

Recommended indexes (note: a GIN index cannot be built directly on the text "TechStack" column; it requires the pg_trgm extension and its gin_trgm_ops operator class):

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_active ON "Jobs" ("IsActive");
CREATE INDEX idx_salary ON "Jobs" ("MinSalary", "MaxSalary");
CREATE INDEX idx_techstack ON "Jobs" USING GIN ("TechStack" gin_trgm_ops);

🤝 Contributing

Pull requests are welcome!
Please open an issue to discuss major changes.

git checkout -b feature/my-new-feature
git commit -m "feat: add new parser"
git push origin feature/my-new-feature

📄 License

Licensed under the Apache License 2.0.


πŸ™ Acknowledgments
