Async Booking.com scraper with Playwright — Multi-city search, detail extraction, pricing, amenities, and reviews with anti-bot handling

Edioff/booking-scraper

Booking.com Scraper — Async Multi-City Hotel Data Extraction

Async Booking.com scraper with 3 parallel workers, extracting hotel listings, room details, pricing, amenities, reviews, and geolocation across multiple cities. Built with Playwright for full JavaScript rendering.

Overview

A production-grade scraper for Booking.com that handles the platform's complex dynamic UI. It uses Playwright's async API with concurrent workers to extract detailed hotel-level and room-level data at scale.

The scraper operates in two phases:

  1. Search phase — Discovers hotels by city with scroll-based pagination and "Load more" button detection
  2. Detail phase — 3 concurrent workers visit each hotel page to extract room-level pricing, amenities, images, and availability
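The hand-off between the two phases can be pictured as a queue of hotel URLs drained by a small pool of workers. The following is a minimal sketch of that pattern, not the project's actual code: `detail_worker`, `run_detail_phase`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the real Playwright page visit.

```python
import asyncio


async def detail_worker(queue, results, fetch_detail):
    """Pull hotel URLs off the queue until the task is cancelled."""
    while True:
        url = await queue.get()
        try:
            results.append(await fetch_detail(url))
        except Exception as exc:  # keep the worker alive if one page fails
            results.append({"url": url, "error": str(exc)})
        finally:
            queue.task_done()


async def run_detail_phase(urls, fetch_detail, workers=3):
    """Distribute URLs from the search phase across `workers` concurrent tasks."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []
    tasks = [
        asyncio.create_task(detail_worker(queue, results, fetch_detail))
        for _ in range(workers)
    ]
    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()   # workers loop forever, so stop them explicitly
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


async def fake_fetch(url):
    """Hypothetical stand-in for visiting a hotel page in the browser."""
    await asyncio.sleep(0)
    return {"url": url, "rooms": []}


if __name__ == "__main__":
    print(asyncio.run(run_detail_phase(["hotel-a", "hotel-b", "hotel-c"], fake_fetch)))
```

Results arrive in completion order rather than queue order, so anything that needs the original ordering should sort or key the output by URL.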

Features

  • Async architecture — Built on asyncio + Playwright async API for high throughput
  • 3 parallel workers — Concurrent detail page processing with queue-based distribution
  • Multi-city search — Configure multiple cities with different parameters
  • Deep room extraction — Individual room types, pricing plans, cancellation policies
  • Modal handling — Opens and parses room detail modals for complete data
  • Anti-overlay system — Automatically detects and closes popups, modals, and prompts
  • Smart pagination — Handles "Load more" buttons, scroll-based loading, and traditional pagination
  • Geolocation — Extracts lat/lon from structured data (JSON-LD)
  • Configurable — All parameters via environment variables
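As an illustration of the scroll-based pagination, the loop below keeps scrolling until a set number of consecutive "waves" produce no new hotel cards. It is a sketch under assumptions: `scroll_until_idle` and `FakePage` are hypothetical names, and the two callables are injected so the logic can run without a browser.

```python
import asyncio
from typing import Awaitable, Callable


async def scroll_until_idle(
    scroll_once: Callable[[], Awaitable[None]],
    count_cards: Callable[[], Awaitable[int]],
    max_idle_waves: int = 5,
) -> int:
    """Scroll until `max_idle_waves` consecutive scrolls add no new cards."""
    seen = await count_cards()
    idle = 0
    while idle < max_idle_waves:
        await scroll_once()
        current = await count_cards()
        if current > seen:
            seen = current
            idle = 0       # progress was made, reset the idle counter
        else:
            idle += 1      # nothing new appeared in this wave
    return seen


class FakePage:
    """Hypothetical stand-in for a results page that loads cards in batches."""

    def __init__(self, total: int, per_wave: int) -> None:
        self.total, self.per_wave = total, per_wave
        self.loaded = 0
        self.scrolls = 0

    async def scroll(self) -> None:
        self.scrolls += 1
        self.loaded = min(self.total, self.loaded + self.per_wave)

    async def count(self) -> int:
        return self.loaded
```

With a real Playwright page, `scroll_once` could wrap `page.mouse.wheel(0, 1200)` and `count_cards` could wrap a locator's `count()`; which selector identifies a hotel card is left open here.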

Data Points Extracted

Hotel Level

| Field | Description |
| --- | --- |
| Hotel name | Property title |
| Address | Full street address |
| City / Country | Location details |
| Latitude / Longitude | Geo coordinates from JSON-LD |
| Property type | Hotel, apartment, hostel, etc. |
| Listing status | Active/inactive |
| Amenities | Full amenity list with flags |
| Images | Property photo URLs |
| Review score | Comfort/quality rating |
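To show what reading coordinates from JSON-LD can look like, here is a small sketch that scans a page's HTML for `application/ld+json` script blocks and returns the first `geo` object it finds. `extract_geo` is a hypothetical helper, and the assumption that the hotel page exposes a schema.org-style `geo` object with `latitude` and `longitude` keys is not confirmed by this README.

```python
import json
import re

# Matches the body of every <script type="application/ld+json"> element.
_LD_JSON_RE = re.compile(
    r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def extract_geo(html: str):
    """Return (lat, lon) from the first JSON-LD block with a geo object, else None."""
    for raw in _LD_JSON_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the page
        items = data if isinstance(data, list) else [data]
        for item in items:
            geo = item.get("geo") if isinstance(item, dict) else None
            if isinstance(geo, dict) and "latitude" in geo and "longitude" in geo:
                try:
                    return float(geo["latitude"]), float(geo["longitude"])
                except (TypeError, ValueError):
                    continue
    return None
```

In a Playwright session the HTML string could come from `await page.content()`.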

Room Level (per unit)

| Field | Description |
| --- | --- |
| Unit ID | Room type identifier |
| Unit name | Room type name |
| Price amount | Numeric price |
| Price currency | COP, USD, EUR, etc. |
| Price text | Full price string |
| Plan name | Rate plan description |
| Cancellation | Free cancellation policy text |
| Beds | Bed configuration text |
| Size (m²) | Room size in square meters |
| Amenities | Room-specific amenities |
| Sections | Private bathroom, view, equipment, smoking policy |
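One way to carry the fields listed above through the pipeline is a pair of dataclasses. This is a sketch only: the class names, attribute names, and the `parse_size_m2` helper are assumptions chosen to mirror the tables, not the project's actual schema.

```python
import re
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RoomUnit:
    unit_id: str
    unit_name: str
    price_amount: Optional[float] = None
    price_currency: Optional[str] = None
    price_text: Optional[str] = None
    plan_name: Optional[str] = None
    cancellation: Optional[str] = None
    beds: Optional[str] = None
    size_m2: Optional[float] = None
    amenities: list = field(default_factory=list)
    sections: dict = field(default_factory=dict)


@dataclass
class HotelRecord:
    name: str
    address: Optional[str] = None
    city: Optional[str] = None
    country: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    property_type: Optional[str] = None
    listing_status: Optional[str] = None
    amenities: list = field(default_factory=list)
    images: list = field(default_factory=list)
    review_score: Optional[float] = None
    rooms: list = field(default_factory=list)


def parse_size_m2(text: str) -> Optional[float]:
    """Pull a size such as '25 m²' or '18,5 m2' out of free text."""
    match = re.search(r"(\d+(?:[.,]\d+)?)\s*m(?:²|2)", text)
    return float(match.group(1).replace(",", ".")) if match else None
```

`asdict(hotel)` converts a record and its nested rooms into plain dictionaries, which is convenient when writing JSON output.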

Tech Stack

  • Playwright (async) — Full browser automation with JavaScript rendering
  • asyncio — Concurrent task execution with worker pools
  • Python 3.10+ — Dataclasses, type hints, zoneinfo

Installation

git clone https://github.com/Edioff/booking-scraper.git
cd booking-scraper
pip install -r requirements.txt
playwright install chromium

Configuration

Cities file (cities.booking.json)

{
  "cities": [
    {
      "name": "Bogota",
      "dest_id": "-592318",
      "dest_type": "city",
      "nights": 2,
      "adults": 2,
      "currency": "COP",
      "lang": "es"
    }
  ]
}
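For illustration, a city entry like the one above could be turned into a search URL as follows. This is a sketch under assumptions: `build_search_url` is a hypothetical helper, and the URL path and query parameter names (`checkin`, `checkout`, `group_adults`, `selected_currency`) are guesses at Booking.com's search URL format rather than values taken from this project. Check-in is set to tomorrow, mirroring the documented `BOOKING_FORCE_TOMORROW` default.

```python
import json
from datetime import date, timedelta
from typing import Optional
from urllib.parse import urlencode

# Same structure as cities.booking.json, inlined so the example is self-contained.
CONFIG_TEXT = """
{"cities": [{"name": "Bogota", "dest_id": "-592318", "dest_type": "city",
             "nights": 2, "adults": 2, "currency": "COP", "lang": "es"}]}
"""


def build_search_url(city: dict, today: Optional[date] = None) -> str:
    """Build a search URL with check-in tomorrow and check-out `nights` later."""
    checkin = (today or date.today()) + timedelta(days=1)
    checkout = checkin + timedelta(days=int(city.get("nights", 1)))
    params = {
        # Parameter names below are assumptions, not confirmed by the README.
        "dest_id": city["dest_id"],
        "dest_type": city.get("dest_type", "city"),
        "checkin": checkin.isoformat(),
        "checkout": checkout.isoformat(),
        "group_adults": city.get("adults", 2),
        "selected_currency": city.get("currency", "USD"),
        "lang": city.get("lang", "en"),
    }
    return "https://www.booking.com/searchresults.html?" + urlencode(params)


cities = json.loads(CONFIG_TEXT)["cities"]
urls = [build_search_url(city) for city in cities]
```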

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| DETAIL_WORKERS | 3 | Parallel workers for detail extraction |
| MAX_IDLE_WAVES | 5 | Max scroll waves without new results |
| VIEWPORT_W | 1920 | Browser viewport width |
| VIEWPORT_H | 1080 | Browser viewport height |
| BOOKING_FORCE_TOMORROW | true | Auto-set check-in to tomorrow |
| BROWSER_PER_DETAIL | true | Fresh browser per detail page |
| ON_OPEN_ESC_WAIT_MS | 3000 | Wait time after pressing ESC on overlays |
| SCROLL_STEP_PX | 1200 | Pixels per scroll step |
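A compact way to read these variables with their documented defaults is a frozen settings object. The sketch below is illustrative: `Settings` and `load_settings` are hypothetical names, and the set of strings accepted as "true" is an assumption, since the README does not say how the script parses booleans.

```python
import os
from dataclasses import dataclass


def _env_int(env, name, default):
    """Read an integer variable, falling back to the documented default."""
    return int(env.get(name, default))


def _env_bool(env, name, default):
    """Read a boolean variable; accepted truthy spellings are an assumption."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    detail_workers: int
    max_idle_waves: int
    viewport_w: int
    viewport_h: int
    force_tomorrow: bool
    browser_per_detail: bool
    on_open_esc_wait_ms: int
    scroll_step_px: int


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        detail_workers=_env_int(env, "DETAIL_WORKERS", 3),
        max_idle_waves=_env_int(env, "MAX_IDLE_WAVES", 5),
        viewport_w=_env_int(env, "VIEWPORT_W", 1920),
        viewport_h=_env_int(env, "VIEWPORT_H", 1080),
        force_tomorrow=_env_bool(env, "BOOKING_FORCE_TOMORROW", True),
        browser_per_detail=_env_bool(env, "BROWSER_PER_DETAIL", True),
        on_open_esc_wait_ms=_env_int(env, "ON_OPEN_ESC_WAIT_MS", 3000),
        scroll_step_px=_env_int(env, "SCROLL_STEP_PX", 1200),
    )
```

Passing a plain dictionary instead of `os.environ` makes the defaults easy to check in isolation, for example `load_settings({"DETAIL_WORKERS": "5"})`.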

Usage

python booking_scraper.py

Results are saved to data/booking_results_<timestamp>.json.
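Writing a timestamped results file can be sketched as below. `save_results` is a hypothetical helper, and the timestamp format (`YYYYMMDD_HHMMSS`, UTC) is an assumption; the README only states that a timestamp appears in the file name.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def save_results(results, out_dir="data", now=None) -> Path:
    """Write results to <out_dir>/booking_results_<timestamp>.json and return the path."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%d_%H%M%S")  # assumed format, not confirmed
    path = Path(out_dir) / f"booking_results_{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps accented hotel and city names readable in the file.
    path.write_text(
        json.dumps(results, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return path
```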

Architecture

main()
  │
  ├── For each city in config:
  │   ├── Build search URL with dates/guests/currency
  │   ├── Load search page
  │   ├── Scroll + "Load More" pagination
  │   ├── Extract hotel cards → Queue
  │   │
  │   └── 3x Detail Workers (concurrent):
  │       ├── Dequeue hotel URL
  │       ├── Navigate to hotel page
  │       ├── Close overlays/popups
  │       ├── Extract hotel-level data
  │       ├── Extract room units (JS + DOM)
  │       ├── Parse room modals for full details
  │       ├── Extract amenities, images, coordinates
  │       └── Save structured result
  │
  └── Output: data/booking_results_<timestamp>.json

Notes

  • Designed for educational and research purposes
  • Respect Booking.com's Terms of Service and robots.txt
  • Use responsible rate limiting — the default 3 workers with delays is a good balance
  • Overlay/popup handling covers multiple languages (ES, EN, PT, IT, DE)
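Multi-language overlay handling usually comes down to recognising a dismiss button by its label. The sketch below shows one way to do that matching; the label lists are illustrative guesses rather than the strings the scraper actually uses, and `is_dismiss_label` is a hypothetical helper.

```python
import unicodedata

# Illustrative labels per language; the project's real list is not documented here.
DISMISS_LABELS = {
    "en": ["close", "dismiss", "no thanks"],
    "es": ["cerrar", "no, gracias"],
    "pt": ["fechar", "não, obrigado"],
    "it": ["chiudi", "no, grazie"],
    "de": ["schließen", "nein, danke"],
}


def _normalize(text: str) -> str:
    """Case-fold, strip accents, and trim so 'Não' and 'nao' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).strip()


_NORMALIZED = {
    _normalize(label) for labels in DISMISS_LABELS.values() for label in labels
}


def is_dismiss_label(text: str) -> bool:
    """Return True if a button's visible text looks like a dismiss action."""
    return _normalize(text) in _NORMALIZED
```

A caller would read each candidate button's text from the page, pass it through `is_dismiss_label`, and click only the matches, so that booking or sign-in buttons are never pressed by accident.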

Author

Johan Cruz — Data Engineer & Web Scraping Specialist

  • GitHub: @Edioff
  • Available for freelance projects

License

MIT
