spidra-ruby

Official Ruby SDK for the Spidra web scraping and crawling API. Scrape pages, run browser actions, batch-process URLs, and crawl entire sites — all from Ruby, with no external dependencies.

Installation

gem install spidra

Or add it to your Gemfile:

gem "spidra"

Requires Ruby 2.7 or higher.

Quick start

require "spidra"

client = Spidra.new(ENV["SPIDRA_API_KEY"])

job = client.scrape.run(
  { urls: [{ url: "https://example.com/pricing" }],
    prompt: "Extract all pricing plans with name, price, and features",
    output: "json" }
)

puts job["content"]

Get your API key from app.spidra.io under Settings → API Keys.

Scraping

scrape.run

Submit a job and wait for it to finish. Returns the full result.

job = client.scrape.run(
  urls:   [{ url: "https://example.com" }],
  prompt: "Extract the main headline and subheading"
)

puts job["content"]

Pass poll_interval: and timeout: as keyword arguments to control how long it waits:

job = client.scrape.run(
  { urls: [{ url: "https://example.com" }], prompt: "..." },
  poll_interval: 5,
  timeout: 60
)

On timeout, run returns { "status" => "timeout", "jobId" => "..." } so you can keep polling with scrape.get.
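
For example, a minimal resume sketch (treating "failed" as the other terminal scrape status is an assumption here):

job = client.scrape.run(
  { urls: [{ url: "https://example.com" }], prompt: "Extract the headline" },
  timeout: 30
)

if job["status"] == "timeout"
  job_id = job["jobId"]
  # "completed" is documented; "failed" as the other terminal state is assumed.
  until %w[completed failed].include?(job["status"])
    sleep 5 # arbitrary polling interval
    job = client.scrape.get(job_id)
  end
end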

scrape.submit and scrape.get

Fire and forget — submit a job and check status yourself.

response = client.scrape.submit(
  urls:   [{ url: "https://example.com" }],
  prompt: "Extract the main headline"
)
job_id = response["jobId"]

# Later...
status = client.scrape.get(job_id)
puts status["content"] if status["status"] == "completed"

Scrape parameters

| Parameter | Type | Description |
| --- | --- | --- |
| urls | Array | Up to 3 entries. Each is { url: "..." } with an optional actions: array (see Browser actions below) |
| prompt | String | What to extract, in plain English |
| output | String | "markdown" (default) or "json" |
| schema | Hash | JSON Schema to enforce a specific output shape |
| use_proxy | Boolean | Route through a residential proxy |
| proxy_country | String | Two-letter country code, e.g. "us", "de", "jp" |
| extract_content_only | Boolean | Strip nav, ads, and boilerplate before extraction |
| screenshot | Boolean | Capture a viewport screenshot |
| full_page_screenshot | Boolean | Capture a full-page screenshot |
| cookies | String | Raw Cookie header for authenticated pages |
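
As a sketch of schema in practice, this asks for a fixed JSON shape; the plans/name/price fields are illustrative, not a required layout:

job = client.scrape.run(
  urls:   [{ url: "https://example.com/pricing" }],
  prompt: "Extract all pricing plans",
  output: "json",
  schema: {
    type: "object",
    properties: {
      plans: {
        type:  "array",
        items: {
          type: "object",
          properties: {
            name:  { type: "string" },
            price: { type: "string" }
          }
        }
      }
    }
  }
)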

Browser actions

Pass an actions: array inside a URL entry to interact with the page before extraction runs.

job = client.scrape.run(
  urls: [
    {
      url:     "https://example.com/products",
      actions: [
        { type: "click",  selector: "#accept-cookies" },
        { type: "wait",   duration: 1000 },
        { type: "scroll", to: "80%" }
      ]
    }
  ],
  prompt: "Extract all product names and prices"
)

Batch scraping

Submit up to 50 URLs in one request. They all run in parallel.

batch = client.batch.run(
  { urls: [
      "https://shop.example.com/product/1",
      "https://shop.example.com/product/2",
      "https://shop.example.com/product/3"
    ],
    prompt: "Extract product name, price, and stock status",
    output: "json" }
)

puts "#{batch["completedCount"]}/#{batch["totalUrls"]} completed"

batch["items"].each do |item|
  if item["status"] == "completed"
    puts item["result"].inspect
  else
    puts "Failed: #{item["url"]}#{item["error"]}"
  end
end

batch.submit and batch.get

response = client.batch.submit(
  urls:   ["https://example.com/1", "https://example.com/2"],
  prompt: "Extract the page title"
)
batch_id = response["batchId"]

result = client.batch.get(batch_id)
puts "#{result["completedCount"]}/#{result["totalUrls"]} done"

Retry failed items

if result["failedCount"] > 0
  client.batch.retry(batch_id)
end

Cancel a batch

client.batch.cancel(batch_id)

List past batches

page = client.batch.list(1, 20) # page, limit

page["jobs"].each do |job|
  puts "#{job["uuid"]} #{job["status"]}#{job["completedCount"]}/#{job["totalUrls"]}"
end

Crawling

job = client.crawl.run(
  { base_url:               "https://competitor.com/blog",
    crawl_instruction:      "Follow blog post links only — skip tag and category pages",
    transform_instruction:  "Extract post title, author, publish date, and a one-sentence summary",
    max_pages:              30,
    use_proxy:              true }
)

job["result"].each do |page|
  puts "#{page["url"]}: #{page["data"].inspect}"
end

Crawl jobs often take a few minutes. The default timeout for crawl.run is 300 seconds. Adjust with timeout: n if you expect longer runs.
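
For example, to give a large crawl 15 minutes:

job = client.crawl.run(
  { base_url:              "https://example.com/docs",
    crawl_instruction:     "Follow all documentation pages",
    transform_instruction: "Extract the page title",
    max_pages:             100 },
  timeout: 900 # 15 minutes
)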

crawl.submit and crawl.get

response = client.crawl.submit(
  base_url:              "https://example.com/docs",
  crawl_instruction:     "Follow all documentation pages",
  transform_instruction: "Extract the page title and a short content summary",
  max_pages:             50
)
job_id = response["jobId"]

status = client.crawl.get(job_id)
# status["status"]: "waiting" | "active" | "running" | "completed" | "failed"

Downloading raw content

result = client.crawl.pages(job_id)

result["pages"].each do |page|
  puts page["url"]
  # page["html_url"]     — download the raw HTML (expires in 1 hour)
  # page["markdown_url"] — download the Markdown version
end
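
Because the SDK itself has no external dependencies, Ruby's standard library is enough to fetch those links. A minimal sketch, assuming the URLs are plain GET-able links:

require "net/http"
require "uri"

result["pages"].each_with_index do |page, i|
  # The links expire after an hour, so download promptly.
  html = Net::HTTP.get(URI(page["html_url"]))
  File.write("crawl-page-#{i}.html", html)
end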

Re-extracting with a new prompt

result = client.crawl.extract(completed_job_id, "Extract product SKUs and prices as JSON")
new_job_id = result["jobId"]

extracted = client.crawl.get(new_job_id)

History and stats

history = client.crawl.history(1, 10)
puts "#{history["total"]} total crawl jobs"

stats = client.crawl.stats
puts "#{stats["total"]} all-time"

Logs

result = client.logs.list(
  status:     "failed",
  searchTerm: "amazon.com",
  dateStart:  "2024-01-01",
  dateEnd:    "2024-12-31",
  page:       1,
  limit:      20
)

result["logs"].each do |log|
  puts "#{log["urls"][0]["url"]}#{log["status"]} (#{log["credits_used"]} credits)"
end

# Full detail for a single log entry
log = client.logs.get(log_uuid)
puts log["result_data"].inspect

Usage statistics

rows = client.usage.get("30d") # "7d" | "30d" | "weekly"

rows.each do |row|
  puts "#{row["date"]}: #{row["requests"]} requests, #{row["credits"]} credits"
end

Error handling

require "spidra"

begin
  job = client.scrape.run(
    urls:   [{ url: "https://example.com" }],
    prompt: "Extract the headline"
  )
rescue Spidra::AuthenticationError
  puts "Invalid or missing API key"
rescue Spidra::InsufficientCreditsError
  puts "Account is out of credits"
rescue Spidra::RateLimitError
  puts "Rate limited — slow down"
rescue Spidra::ServerError => e
  puts "Server error (#{e.status}): #{e.message}"
rescue Spidra::Error => e
  puts "API error #{e.status}: #{e.message}"
end

| Exception | HTTP status | When |
| --- | --- | --- |
| Spidra::AuthenticationError | 401 | Missing or invalid API key |
| Spidra::InsufficientCreditsError | 403 | No credits remaining |
| Spidra::RateLimitError | 429 | Too many requests |
| Spidra::ServerError | 5xx | Unexpected server-side error |
| Spidra::Error | any | Base class for all Spidra exceptions |

All exceptions expose .status (HTTP status code) and .message.
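
With that, a coarse retry wrapper is straightforward to sketch; the attempt cap and backoff values here are arbitrary:

attempts = 0
begin
  job = client.scrape.run(
    urls:   [{ url: "https://example.com" }],
    prompt: "Extract the headline"
  )
rescue Spidra::RateLimitError, Spidra::ServerError => e
  attempts += 1
  raise if attempts > 3
  warn "Retrying after #{e.class} (status #{e.status})"
  sleep 2**attempts # exponential backoff: 2, 4, 8 seconds
  retry
end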

License

MIT. See LICENSE for details.
