Add Apify marketplace Actor #2
Merged
3 commits
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| # Wick Web Fetcher | ||
|
|
||
| A lightweight content extraction Actor powered by [Wick](https://getwick.dev), an open-source tool that uses Chrome's real network stack (Cronet) to fetch web pages. Because requests go through the same TLS and protocol stack as a real Chrome browser (BoringSSL, HTTP/2, QUIC), Wick can reach many sites that block raw HTTP clients. | ||
|
|
||
| ## When to use this Actor | ||
|
|
||
| - **Quick single-page fetches** where spinning up a full browser is overkill | ||
| - **LLM and RAG pipelines** that need clean markdown from web pages | ||
| - **Lightweight content extraction** at low memory cost (256 MB) | ||
| - **Complement to browser-based Actors** -- use Wick for the pages that don't need JS rendering, save browser compute for the pages that do | ||
|
|
||
| ## How it works | ||
|
|
||
| Under the hood, this Actor runs the Wick binary as a local HTTP API server inside the container. Wick makes requests using [Cronet](https://chromium.googlesource.com/chromium/src/+/master/components/cronet/) -- Chrome's network stack extracted as a standalone library. The response HTML is converted to clean markdown, stripping navigation, ads, and boilerplate. | ||
|
|
||
| No headless browser is launched. This makes it fast (~1-3s per page) and lightweight (256 MB vs typical 1-4 GB for browser-based Actors). | ||
|
|
||
| ## Modes | ||
|
|
||
| ### Fetch (default) | ||
|
|
||
| Fetches one or more URLs and returns clean content. Each URL becomes one row in the output dataset with title, content, status code, and timing. | ||
|
|
||
| ### Crawl | ||
|
|
||
| Starts from a URL and follows same-domain links. Returns content for every page discovered, each as a separate dataset row. Control depth (1-5) and max pages (1-50). | ||
|
|
||
| ### Map | ||
|
|
||
| Discovers URLs on a site by checking sitemap.xml and following links, up to the configured limit (default 100, maximum 5000). Returns a URL list without fetching content -- useful for planning a targeted crawl or building a sitemap. | ||
|
|
||
| ## Output | ||
|
|
||
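| Each dataset row contains a subset of the following fields, depending on the mode and on whether the fetch succeeded: | ||
|
|
||
| | Field | Description | | ||
| |-------|-------------| | ||
| | `url` | The URL that was fetched | | ||
| | `title` | Page title (fetch and crawl modes) | | ||
| | `content` | Page content in markdown, HTML, or plain text (fetch and crawl modes) | | ||
| | `urls` | Discovered URLs (map mode only) | | ||
| | `statusCode` | HTTP response status (fetch mode) | | ||
| | `timingMs` | Fetch duration in milliseconds (fetch and map modes) | | ||
| | `format` | Output format used | | ||
| | `engine` | `wick-local` (bundled engine) or `wick-tunnel` (residential IP) | | ||
| | `error` | Error message if the fetch failed | | ||
| | `fetchedAt` | ISO 8601 timestamp | | ||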
|
|
||
| ## Residential IP mode (optional) | ||
|
|
||
| If a site blocks datacenter IP addresses, you can connect this Actor to your own Wick instance running on your machine. Requests then route through your residential IP, combining Apify's scheduling and monitoring with your own network. | ||
|
|
||
| 1. Install [Wick Pro](https://getwick.dev) on your machine | ||
| 2. Start the API server: `wick serve --api` | ||
| 3. Expose it via a tunnel (Cloudflare Tunnel, ngrok, etc.) | ||
| 4. Enter the tunnel URL in the **Wick Tunnel URL** input field | ||
|
|
||
| ## Limitations | ||
|
|
||
| - **No JavaScript rendering** in the bundled engine. For JS-heavy SPAs, pair this Actor with a browser-based Actor like [Website Content Crawler](https://apify.com/apify/website-content-crawler) or use Wick's tunnel mode with a Pro instance that includes JS rendering. | ||
| - **Best for content pages.** Wick excels at articles, documentation, blogs, and product pages. For structured data extraction (e.g., specific fields from a listing), consider combining Wick's output with an LLM or a purpose-built scraper. | ||
|
|
||
| ## Pricing | ||
|
|
||
| This Actor is **free** -- you only pay for Apify compute units. The Wick engine is open source ([MIT license](https://github.com/wickproject/wick)). | ||
|
|
||
| Residential IP mode requires [Wick Pro](https://getwick.dev) ($20/month). | ||
|
|
||
| ## Resources | ||
|
|
||
| - [Wick documentation](https://getwick.dev/docs.html) | ||
| - [GitHub repository](https://github.com/wickproject/wick) | ||
| - [How Wick's TLS fingerprinting works](https://getwick.dev/blog/why-your-ai-agent-cant-read-the-web.html) |
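The residential IP mode described in the README comes down to sending the same HTTP calls to a different base URL. Below is a minimal sketch of how such a request could be assembled. The `/v1/fetch` path, the `url` and `format` query parameters, and the `Authorization: Bearer` header mirror what `src/main.js` in this PR does; the helper name `buildWickRequest` and the tunnel hostname are hypothetical, and this is not an official Wick client.

```javascript
// Hypothetical helper: builds the URL and headers for a Wick fetch call.
// Path, query parameters, and header shape mirror src/main.js in this PR.
function buildWickRequest({ baseUrl, apiKey, targetUrl, format = 'markdown' }) {
  // Drop trailing slashes so the joined path contains exactly one "/".
  const base = baseUrl.replace(/\/+$/, '');
  // URLSearchParams percent-encodes the target URL for us.
  const params = new URLSearchParams({ url: targetUrl, format });
  // Only send an Authorization header when an API key was provided.
  const headers = apiKey ? { Authorization: `Bearer ${apiKey}` } : {};
  return { url: `${base}/v1/fetch?${params}`, headers };
}

// Bundled engine: local server, no API key.
const local = buildWickRequest({
  baseUrl: 'http://127.0.0.1:18090',
  targetUrl: 'https://example.com/a b',
});

// Tunnel mode: hypothetical tunnel hostname plus an API key.
const tunneled = buildWickRequest({
  baseUrl: 'https://my-tunnel.example.com/',
  apiKey: 'secret',
  targetUrl: 'https://example.com',
});
```

The only differences between the two modes are the base URL and the presence of the bearer token, which is why the Actor can switch engines with a single input field.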
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| { | ||
| "actorSpecification": 1, | ||
| "name": "wick-web-fetcher", | ||
| "title": "Wick Web Fetcher — Browser-Grade Content Extraction", | ||
| "version": "1.0", | ||
| "buildTag": "latest", | ||
| "minMemoryMbytes": 256, | ||
| "maxMemoryMbytes": 1024, | ||
| "dockerfile": "../Dockerfile", | ||
| "readme": "./ACTOR.md", | ||
| "input": "./input_schema.json", | ||
| "storages": { | ||
| "dataset": "./dataset_schema.json" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,51 @@ | ||
| { | ||
| "actorSpecification": 1, | ||
| "fields": { | ||
| "url": { | ||
| "type": "string", | ||
| "description": "The URL that was fetched" | ||
| }, | ||
| "title": { | ||
| "type": "string", | ||
| "description": "Page title extracted from HTML", | ||
| "nullable": true | ||
| }, | ||
| "content": { | ||
| "type": "string", | ||
| "description": "Page content in the requested format (markdown, html, or text)", | ||
| "nullable": true | ||
| }, | ||
| "urls": { | ||
| "type": "array", | ||
| "description": "Discovered URLs (map mode only)", | ||
| "nullable": true | ||
| }, | ||
| "format": { | ||
| "type": "string", | ||
| "description": "Output format used" | ||
| }, | ||
| "statusCode": { | ||
| "type": "integer", | ||
| "description": "HTTP status code from the fetch", | ||
| "nullable": true | ||
| }, | ||
| "timingMs": { | ||
| "type": "integer", | ||
| "description": "Time to fetch in milliseconds", | ||
| "nullable": true | ||
| }, | ||
| "engine": { | ||
| "type": "string", | ||
| "description": "wick-local (bundled engine) or wick-tunnel (residential IP)" | ||
| }, | ||
| "error": { | ||
| "type": "string", | ||
| "description": "Error message if the fetch failed", | ||
| "nullable": true | ||
| }, | ||
| "fetchedAt": { | ||
| "type": "string", | ||
| "description": "ISO 8601 timestamp of when the page was fetched" | ||
| } | ||
| } | ||
| } |
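Because failed fetches are written to the same dataset as successful ones, a consumer has to separate the two. The sketch below shows one way to do that using the `error` field from the schema above. The function name `partitionRows` is hypothetical and the sample rows are made-up illustrations, not real Actor output.

```javascript
// Sketch: split rows read from the Actor's dataset into successes and failures.
// Field names (url, error, content, ...) come from dataset_schema.json.
function partitionRows(rows) {
  const ok = [];
  const failed = [];
  for (const row of rows) {
    // Rows written after a failed fetch carry an `error` string and no content.
    if (row.error) failed.push({ url: row.url, error: row.error });
    else ok.push(row);
  }
  return { ok, failed };
}

// Made-up sample rows shaped like the schema, for illustration only.
const sample = [
  {
    url: 'https://example.com',
    title: 'Example',
    content: '# Example',
    statusCode: 200,
    timingMs: 120,
    format: 'markdown',
    engine: 'wick-local',
    fetchedAt: '2024-01-01T00:00:00.000Z',
  },
  {
    url: 'https://example.org',
    error: 'Wick returned 502: bad gateway',
    engine: 'wick-local',
    fetchedAt: '2024-01-01T00:00:01.000Z',
  },
];

const { ok, failed } = partitionRows(sample);
```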
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| { | ||
| "title": "Wick Web Fetcher Input", | ||
| "type": "object", | ||
| "schemaVersion": 1, | ||
| "properties": { | ||
| "urls": { | ||
| "title": "URLs", | ||
| "type": "array", | ||
| "description": "List of URLs to fetch", | ||
| "editor": "stringList", | ||
| "prefill": ["https://www.nytimes.com"] | ||
| }, | ||
| "mode": { | ||
| "title": "Mode", | ||
| "type": "string", | ||
| "description": "fetch = single pages, crawl = follow links, map = discover URLs", | ||
| "enum": ["fetch", "crawl", "map"], | ||
| "default": "fetch" | ||
| }, | ||
| "format": { | ||
| "title": "Output Format", | ||
| "type": "string", | ||
| "description": "Format of the returned page content", | ||
| "enum": ["markdown", "html", "text"], | ||
| "default": "markdown" | ||
| }, | ||
| "maxPages": { | ||
| "title": "Max Pages (crawl mode)", | ||
| "type": "integer", | ||
| "description": "Maximum number of pages to fetch in crawl mode", | ||
| "default": 10, | ||
| "minimum": 1, | ||
| "maximum": 50 | ||
| }, | ||
| "maxDepth": { | ||
| "title": "Max Depth (crawl mode)", | ||
| "type": "integer", | ||
| "description": "Maximum link depth to follow from the start URL in crawl mode", | ||
| "default": 2, | ||
| "minimum": 1, | ||
| "maximum": 5 | ||
| }, | ||
| "mapLimit": { | ||
| "title": "Max URLs (map mode)", | ||
| "type": "integer", | ||
| "description": "Maximum number of URLs to discover in map mode", | ||
| "default": 100, | ||
| "minimum": 1, | ||
| "maximum": 5000 | ||
| }, | ||
| "wickTunnelUrl": { | ||
| "title": "Wick Tunnel URL (optional)", | ||
| "type": "string", | ||
| "description": "URL of your local Wick instance for residential IP routing. Leave blank to use Wick's built-in engine on Apify's infrastructure.", | ||
| "editor": "textfield", | ||
| "sectionCaption": "Advanced — Residential IP", | ||
| "sectionDescription": "Connect to your own Wick Pro instance to route requests through your residential IP instead of Apify's datacenter." | ||
| }, | ||
| "wickApiKey": { | ||
| "title": "Wick API Key (optional)", | ||
| "type": "string", | ||
| "description": "API key for your Wick tunnel endpoint", | ||
| "editor": "textfield", | ||
| "isSecret": true | ||
| } | ||
| }, | ||
| "required": ["urls"] | ||
| } |
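`src/main.js` in this PR only checks that `urls` is a non-empty array and otherwise relies on the schema being enforced before the Actor starts. As a sketch of what enforcing the remaining constraints in code could look like, the helper below applies the same defaults, enums, and bounds as the schema above. The name `normalizeInput` and the error messages are hypothetical.

```javascript
// Defaults and bounds copied from input_schema.json.
const LIMITS = {
  maxPages: { def: 10, min: 1, max: 50 },
  maxDepth: { def: 2, min: 1, max: 5 },
  mapLimit: { def: 100, min: 1, max: 5000 },
};
const MODES = ['fetch', 'crawl', 'map'];
const FORMATS = ['markdown', 'html', 'text'];

// Hypothetical helper: returns a normalized copy of the input or throws.
function normalizeInput(input) {
  const { urls, mode = 'fetch', format = 'markdown' } = input ?? {};
  if (!Array.isArray(urls) || urls.length === 0) {
    throw new Error('"urls" must be a non-empty array');
  }
  if (!MODES.includes(mode)) throw new Error(`unknown mode: ${mode}`);
  if (!FORMATS.includes(format)) throw new Error(`unknown format: ${format}`);

  const out = { urls, mode, format };
  for (const [key, { def, min, max }] of Object.entries(LIMITS)) {
    // Fall back to the schema default when the field is absent.
    const value = input[key] ?? def;
    if (!Number.isInteger(value) || value < min || value > max) {
      throw new Error(`${key} must be an integer in [${min}, ${max}]`);
    }
    out[key] = value;
  }
  return out;
}
```

Throwing rather than silently clamping keeps a misconfigured run from quietly crawling fewer pages than the caller asked for.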
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,29 @@ | ||
| FROM node:20-slim | ||
|
|
||
| # Install Wick binary + libcronet.so from GitHub release | ||
| ARG WICK_VERSION=0.7.0 | ||
| ARG WICK_SHA256=110d074072ff5fb334ca3d0123def3f9463d5298f9c6a48fa727a03d21f08ea9 | ||
|
|
||
| RUN apt-get update \ | ||
| && apt-get install -y --no-install-recommends curl ca-certificates \ | ||
| && rm -rf /var/lib/apt/lists/* | ||
|
|
||
| RUN cd /tmp \ | ||
| && curl -fsSL "https://github.com/wickproject/wick/releases/download/v${WICK_VERSION}/wick-linux-amd64.tar.gz" -o wick.tar.gz \ | ||
| && echo "${WICK_SHA256} wick.tar.gz" | sha256sum -c - \ | ||
| && tar xzf wick.tar.gz \ | ||
| && mv wick /usr/local/bin/wick \ | ||
| && mv libcronet.so /usr/local/lib/libcronet.so \ | ||
| && chmod +x /usr/local/bin/wick \ | ||
| && ldconfig \ | ||
| && rm wick.tar.gz | ||
|
|
||
| # Verify wick runs | ||
| RUN wick version | ||
|
|
||
| WORKDIR /app | ||
| COPY package.json . | ||
| RUN npm install --omit=dev | ||
| COPY . . | ||
|
|
||
| CMD ["node", "src/main.js"] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| { | ||
| "name": "wick-web-fetcher", | ||
| "version": "1.0.0", | ||
| "type": "module", | ||
| "dependencies": { | ||
| "apify": "^3.2.0" | ||
| } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,155 @@ | ||
| import { Actor } from 'apify'; | ||
| import { spawn } from 'child_process'; | ||
|
|
||
| const WICK_PORT = 18090; | ||
| const WICK_BASE = `http://127.0.0.1:${WICK_PORT}`; | ||
|
|
||
| await Actor.init(); | ||
|
|
||
| const input = (await Actor.getInput()) ?? {}; | ||
| const { | ||
| urls, | ||
| mode = 'fetch', | ||
| format = 'markdown', | ||
| maxPages = 10, | ||
| maxDepth = 2, | ||
| mapLimit = 100, | ||
| wickTunnelUrl, | ||
| wickApiKey, | ||
| } = input; | ||
|
|
||
| if (!Array.isArray(urls) || urls.length === 0) { | ||
| Actor.log.error('Input must include a non-empty "urls" array.'); | ||
| await Actor.exit({ exitCode: 1 }); | ||
| } | ||
|
|
||
| const dataset = await Actor.openDataset(); | ||
| const useTunnel = !!wickTunnelUrl; | ||
| const baseUrl = useTunnel ? wickTunnelUrl.replace(/\/+$/, '') : WICK_BASE; | ||
| const headers = wickApiKey ? { Authorization: `Bearer ${wickApiKey}` } : {}; | ||
|
|
||
| // Start the bundled Wick API server if not using a tunnel | ||
| let wickProcess; | ||
| if (!useTunnel) { | ||
| Actor.log.info('Starting Wick API server...'); | ||
| wickProcess = spawn('/usr/local/bin/wick', ['serve', '--api', '--port', String(WICK_PORT)], { | ||
| env: { ...process.env, LD_LIBRARY_PATH: '/usr/local/lib' }, | ||
| stdio: ['ignore', 'pipe', 'pipe'], | ||
| }); | ||
|
|
||
|
||
| wickProcess.on('error', (err) => { | ||
| Actor.log.error(`Failed to start Wick: ${err.message}`); | ||
| Actor.exit({ exitCode: 1 }); | ||
| }); | ||
|
|
||
| wickProcess.stdout.on('data', (chunk) => { | ||
| const msg = chunk.toString().trimEnd(); | ||
| if (msg) Actor.log.info(`[wick] ${msg}`); | ||
| }); | ||
| wickProcess.stderr.on('data', (chunk) => { | ||
| const msg = chunk.toString().trimEnd(); | ||
| if (msg) Actor.log.warning(`[wick] ${msg}`); | ||
| }); | ||
|
|
||
| // Wait for server to be ready | ||
| let ready = false; | ||
| for (let i = 0; i < 30; i++) { | ||
| try { | ||
| const resp = await fetch(`${WICK_BASE}/health`); | ||
| if (resp.ok) { ready = true; break; } | ||
| } catch { /* not ready yet */ } | ||
| await new Promise(r => setTimeout(r, 500)); | ||
| } | ||
|
|
||
| if (!ready) { | ||
| Actor.log.error('Wick API server failed to start within 15s'); | ||
| await Actor.exit({ exitCode: 1 }); | ||
| } | ||
| Actor.log.info('Wick API server ready'); | ||
| } else { | ||
| Actor.log.info(`Using Wick tunnel at ${wickTunnelUrl}`); | ||
| } | ||
|
|
||
| async function wickFetch(url) { | ||
| const params = new URLSearchParams({ url, format }); | ||
| const resp = await fetch(`${baseUrl}/v1/fetch?${params}`, { headers }); | ||
| if (!resp.ok) throw new Error(`Wick returned ${resp.status}: ${await resp.text()}`); | ||
| return resp.json(); | ||
| } | ||
|
|
||
| async function wickCrawl(url) { | ||
| const params = new URLSearchParams({ | ||
| url, format, max_pages: String(maxPages), max_depth: String(maxDepth), | ||
| }); | ||
| const resp = await fetch(`${baseUrl}/v1/crawl?${params}`, { headers }); | ||
| if (!resp.ok) throw new Error(`Wick returned ${resp.status}: ${await resp.text()}`); | ||
| return resp.json(); | ||
| } | ||
|
|
||
| async function wickMap(url) { | ||
| const params = new URLSearchParams({ url, limit: String(mapLimit) }); | ||
| const resp = await fetch(`${baseUrl}/v1/map?${params}`, { headers }); | ||
| if (!resp.ok) throw new Error(`Wick returned ${resp.status}: ${await resp.text()}`); | ||
|
||
| return resp.json(); | ||
| } | ||
|
|
||
| const engine = useTunnel ? 'wick-tunnel' : 'wick-local'; | ||
|
|
||
| for (const url of urls) { | ||
| try { | ||
| Actor.log.info(`${mode}: ${url}`); | ||
|
|
||
| if (mode === 'crawl') { | ||
| const result = await wickCrawl(url); | ||
| for (const page of result.pages || []) { | ||
| await dataset.pushData({ | ||
| url: page.url, | ||
| title: page.title || null, | ||
| content: page.content, | ||
| format, | ||
| fetchedAt: new Date().toISOString(), | ||
| engine, | ||
| }); | ||
| } | ||
| Actor.log.info(`Crawled ${result.pages?.length || 0} pages from ${url}`); | ||
| } else if (mode === 'map') { | ||
| const result = await wickMap(url); | ||
| await dataset.pushData({ | ||
| url, | ||
| urls: result.urls, | ||
| format: 'urls', | ||
| timingMs: result.timing_ms, | ||
| fetchedAt: new Date().toISOString(), | ||
| engine, | ||
| }); | ||
| Actor.log.info(`Mapped ${result.count} URLs from ${url}`); | ||
| } else { | ||
| const result = await wickFetch(url); | ||
| await dataset.pushData({ | ||
| url, | ||
| title: result.title || null, | ||
| content: result.content, | ||
| statusCode: result.status, | ||
| timingMs: result.timing_ms, | ||
| format, | ||
| fetchedAt: new Date().toISOString(), | ||
| engine, | ||
| }); | ||
| } | ||
| } catch (err) { | ||
| Actor.log.error(`Failed: ${url}: ${err.message}`); | ||
| await dataset.pushData({ | ||
| url, | ||
| error: err.message, | ||
| fetchedAt: new Date().toISOString(), | ||
| engine, | ||
| }); | ||
| } | ||
| } | ||
|
|
||
| // Clean up | ||
| if (wickProcess) { | ||
| wickProcess.kill('SIGTERM'); | ||
| } | ||
|
|
||
| await Actor.exit(); | ||
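The startup wait in `src/main.js` is an inline loop of 30 polls spaced 500 ms apart. One possible refactor, shown here only as a sketch, pulls that loop into a reusable helper whose readiness check is passed in. The name `waitForReady` and its option names are hypothetical; the defaults match the loop in this PR.

```javascript
// Hypothetical helper: polls `check` until it resolves truthy or attempts run out.
// Defaults (30 attempts, 500 ms apart) match the inline loop in src/main.js.
async function waitForReady(check, { attempts = 30, delayMs = 500 } = {}) {
  for (let i = 0; i < attempts; i++) {
    try {
      if (await check()) return true;
    } catch {
      // A thrown error (for example, connection refused) means "not ready yet".
    }
    // Pause before the next poll.
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

In the Actor this could be called with a check that fetches the `/health` endpoint and returns `resp.ok`, exiting with an error when the helper resolves to `false`. Injecting the check also makes the retry logic testable without starting a real Wick process.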