diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml new file mode 100644 index 00000000..53ce5c3e --- /dev/null +++ b/.github/workflows/ci.yml @@ -0,0 +1,34 @@ +name: CI + +on: + pull_request: + push: + branches: + - main + - master + - "cursor/**" + +jobs: + quality: + runs-on: ubuntu-latest + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Setup Node + uses: actions/setup-node@v4 + with: + node-version: 20 + cache: yarn + + - name: Install dependencies + run: yarn install --frozen-lockfile + + - name: Lint + run: yarn lint + + - name: Build + run: yarn build + + - name: Test + run: yarn test --runInBand diff --git a/AGENTS.md b/AGENTS.md index 1c981ec5..0dea3910 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -25,7 +25,7 @@ - `yarn test` launches Jest; set `CI=true` for coverage and deterministic snapshots. - `yarn cli -c "..." [--debug --hyperbrowser --mcp ]` runs the agent; `--hyperbrowser` switches to the remote provider and `--debug` drops artifacts into `debug/`. - `yarn example ` (backed by `ts-node -r tsconfig-paths/register`) is the quickest way to execute flows in `examples/` or `scripts/`. -- DOM metadata builds at runtime via the a11y provider; the legacy `build-dom-tree-script` entry points at a removed file: avoid relying on it until refreshed. +- DOM metadata builds at runtime via the a11y provider; use runtime capture flows and `yarn example` probes rather than any removed legacy DOM-builder entrypoint. ## Agent Runtime & Integrations - The agent loop (`agent/tools/agent.ts`) captures the accessibility tree via `captureDOMState` (text-first, optional streaming and snapshot cache). Visual overlays/screenshots are opt-in (`enableVisualMode`) and composited with CDP screenshots for `page.ai`. 
diff --git a/README.md b/README.md index e1d0705a..ad1cf5d8 100644 --- a/README.md +++ b/README.md @@ -110,6 +110,56 @@ console.log(res); await agent.closeAgent(); ``` +### Async Task Controls + +`executeTaskAsync()` returns a control handle you can pause/resume/cancel, plus a `result` promise you can await when you need the final output. + +```typescript +const task = await agent.executeTaskAsync("Sign in and fetch account details"); + +// Optional runtime control +task.pause(); +task.resume(); + +// Await final outcome at any time +const final = await task.result; +console.log(final.status, final.output); + +// Task failures reject with HyperagentTaskError (includes taskId + cause) +task.result.catch((error) => { + console.error(error.taskId, error.cause?.message); +}); +``` + +### Migration Notes (Current Runtime Contract) + +- **`page.perform()` is the canonical single-action API.** + - `page.aiAction()` is still available as a compatibility alias. + - Calling `page.aiAction()` emits a one-time deprecation warning per agent instance. + - Prefer `perform` for all new code and docs. +- **`executeTaskAsync()` now has a first-class completion promise.** + - Use `task.result` to await the final output deterministically. +- **Top-level type exports are available from `@hyperbrowser/agent`.** + - You can import common task/cache/config types from the package root instead of internal paths. +- **CDP remains configurable per agent.** + - If needed for a workflow, disable CDP with `cdpActions: false` to force Playwright fallback. +- **Single-action debug artifacts use canonical perform naming.** + - `executeSingleAction` debug output is written under `debug/perform/...`. 
+ +### Importing Public Types + +Core workflow types are available directly from the package entrypoint: + +```typescript +import type { + ActionCacheOutput, + AgentTaskOutput, + HyperAgentConfig, + PerformTaskParams, + TaskOutput, +} from "@hyperbrowser/agent"; +``` + ## Two Modes of Operation HyperAgent provides two complementary APIs optimized for different use cases: @@ -135,8 +185,26 @@ await page.goto("https://example.com/login"); await page.perform("fill email with user@example.com"); await page.perform("fill password with mypassword"); await page.perform("click the login button"); + +// Optional retries tuning for single-action mode +await page.perform("click the login button", { + maxElementRetries: 5, + retryDelayMs: 250, + maxContextSwitchRetries: 4, + contextSwitchRetryDelayMs: 500, +}); ``` +**Perform retry options**: +- `maxElementRetries`: attempts to refetch/find a target element before failing. +- `retryDelayMs`: delay between element-refetch retries. +- `maxContextSwitchRetries`: retries when a tab/context switch interrupts an in-flight action. +- `contextSwitchRetryDelayMs`: delay between context-switch retries (defaults to 500ms, capped for safety). +- `cdpActions`: override CDP usage for this call (`true` by default from agent config). +- `filterAdTrackingFrames`: override iframe filtering for this action (`true` by default). Set to `false` when you intentionally need ad/tracking iframes in scope. +- `maxSteps` (**deprecated**): compatibility alias for `maxElementRetries`. + - Using `maxSteps` emits a one-time deprecation warning per agent instance. 
+ ### 🧠 `page.ai()` - Complex Multi-Step Tasks **Best for**: Complex workflows requiring multiple steps and visual context @@ -152,6 +220,8 @@ await page.perform("click the login button"); - `useDomCache` (boolean): Reuse DOM snapshots for speed - `enableVisualMode` (boolean): Enable screenshots and overlays (default: false) +- `cdpActions` (boolean): override CDP usage for this task (inherits agent-level default when omitted) +- `filterAdTrackingFrames` (boolean): override ad/tracking iframe filtering for this task (inherits agent-level default when omitted) **Example**: @@ -301,6 +371,12 @@ const agent = new HyperAgent({ }); ``` +`llm` must be either: +- a provider config object (`{ provider, model, ... }`), or +- an object implementing the HyperAgent LLM client interface (`invoke`, `invokeStructured`, `getProviderId`, `getModelId`, `getCapabilities`). + +Invalid/malformed `llm` payloads fail fast with a configuration error. + ### MCP Support HyperAgent functions as a fully functional MCP client. For best results, we recommend using @@ -339,6 +415,17 @@ console.log(response); await agent.closeAgent(); ``` +You can dynamically disconnect MCP servers later: + +```typescript +// Fire-and-forget disconnect +agent.disconnectFromMCPServer("server-id"); + +// Awaited disconnect with success/failure result +const didDisconnect = await agent.disconnectFromMCPServerAsync("server-id"); +console.log({ didDisconnect }); +``` + ### Custom Actions HyperAgent's capabilities can be extended with custom actions. 
Custom actions require 3 things: @@ -448,6 +535,8 @@ const page = await agent.newPage(); const replay = await page.runFromActionCache(cache, { maxXPathRetries: 3, // Retry XPath resolution up to 3 times before LLM fallback debug: true, + cdpActions: true, // Optional: override CDP usage for this replay + filterAdTrackingFrames: false, // Optional: include ad/tracking iframes during replay resolution }); console.log(replay); @@ -494,7 +583,11 @@ HyperAgent integrates seamlessly with Playwright, so you can still use familiar - **Deep Iframe Support**: Tracking across nested and cross-origin iframes (OOPIFs) - **Exact Coordinates**: Actions use precise CDP coordinates for reliability +Frame filtering normalizes protocol-relative and scheme-less iframe URLs for host-based matching, while intentionally avoiding host-based ad matches for path-only URLs. + Keep in mind that CDP is still experimental, and stability is not guaranteed. If you'd like the agent to use Playwright's native locators/actions instead, set `cdpActions: false` when you create the agent and it will fall back automatically. +If you need to inspect ad/tracking iframes for a specific workflow, keep CDP enabled and set `filterAdTrackingFrames: false` in `new HyperAgent({ ... })`. +You can also override this per invocation using `page.ai(..., { filterAdTrackingFrames: false })` or `page.perform(..., { filterAdTrackingFrames: false })`. The CDP layer is still evolving: expect rapid polish (and the occasional sharp edge). If you hit something quirky you can toggle CDP off for that workflow and drop us a bug report. diff --git a/currentState.md b/currentState.md index 09fb7991..a8439bb3 100644 --- a/currentState.md +++ b/currentState.md @@ -1,499 +1,251 @@ -# HyperAgent Current State Analysis - -## Overview -HyperAgent is a browser automation SDK that uses LLM-powered agents to execute tasks on web pages. 
It provides both imperative page methods (`page.ai()`, `page.extract()`) and a programmatic task execution API. - ---- - -## Core Architecture - -### 1. Entry Points & Public API - -#### **HyperAgent Class** ([src/agent/index.ts](src/agent/index.ts)) - -The main class that orchestrates everything: - -```typescript -class HyperAgent { - // Core methods - async executeTask(task: string, params?: TaskParams, initPage?: Page): Promise - async executeTaskAsync(task: string, params?: TaskParams, initPage?: Page): Promise - - // Page management - async getCurrentPage(): Promise - async newPage(): Promise - async getPages(): Promise - - // Browser lifecycle - async initBrowser(): Promise - async closeAgent(): Promise -} -``` - -#### **HyperPage Interface** ([src/agent/index.ts:567-605](src/agent/index.ts#L567-L605)) - -Enhanced Playwright `Page` with AI methods: - -```typescript -interface HyperPage extends Page { - // Execute a task on this page - ai(task: string, params?: TaskParams): Promise - - // Execute task asynchronously (non-blocking) - aiAsync(task: string, params?: TaskParams): Promise - - // Extract structured data - extract( - task?: string, - outputSchema?: z.AnyZodObject, - params?: TaskParams - ): Promise -} -``` - -**Key Implementation Details:** -- `page.ai()` → calls `agent.executeTask(task, params, page)` ([line 569-570](src/agent/index.ts#L569-L570)) -- `page.extract()` → wraps `executeTask()` with extraction-specific prompts ([lines 573-603](src/agent/index.ts#L573-L603)) - - Adds `maxSteps: 2` by default for extractions - - Prepends extraction instructions to the task - - Parses JSON output if outputSchema provided - ---- - -## 2. Task Execution Flow - -### **Main Task Loop** ([src/agent/tools/agent.ts:105-306](src/agent/tools/agent.ts#L105-L306)) - -``` -runAgentTask() - ├── 1. 
Get DOM State (getDom) - │ ├── Inject JavaScript into page - │ ├── Find interactive elements - │ ├── Draw numbered overlay (canvas) - │ └── Capture screenshot with overlay - │ - ├── 2. Build Agent Messages (buildAgentStepMessages) - │ ├── System prompt - │ ├── Task description - │ ├── Previous steps context - │ ├── DOM representation (text) - │ └── Screenshot (base64 image) - │ - ├── 3. Invoke LLM (llm.invokeStructured) - │ ├── Request structured output (Zod schema) - │ └── Get list of actions to execute - │ - ├── 4. Execute Actions (runAction) - │ ├── For each action in list - │ ├── Run action handler - │ └── Wait 2 seconds between actions - │ - └── 5. Repeat until complete/cancelled/maxSteps -``` - -**Location:** [src/agent/tools/agent.ts:132-291](src/agent/tools/agent.ts#L132-L291) - ---- - -## 3. DOM State Extraction - -### **Current Implementation: Visual DOM with Canvas Overlay** - -#### **Entry Point:** `getDom(page)` ([src/context-providers/dom/index.ts:5-18](src/context-providers/dom/index.ts#L5-L18)) - -```typescript -export const getDom = async (page: Page): Promise => { - const result = await page.evaluate(buildDomViewJs) as DOMStateRaw; - return { - elements: Map, - domState: string, // Text representation - screenshot: string // Base64 PNG with overlays - }; -}; -``` - -#### **Build DOM View** ([src/context-providers/dom/build-dom-view.ts:54-130](src/context-providers/dom/build-dom-view.ts#L54-L130)) - -**Process:** -1. **Find Interactive Elements** ([find-interactive-elements.ts:4-63](src/context-providers/dom/find-interactive-elements.ts#L4-L63)) - - Traverse entire DOM including Shadow DOM and iframes - - Check each element with `isInteractiveElem(element)` - - Returns `InteractiveElement[]` with metadata - -2. 
**Render Highlights Offscreen** ([highlight.ts:105-222](src/context-providers/dom/highlight.ts#L105-L222)) - - Create `OffscreenCanvas` with viewport dimensions - - Draw colored rectangles around each interactive element - - Draw numbered labels (1, 2, 3...) on each element - - Return `ImageBitmap` - -3. **Composite Screenshot** ([agent.ts:33-42](src/agent/tools/agent.ts#L33-L42)) - ```typescript - const compositeScreenshot = async (page: Page, overlay: string) => { - const screenshot = await page.screenshot({ type: "png" }); - // Overlay numbered boxes onto base screenshot using Jimp - baseImage.composite(overlayImage, 0, 0); - return buffer.toString("base64"); - }; - ``` - -4. **Build Text Representation** ([build-dom-view.ts:78-123](src/context-providers/dom/build-dom-view.ts#L78-L123)) - ``` - [1] - [2] - Some text between elements - [3]View Pricing - ``` - -**Output Structure:** -```typescript -interface DOMState { - elements: Map // index → element mapping - domState: string // [idx]text format - screenshot: string // base64 PNG with overlays -} -``` - ---- - -## 4. 
Action System - -### **Available Actions** ([src/agent/actions/](src/agent/actions/)) - -| Action | Purpose | Key Parameters | Location | -|--------|---------|----------------|----------| -| `clickElement` | Click an element | `index: number` | [click-element.ts](src/agent/actions/click-element.ts) | -| `inputText` | Fill input field | `index: number, text: string` | [input-text.ts](src/agent/actions/input-text.ts) | -| `extract` | Extract data | `objective: string` | [extract.ts](src/agent/actions/extract.ts) | -| `goToUrl` | Navigate to URL | `url: string` | [go-to-url.ts](src/agent/actions/go-to-url.ts) | -| `selectOption` | Select dropdown | `index: number, option: string` | [select-option.ts](src/agent/actions/select-option.ts) | -| `scroll` | Scroll page | `direction: "up"\|"down"` | [scroll.ts](src/agent/actions/scroll.ts) | -| `keyPress` | Press keyboard key | `key: string` | [key-press.ts](src/agent/actions/key-press.ts) | -| `complete` | End task | `output?: string` | [complete.ts](src/agent/actions/complete.ts) | - -### **Action Execution** ([src/agent/tools/agent.ts:71-103](src/agent/tools/agent.ts#L71-L103)) - -#### **Click Element Example** ([click-element.ts:18-57](src/agent/actions/click-element.ts#L18-L57)) - -```typescript -run: async function (ctx: ActionContext, action: ClickElementActionType) { - const { index } = action; - const locator = getLocator(ctx, index); // Get element by index - - await locator.scrollIntoViewIfNeeded({ timeout: 2500 }); - await locator.waitFor({ state: "visible", timeout: 2500 }); - await waitForElementToBeEnabled(locator, 2500); - await waitForElementToBeStable(locator, 2500); - - await locator.click({ force: true }); - return { success: true, message: `Clicked element with index ${index}` }; -} -``` - -**Element Selection:** ([actions/utils.ts](src/agent/actions/utils.ts)) -```typescript -export const getLocator = (ctx: ActionContext, index: number): Locator | null => { - const element = 
ctx.domState.elements.get(index); - if (!element) return null; - return ctx.page.locator(element.cssPath); // Use CSS path selector -}; -``` - ---- - -## 5. Key Workflows - -### **Workflow 1: `page.ai("click the login button")`** - -1. User calls `page.ai("click the login button")` -2. → `agent.executeTask(task, params, page)` ([index.ts:569](src/agent/index.ts#L569)) -3. → `runAgentTask()` starts task loop ([agent.ts:105](src/agent/tools/agent.ts#L105)) -4. → `getDom(page)` extracts DOM + screenshot ([agent.ts:155](src/agent/tools/agent.ts#L155)) - - Injects JS to find interactive elements - - Draws numbered overlays - - Composites screenshot -5. → `buildAgentStepMessages()` creates LLM prompt ([agent.ts:201](src/agent/tools/agent.ts#L201)) -6. → `llm.invokeStructured()` gets action plan ([agent.ts:220](src/agent/tools/agent.ts#L220)) -7. → Execute actions ([agent.ts:253-275](src/agent/tools/agent.ts#L253-L275)) - - LLM returns: `{ type: "clickElement", params: { index: 5 } }` - - `runAction()` calls `ClickElementActionDefinition.run()` - - Gets locator for element 5 - - Clicks element via Playwright -8. → Repeat loop or mark complete - -### **Workflow 2: `page.extract("product prices", schema)`** - -1. User calls `page.extract("product prices", PriceSchema)` -2. → Wraps task: "You have to perform an extraction on the current page..." ([index.ts:586-590](src/agent/index.ts#L586-L590)) -3. → Sets `maxSteps: 2` (extractions are quick) ([index.ts:581](src/agent/index.ts#L581)) -4. → Adds `outputSchema` to actions ([index.ts:584](src/agent/index.ts#L584)) -5. → `executeTask()` runs normal agent loop -6. → LLM returns structured output matching schema -7. 
→ Parse JSON and return typed result ([index.ts:592](src/agent/index.ts#L592)) - -### **Workflow 3: Extract Action (Internal)** - -The `extract` action is **different** from `page.extract()`: - -**Location:** [src/agent/actions/extract.ts](src/agent/actions/extract.ts) - -```typescript -run: async (ctx: ActionContext, action: ExtractActionType) => { - // Get page HTML - const content = await ctx.page.content(); - const markdown = await parseMarkdown(content); - - // Take screenshot via CDP - const cdpSession = await ctx.page.context().newCDPSession(ctx.page); - const screenshot = await cdpSession.send("Page.captureScreenshot"); - - // Call LLM with markdown + screenshot - const response = await ctx.llm.invoke([{ - role: "user", - content: [ - { type: "text", text: `Extract: "${objective}"\n\n${markdown}` }, - { type: "image", url: `data:image/png;base64,${screenshot.data}` } - ] - }]); - - return { success: true, message: `Extracted: ${content}` }; -} -``` - -**This is an action the agent can choose** during task execution, not the page-level method. - ---- - -## 6. DOM State Representation - -### **Current Approach: Visual DOM + Numbered Overlay** - -**Strengths:** -- ✅ Simple index-based selection (LLM just says "5") -- ✅ Visual feedback in screenshots -- ✅ Works well with vision models - -**Weaknesses:** -- ❌ Screenshot required every step (slow) -- ❌ Screenshot → LLM → token cost is high -- ❌ Numbered overlay can occlude important UI -- ❌ Full DOM traversal every step (no caching) -- ❌ Large token counts (screenshot + DOM text) - -**Performance:** -- ~8,000-15,000 tokens per step -- ~1,500-3,000ms per action -- No caching mechanism - ---- - -## 7. 
Element Discovery - -### **Interactive Element Detection** ([src/context-providers/dom/elem-interactive.ts](src/context-providers/dom/elem-interactive.ts)) - -**Current Rules:** -```typescript -isInteractiveElem(element: HTMLElement): { isInteractive: boolean, reason?: string } -``` - -**Checks (in order):** -1. Native interactive tags: `button`, `a[href]`, `input`, `select`, `textarea` -2. ARIA roles: `button`, `link`, `tab`, `checkbox`, `menuitem` -3. Event listeners: `data-has-interactive-listener="true"` (injected) -4. Contenteditable elements -5. Elements with `onclick` attribute -6. Cursor style: `cursor: pointer` -7. Custom detection for common patterns - -**Ignored Elements:** -- Hidden elements (`display: none`, `visibility: hidden`) -- Zero-dimension elements -- Disabled elements -- Script and style tags - ---- - -## 8. Message Building - -### **Prompt Construction** ([src/agent/messages/builder.ts](src/agent/messages/builder.ts)) - -**Message Structure:** -```typescript -[ - { role: "system", content: SYSTEM_PROMPT }, - { role: "user", content: [ - { type: "text", text: "Task: click login button\n\nDOMState:\n[1]