Examples

Step-by-step extraction examples for common use cases

These examples walk through practical extraction setups — schema, website description, and expected output.


Hacker News front page

Goal: Extract the current top stories.

Schema

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "rank":          { "type": "number" },
      "title":         { "type": "string" },
      "url":           { "type": "string", "format": "uri" },
      "points":        { "type": "number" },
      "author":        { "type": "string" },
      "comment_count": { "type": "number" },
      "posted_time":   { "type": "string" }
    }
  }
}

Setup

  • URL: https://news.ycombinator.com/
  • Description: "Extract all stories on the front page. Each story has a title, link, point count, author, comment count, and rank number."

What the agent does

Hacker News uses a simple HTML table. The agent finds the repeating row structure and maps all 30 stories in one pass. No scrolling or pagination is needed for the default front page.
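With the schema above, a successful run returns an array of story objects. The values below are illustrative, not live data:

[
  {
    "rank": 1,
    "title": "Example story title",
    "url": "https://example.com/article",
    "points": 312,
    "author": "pg",
    "comment_count": 148,
    "posted_time": "3 hours ago"
  }
]

Note that posted_time comes back as the relative string HN displays ("3 hours ago"), since the schema declares it as a plain string with no format.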

Tips

  • To collect more than 30 stories, add to the description: "Click the 'More' link at the bottom to load the next page. Collect stories from the first 3 pages."
  • Ask HN / Show HN posts link to the HN discussion thread, not an external URL. The url field will be the HN thread URL in these cases.

Sports event listings

Goal: Extract upcoming events from a ticketing or league schedule page.

Schema

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "event_name":  { "type": "string" },
      "date":        { "type": "string", "format": "date" },
      "time":        { "type": "string" },
      "venue":       { "type": "string" },
      "home_team":   { "type": "string" },
      "away_team":   { "type": "string" },
      "price_from":  { "type": "number", "description": "lowest available ticket price in USD" },
      "tickets_url": { "type": "string", "format": "uri" }
    }
  }
}

Setup

  • URL: The event listing page URL
  • Description: "This is a sports event listing page. Extract all upcoming events including team names, dates, venues, and lowest ticket price. Scroll down to load all events if necessary."

What the agent does

Most event listing pages are JavaScript-heavy. The agent waits for content to render, scrolls to trigger lazy loading, and identifies the repeating event card pattern. If filter tabs are present (This Week / This Month / All), it extracts from the default view unless instructed otherwise.
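One output row, with illustrative values, might look like this. Because the date field declares "format": "date", expect ISO 8601 dates (YYYY-MM-DD) rather than the display text on the page:

{
  "event_name": "Home Team vs. Away Team",
  "date": "2025-03-14",
  "time": "7:30 PM",
  "venue": "Example Arena",
  "home_team": "Home Team",
  "away_team": "Away Team",
  "price_from": 89,
  "tickets_url": "https://tickets.example.com/event/12345"
}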

Tips

  • For heavily bot-protected sites (Ticketmaster, StubHub), use the local browser (Camoufox/Firefox). It handles fingerprint detection significantly better than Chromium.
  • Location-based filtering — some sites show different events based on the browser's detected location. Specify a city or region in the description if needed.
  • If the site exposes its event data via a network API (common on React/Next.js sites), the single-page agent will find and use it — producing faster and more complete results than DOM parsing.

Product listings

Goal: Extract all products from a category page of an e-commerce store.

Schema

{
  "type": "array",
  "items": {
    "type": "object",
    "properties": {
      "product_name":   { "type": "string" },
      "price":          { "type": "number" },
      "original_price": { "type": "number", "description": "pre-discount price if on sale" },
      "rating":         { "type": "number" },
      "review_count":   { "type": "number" },
      "in_stock":       { "type": "boolean" },
      "product_url":    { "type": "string", "format": "uri" }
    }
  }
}

Setup

  • URL: A product category page (e.g., https://store.example.com/laptops)
  • Description: "This is an e-commerce product listing page. Extract all products including name, price, rating, and availability. Skip sponsored or ad results."

What the agent does

The agent identifies the repeating product card structure, handles price formatting (strips currency symbols, parses "was / now" pricing), and scrolls for infinite-scroll pages. For React-based stores, it often finds the product data in network responses or __NEXT_DATA__ — returning cleaner data than DOM scraping.
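A single extracted row, with illustrative values, looks like this. The price fields are parsed to plain numbers, with currency symbols stripped per the behavior described above:

{
  "product_name": "Example 14-inch Laptop",
  "price": 749.99,
  "original_price": 899.99,
  "rating": 4.5,
  "review_count": 1203,
  "in_stock": true,
  "product_url": "https://store.example.com/laptops/example-14"
}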

Tips

  • original_price will be null for items not on sale. If you validate the output against the schema yourself, declare the field as "type": ["number", "null"] so explicit nulls pass a strict JSON Schema check.
  • For Amazon, the page HTML is complex. The agent tends to perform better using the network API approach if the product listing data is available via XHR.
  • Add "required": ["product_name", "price"] to your schema to ensure the agent only returns rows where both are present.
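Applying the last tip: "required" belongs inside the item schema, next to "properties", not at the top level of the array. A trimmed version of the product schema with it in place:

{
  "type": "array",
  "items": {
    "type": "object",
    "required": ["product_name", "price"],
    "properties": {
      "product_name": { "type": "string" },
      "price":        { "type": "number" }
    }
  }
}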

General tips

Be specific in descriptions. "Extract all product listings from the search results grid, including the price shown in the bottom-right of each card" is better than "get the products."

Start with a single page. Confirm the output matches your schema before scaling to pagination or multi-page extractions.

Use the live view. Watch the agent in real time. If it navigates somewhere unexpected, stop the run, adjust the description, and retry.

Save playbooks. Once an extraction works, save it. Playbook runs skip the full AI exploration and execute the saved script directly — faster and cheaper for recurring use.