Category

Structured browser perception

What it is, and why agents need it. Definitions, the reliability-compounding argument, and how the approaches compare.

Definition

Structured browser perception is the practice of giving an AI agent a machine-readable representation of a live rendered page — element identity, text, layout, state — instead of pixels or raw markup. The agent reasons over structure it can address, not an image it has to re-interpret or a markup dump it has to wade through.
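
As a rough illustration of what "structure it can address" could mean, the sketch below models a page as a list of element records carrying identity, text, layout, and state. Every name in it (Element, PageSnapshot, element_id, find) is hypothetical and chosen for this example only; it is not SiFR or any particular product's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One addressable element of a rendered page (hypothetical schema)."""
    element_id: str                      # stable identity the agent can refer back to
    role: str                            # e.g. "button", "textbox", "link"
    text: str                            # visible text or label
    bounds: tuple[int, int, int, int]    # layout: x, y, width, height in pixels
    state: dict[str, bool] = field(default_factory=dict)  # e.g. {"enabled": True}


@dataclass
class PageSnapshot:
    """A machine-readable view of one rendered page (hypothetical schema)."""
    url: str
    elements: list[Element]

    def by_id(self, element_id: str) -> Element:
        """Address an element directly by its identity."""
        for element in self.elements:
            if element.element_id == element_id:
                return element
        raise KeyError(element_id)

    def find(self, role: str, text: str) -> list[Element]:
        """Return elements whose role matches and whose text contains `text`."""
        needle = text.lower()
        return [
            e for e in self.elements
            if e.role == role and needle in e.text.lower()
        ]


# Illustrative data only; not captured from a real page.
snapshot = PageSnapshot(
    url="https://example.com/checkout",
    elements=[
        Element("e1", "textbox", "Email", (40, 120, 320, 36), {"enabled": True}),
        Element("e2", "button", "Submit order", (40, 180, 140, 40), {"enabled": False}),
    ],
)

submit = snapshot.find(role="button", text="submit")[0]
print(submit.element_id, submit.state["enabled"])
```

In a representation along these lines, the agent can see that the button exists, where it is, and that it is currently disabled, without re-reading an image or parsing raw markup.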

Why the category exists: reliability compounds

Published evaluations of vision-based agents show per-action success rates around 80–85 percent on real desktop and web tasks. That sounds workable until the steps compound.

A 20-step workflow at 0.83 per action completes roughly 2 percent of the time.

$0.83^{20} \approx 0.02$

Perception that resolves elements structurally rather than visually attacks the per-action term directly: raise the reliability of each step and the whole workflow stops collapsing. This is an argument about architecture, not about any single product.
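
The compounding arithmetic can be checked with a short sketch. It assumes each step succeeds independently with the same probability, which is a simplification; the function names and the extra per-action rates (0.95, 0.99) are illustrative and not taken from any benchmark.

```python
def workflow_success(per_action: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent, equally reliable steps."""
    return per_action ** steps


def required_per_action(target: float, steps: int) -> float:
    """Per-action reliability needed to reach a target workflow success rate."""
    return target ** (1.0 / steps)


# Compare the 0.83 figure from the text with two hypothetical higher rates.
for rate in (0.83, 0.95, 0.99):
    print(f"{rate:.2f} per action over 20 steps: {workflow_success(rate, 20):.3f}")

# Per-action reliability a 20-step workflow would need to finish 90% of the time.
print(f"needed for 90% over 20 steps: {required_per_action(0.90, 20):.4f}")
```

Under these assumptions, the inverse calculation makes the same point from the other direction: a long workflow only becomes dependable when each individual step is very close to certain.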

An exact benchmark citation will be added once verified; this page names no vendors or models.

The three approaches

Condensed from Runtime Snapshots 16. Three ways to give an agent a page, compared on the dimensions that decide real-world use.

Approach                      | Cost / page          | Authenticated coverage | Element addressing | Drift
Screenshots + vision          | High (vision tokens) | Yes                    | No                 | High
Static HTML / DOM extraction  | Low                  | Partial                | Yes (noisy)        | Medium
Runtime structural perception | Low                  | Yes                    | Yes                | Low

Where E2LLM sits

E2LLM is one implementation of runtime structural perception. It captures the rendered page as a structured representation — see SiFR, the format — after the page has actually loaded, in the session the user is already in.

Further reading