E2LLM — Structured browser perception: what it is and why agents need it

Definition

Structured browser perception is the practice of giving an AI agent a machine-readable representation of a live rendered page — element identity, text, layout, state — instead of pixels or raw markup. The agent reasons over structure it can address, not an image it has to re-interpret or a markup dump it has to wade through.
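To make "structure it can address" concrete, here is a minimal sketch of what such a representation could look like. Every name in it (`Element`, `find_actionable`, the field names, the sample page) is hypothetical and chosen for illustration. It is not the SiFR format or any E2LLM API.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """One addressable element on a rendered page (hypothetical schema)."""
    element_id: str                  # stable identity the agent can refer back to
    role: str                        # e.g. "button", "textbox"
    text: str                        # visible text or label
    box: tuple[int, int, int, int]   # layout: x, y, width, height in pixels
    state: dict[str, bool] = field(default_factory=dict)  # e.g. {"disabled": True}


def find_actionable(elements: list[Element], role: str, text: str) -> list[Element]:
    """Return enabled elements with the given role whose text contains `text`."""
    needle = text.lower()
    return [
        e for e in elements
        if e.role == role
        and needle in e.text.lower()
        and not e.state.get("disabled", False)
    ]


# A made-up login page, already captured as structure rather than pixels.
page = [
    Element("e1", "textbox", "Email", (40, 120, 300, 32)),
    Element("e2", "button", "Sign in", (40, 220, 120, 36)),
    Element("e3", "button", "Sign in with SSO", (40, 270, 180, 36), {"disabled": True}),
]

matches = find_actionable(page, role="button", text="sign in")
print([e.element_id for e in matches])
```

The agent selects a target by identity, role and state, so it does not have to guess coordinates from an image or parse raw markup to find the enabled "Sign in" button.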

Why the category exists: reliability compounds

Published evaluations of vision-based agents show per-action success rates around 80–85 percent on real desktop and web tasks. That sounds workable until the steps compound.

A 20-step workflow at 0.83 per action completes roughly 2 percent of the time, assuming each step succeeds or fails independently:

0.83²⁰ ≈ 0.02

Perception that resolves elements structurally rather than visually attacks the per-action term directly: raise the reliability of each step and the whole workflow stops collapsing. This is an argument about architecture, not about any single product.
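The compounding arithmetic can be checked with a short sketch. It assumes every step succeeds or fails independently with the same probability, which is a simplification, and the function names are illustrative only.

```python
def workflow_success(per_action: float, steps: int) -> float:
    """Probability that all `steps` actions succeed, assuming independence."""
    return per_action ** steps


def required_per_action(target: float, steps: int) -> float:
    """Per-action success rate needed for a `steps`-step workflow to reach `target`."""
    return target ** (1.0 / steps)


# Compare a few per-action rates over the same 20-step workflow.
for p in (0.83, 0.95, 0.99):
    print(f"per-action {p:.2f} -> 20-step success {workflow_success(p, 20):.3f}")

# Invert the question: what does each step need for a 90 percent workflow?
print(f"needed per action for 0.90 over 20 steps: {required_per_action(0.90, 20):.4f}")
```

Run in the other direction, the same formula shows why the per-action term is the one to attack: a 20-step workflow that should finish 90 percent of the time needs each step to succeed more than 99 percent of the time.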

The exact benchmark citation will be added once verified; this page names no vendors or models.

The three approaches

Condensed from Runtime Snapshots 16. Three ways to give an agent a page, compared on the dimensions that decide real-world use.

Approach	Cost / page	Authenticated coverage	Element addressing	Drift
Screenshots + vision	High (vision tokens)	Yes	No	High
Static HTML / DOM extraction	Low	Partial	Yes (noisy)	Medium
Runtime structural perception	Low	Yes	Yes	Low

Where E2LLM sits

E2LLM is one implementation of runtime structural perception. It captures the rendered page as a structured representation — see SiFR, the format — after the page has actually loaded, in the session the user is already in.
