Advancing the limits of autonomous computer use.
State-of-the-art benchmark results across the industry’s most demanding evaluations for browser reasoning and multi-step execution.
Vision-First Paradigm
We build agents that see and act like humans. Most approaches assume software is machine-readable through APIs, protocols, or structured text. That assumption breaks down in real environments.
Software is built for human interaction. Interfaces are visual, dynamic, and inconsistent. Agents must operate on what is actually rendered, not what is theoretically structured.
We take a pure vision approach, where agents interact through the screen, keyboard, and mouse. This allows reliable execution across any environment without requiring integrations or structured access.
There are three primary approaches to how agents interact with software environments:
Protocol-based systems rely on APIs or MCP-style communication, which assume structured access that most legacy systems do not provide.
Text-based systems process HTML or accessibility layers but fail in dynamic or poorly structured interfaces.
Vision-based systems operate directly on the rendered interface, observing and acting on what actually exists.
We adopt the vision-based approach because it matches how software is actually used and avoids structural failure modes.
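To make the interface concrete, here is a minimal sketch of a vision-based control loop in Python. Every name in it (env.capture_screen, policy.decide, the Action vocabulary) is hypothetical rather than a production API; the point is the contract: pixels in, keyboard and mouse events out, with no DOM, API, or accessibility access anywhere in the loop.

    from dataclasses import dataclass

    @dataclass
    class Action:
        kind: str       # "click", "type", or "done" (hypothetical action vocabulary)
        x: int = 0      # screen coordinates, used by "click"
        y: int = 0
        text: str = ""  # keystrokes, used by "type"

    def run_task(env, policy, max_steps: int = 50) -> bool:
        """Drive any GUI using only rendered pixels in, input events out."""
        for _ in range(max_steps):
            frame = env.capture_screen()   # the only observation: raw pixels
            action = policy.decide(frame)  # vision model chooses the next action
            if action.kind == "done":
                return True                # objective reached
            if action.kind == "click":
                env.mouse_click(action.x, action.y)
            elif action.kind == "type":
                env.keyboard_type(action.text)
        return False                       # step budget exhausted

Because the loop never touches an API or a DOM, the same agent runs unchanged against a legacy desktop app, a canvas-rendered dashboard, or a modern web page.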
Full-Stack Execution Infrastructure
Vision-based systems fail without infrastructure designed for execution speed and reliability.
We solve this by owning and optimizing the full agent stack.
We build our own environment infrastructure, including the browser and execution layer, specifically optimized for agent interaction.
We reduce latency at every stage of the perception-action loop: capturing the screen, deciding the next action, and executing it.
We design systems for long-horizon tasks, where agents must operate across many steps without failure.
This enables fast, reliable execution across real-world systems.
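As an illustration of where a step's time goes, the sketch below instruments one cycle of the loop above. The env and policy interfaces remain hypothetical, and real profiling is considerably more involved; the three stages are what matter.

    import time

    def timed_step(env, policy):
        # Time the three stages of one perception-action cycle
        # (hypothetical interfaces, matching the earlier sketch).
        t0 = time.perf_counter()
        frame = env.capture_screen()    # perception: grab the rendered screen
        t1 = time.perf_counter()
        action = policy.decide(frame)   # reasoning: model inference
        t2 = time.perf_counter()
        env.perform(action)             # actuation: emit keyboard/mouse events
        t3 = time.perf_counter()
        return {"perceive": t1 - t0, "decide": t2 - t1, "act": t3 - t2}

Owning the browser and execution layer means each of these stages can be measured and optimized rather than treated as a black box.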


WebVoyager: evaluates real-world, multi-step browser task execution
WebVoyager is one of the most widely recognized benchmarks for autonomous web agents, measuring performance on more than 600 multi-step tasks across 15 live websites. Agents must reason, navigate changing interfaces, and complete open-ended objectives under real web conditions.
Pinetree achieves 99%, reaching near-production reliability for complex browser-based execution.

Online-Mind2Web: measures generalization across dynamic web environments and open-ended tasks
Online-Mind2Web is a large-scale benchmark for measuring how effectively agents understand intent, interpret live web interfaces, and execute real online tasks across diverse environments. Unlike static evaluations, it tests generalization under changing layouts and open-ended objectives.
Pinetree Agent achieves 90%, outperforming competing systems and demonstrating strong robustness in real-world web interaction. Results at this level suggest transferability well beyond narrow scripted tasks.

Halluminate Westworld: tests true generalization
We evaluated performance in an environment not present in any training set, ensuring results reflect zero-shot execution rather than memorization. Pinetree-CUA achieved 93%, outperforming all agents, including Yutori Navigator, which was directly trained on this environment using reinforcement learning.
Despite zero prior exposure, our system performed better.
This demonstrates that general reasoning infrastructure transfers more effectively than environment-specific training. Enterprise workflows are inherently data-sparse and environment-dependent: there is no scalable dataset of how humans operate internal tools.
Systems that rely on end-to-end training either fail to generalize or require infeasible amounts of data. We address this by separating reasoning from execution. A general reasoning engine handles planning, while a specialized world model learns mechanistic interaction within environments.
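In code, that separation might look like the skeleton below. The class names and the plan/execute split are our illustration of the idea under stated assumptions, not the production architecture.

    class GeneralPlanner:
        """Environment-agnostic reasoning: turns a goal into abstract steps."""
        def plan(self, goal: str, frame) -> list[str]:
            # e.g. ["open the invoices tab", "filter to last month", "export as CSV"]
            raise NotImplementedError

    class WorldModel:
        """Environment-specific execution: grounds one abstract step into UI actions."""
        def execute(self, step: str, frame):
            # learns where to click and what to type for this particular interface;
            # returns the screen state after acting
            raise NotImplementedError

    def run(goal: str, planner: GeneralPlanner, world: WorldModel, capture):
        frame = capture()
        for step in planner.plan(goal, frame):  # general reasoning, reused everywhere
            frame = world.execute(step, frame)  # specialized grounding per environment
        return frame

The planner never needs environment-specific data, so only the comparatively small world model has to adapt when a new internal tool appears.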
Result: agents that generalize broadly and specialize precisely.