Demonstrating True Generalization on Hallucinate Westworld

Pinetree Research announces a new benchmark result: Pinetree Agent achieves 93% on Hallucinate Westworld, demonstrating state-of-the-art performance in a fully unseen environment.

Hallucinate Westworld is designed to test true generalization. The environment is intended to be absent from model training data, so for any agent that was not explicitly trained on it, performance reflects zero-shot execution rather than memorization or fine-tuning.

Unlike standard benchmarks, this evaluation removes the shortcut of prior exposure. Agents must adapt to unfamiliar interfaces and complete multi-step workflows in a completely new setting.

At 93%, Pinetree Agent outperforms all compared systems, including Yutori Navigator (86%), Claude Sonnet 4.5 (67.7%), Gemini 2.5 Pro (54%), and OpenAGI Lux (40%). Notably, Yutori Navigator was trained directly on this environment using reinforcement learning, while Pinetree Agent had no prior exposure.

This result highlights a fundamental limitation in current approaches. Most systems rely on environment-specific training or large-scale datasets to perform well. However, enterprise workflows are inherently data-sparse and environment-dependent: no dataset exists, at scale, of how humans interact with internal tools.

As a result, end-to-end training approaches face a tradeoff. They either achieve specialization without generalization, or require infeasible amounts of data across environments.

Pinetree addresses this by separating reasoning from execution.

The system combines a general reasoning engine, built on state-of-the-art foundation models, with a specialized world model that captures mechanistic understanding of specific environments. This world model predicts the consequences of actions and enables forward simulation, allowing the agent to operate reliably even in unfamiliar systems.

This architecture allows Pinetree Agent to generalize across environments while still achieving high execution reliability.
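The separation of reasoning from execution described above can be illustrated with a minimal sketch. This is not Pinetree's implementation: the three-step form workflow, the `PRECONDITIONS` table, and the functions `propose_actions`, `predict`, and `plan` are all hypothetical names chosen for illustration. The sketch assumes a reasoning engine that proposes candidate actions without knowing which are valid, and a world model that predicts the state each action would produce, so the agent can search forward in simulation before acting.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Optional

# Hypothetical state: the set of workflow steps completed so far.
State = FrozenSet[str]

# Hypothetical environment rules: each action and the steps it requires first.
PRECONDITIONS: Dict[str, FrozenSet[str]] = {
    "open_form": frozenset(),
    "fill_fields": frozenset({"open_form"}),
    "submit": frozenset({"open_form", "fill_fields"}),
}


def propose_actions(state: State) -> List[str]:
    """Stand-in for the reasoning engine: list candidate actions
    without knowing which ones the environment will accept."""
    return [action for action in PRECONDITIONS if action not in state]


def predict(state: State, action: str) -> State:
    """Stand-in for the world model: predict the state after an action.
    An action whose preconditions are unmet is predicted to do nothing."""
    if PRECONDITIONS[action] <= state:
        return state | {action}
    return state


def plan(start: State, goal: str, max_depth: int = 5) -> Optional[List[str]]:
    """Breadth-first forward simulation through the world model.
    Returns the shortest predicted action sequence that completes `goal`,
    or None if no sequence of at most `max_depth` actions is found."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if goal in state:
            return path
        if len(path) >= max_depth:
            continue
        for action in propose_actions(state):
            next_state = predict(state, action)
            if next_state not in seen:  # skip no-ops and revisited states
                seen.add(next_state)
                queue.append((next_state, path + [action]))
    return None


print(plan(frozenset(), "submit"))
```

In this toy setting the planner discards actions that the world model predicts will have no effect, such as submitting before the form is open, without ever trying them in the real environment. Supporting a new environment would mean swapping in a different world model while leaving the proposal and search logic unchanged, which is the design choice the architecture above relies on.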

Why Hallucinate Westworld Matters

Hallucinate Westworld represents a more realistic proxy for enterprise deployment than traditional benchmarks.

It evaluates whether agents can:

Operate without prior exposure to an environment
Adapt to new interfaces and workflows
Execute long-horizon, multi-step tasks
Maintain reliability under uncertainty

Strong performance on this benchmark indicates that a system is not relying on memorization and instead exhibits transferable execution capability.

This marks a shift from narrow automation toward true autonomous execution.

Pinetree Agent is designed not just to perform well on known tasks, but to operate reliably in environments it has never seen before.