Preprint 2026

Structured Distillation of Web Agent Capabilities Enables Generalization

Agent-as-Annotators replaces human annotation roles with LLM modules to synthesize web agent training data. A 9B model trained on 2,322 trajectories matches Qwen3.5-27B and nearly doubles the previous best open-weight result.

Xing Han Lù · Siva Reddy

McGill University & Mila

41.5% WebArena SR
+18.2 WorkArena L1 gain
2,322 Training trajectories

Frontier models are powerful but impractical

Large language models can now complete complex web tasks, from filling forms and querying databases to managing content across applications. On WebArena, frontier models exceed a 50% success rate. But they require expensive API access, transmit user data to third-party servers, and cannot be run locally.

Small open-weight models (~9B parameters) are an attractive alternative, but trail frontier models by over 22 percentage points on WebArena. How do we close this gap?

Key insight: We can use a frontier model as a teacher to generate training trajectories for a smaller student, but the synthesis pipeline needs structure. We draw inspiration from how humans create web agent benchmarks.

From human annotation to LLM modules

When creating WebArena's evaluation tasks, human contributors played three distinct roles. Agent-as-Annotators replaces each with an LLM module:

Human Roles → LLM Modules

Role 1: Task Designer
Explores the environment, adopts a perspective, and creates tasks with evaluation criteria.
→ Modules 1 & 2: Persona + Task Generator
Generates diverse user personas, then synthesizes task intents with evaluation hints grounded in real environment state.

Role 2: Annotator
Receives a task intent and executes it, producing a step-by-step interaction trajectory.
→ Module 3: Agent
Receives only the task intent and interacts with a freshly reset environment: no hints, no exploration data.

Role 3: Supervisor
Reviews the results to verify quality and task completion.
→ Module 4: Judge
Evaluates each trajectory using the interaction record and evaluation hints for reliable success assessment.

Only trajectories judged successful are used to train the student model.
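The four-module loop above can be sketched in a few lines. Everything below is an illustrative stub under assumed names (`generate_personas`, `generate_task`, `run_agent`, `judge` and the hint/step contents are invented for this sketch; in the actual pipeline each module prompts a frontier LLM):

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    intent: str
    steps: list = field(default_factory=list)
    success: bool = False

# Module 1: persona generator (stubbed with a fixed list)
def generate_personas():
    return ["store administrator", "first-time shopper", "forum moderator"]

# Module 2: task generator -- an intent plus evaluation hints
# grounded in environment state (hints here are made up)
def generate_task(persona):
    intent = f"As a {persona}, find last month's best-selling product"
    hints = {"expected_page": "/admin/reports/bestsellers"}
    return intent, hints

# Module 3: agent -- sees only the intent and acts in a fresh environment
def run_agent(intent):
    return Trajectory(intent=intent, steps=["goto", "click", "read"])

# Module 4: judge -- scores the trajectory from its record and the hints
def judge(trajectory, hints):
    return "read" in trajectory.steps  # stand-in for an LLM judgment

def synthesize():
    kept = []
    for persona in generate_personas():
        intent, hints = generate_task(persona)
        trajectory = run_agent(intent)
        trajectory.success = judge(trajectory, hints)
        if trajectory.success:
            # only judged-successful trajectories train the student
            kept.append(trajectory)
    return kept
```

The key structural point the sketch preserves: the agent never sees the hints or the task generator's exploration data, and the judge's verdict is the sole filter on what reaches training.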

A 9B model that matches 27B

A3-Qwen3.5-9B, fine-tuned on just 2,322 filtered trajectories from A3-Synth, achieves 41.5% on WebArena, matching the 3× larger Qwen3.5-27B, approaching Gemini 3.1 Flash Lite (42.3%), and nearly doubling the previous best open-weight SFT result (Go-Browse, 21.7%).

WebArena (success rate %)

Gemini 3 Pro (teacher)        51.2%
Gemini 3.1 Flash Lite         42.3%
A3-Qwen3.5-9B (ours)          41.5%
Qwen3.5-27B (3× larger)       41.5%
Qwen3.5-9B (base)             31.0%
Go-Browse (prev. best SFT)    21.7%

Transfer to unseen benchmarks

The model was trained only on WebArena environments, yet its capabilities transfer broadly: all four benchmarks below are entirely out-of-distribution.

Out-of-distribution results (success rate %); gains in parentheses are over the Qwen3.5-9B base:

Benchmark        Best prev. open-weight (up to 72B)   A3-Qwen3.5-9B (ours)   Qwen3.5-27B (3× larger)   Gemini 3 Pro (teacher)
VisualWebArena   Llama-3-70B: 16.7%                   33.9% (+5.4)           37.4%                     49.0%
WorkArena        GPT-oss-20B: 38.5%                   51.5% (+18.2)          57.0%                     79.7%
WorkArena++      AgentTrek: 3.0%                      9.7% (+7.5)            18.9%                     41.6%
MiniWoB          OrbyAgent-72B: 64.2%                 69.0% (+5.8)           70.9%                     74.7%

Try it yourself

Serve the model

```bash
# Install
pip install agent-as-annotators

# Serve with vLLM
vllm serve McGill-NLP/A3-Qwen3.5-9B \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --enforce-eager \
  --dtype bfloat16
```

Run evaluation

```bash
# Evaluate on WebArena
a3-eval \
  --benchmark webarena_test \
  --model A3-qwen3.5-9b
```
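The success rates reported above are the usual fraction of tasks the evaluator marks as solved. A minimal sketch of that aggregation (the result-record format is an assumption for illustration, not the actual `a3-eval` output schema):

```python
def success_rate(results):
    """Percentage of tasks marked successful."""
    if not results:
        return 0.0
    return 100.0 * sum(r["success"] for r in results) / len(results)

# WebArena has 812 tasks; 337 successes comes out to roughly 41.5%
results = [{"success": True}] * 337 + [{"success": False}] * 475
print(round(success_rate(results), 1))
```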

Download training data

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "McGill-NLP/A3-Synth",
    repo_type="dataset",
    local_dir="./data",
)
```
