Agent-as-Annotators replaces human annotation roles with LLM modules to synthesize web agent training data. A 9B model trained on 2,322 trajectories matches Qwen3.5-27B and nearly doubles the previous best open-weight result.
McGill University & Mila
Large language models can now complete complex web tasks, from filling forms and querying databases to managing content across applications. On WebArena, frontier models exceed a 50% success rate. But they require expensive API access, transmit user data to third-party servers, and cannot be run locally.
Small open-weight models (~9B parameters) are an attractive alternative, but trail frontier models by over 22 percentage points on WebArena. How do we close this gap?
Key insight: We can use a frontier model as a teacher to generate training trajectories for a smaller student, but the synthesis pipeline needs structure. We draw inspiration from how humans create web agent benchmarks.
When creating WebArena's evaluation tasks, human contributors played three distinct roles. Agent-as-Annotators replaces each with an LLM module:
1. Task creation. Human: explores the environment, adopts a perspective, and creates tasks with evaluation criteria. LLM module: generates diverse user personas, then synthesizes task intents with evaluation hints grounded in real environment state.
2. Task execution. Human: receives a task intent and executes it, producing a step-by-step interaction trajectory. LLM module: receives only the task intent and interacts with a freshly reset environment, with no hints and no exploration data.
3. Verification. Human: reviews the results to verify quality and task completion. LLM module: evaluates each trajectory using the interaction record and evaluation hints for reliable success assessment.
Only trajectories judged successful are used to train the student model.
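The three modules above compose into a simple synthesize-then-filter loop. The sketch below is a toy illustration of that loop; all names, signatures, and the stand-in logic are hypothetical and do not reflect the actual A3 API.

```python
# Toy sketch of the three-module synthesis loop. Names and logic are
# hypothetical illustrations, not the real A3 implementation.
from dataclasses import dataclass, field

@dataclass
class Task:
    intent: str       # what the user wants done
    eval_hints: str   # grounded in explored environment state

@dataclass
class Trajectory:
    task: Task
    steps: list = field(default_factory=list)  # (observation, action) pairs

def propose_tasks(n):
    # Module 1 (stand-in): the real proposer explores the site under
    # sampled personas and writes intents with grounded evaluation hints.
    return [Task(intent=f"toy task {i}", eval_hints=f"hint {i}") for i in range(n)]

def execute(task):
    # Module 2 (stand-in): the real executor sees only task.intent and
    # acts in a freshly reset environment; no hints, no exploration data.
    traj = Trajectory(task)
    last_action = "stop" if int(task.intent[-1]) % 2 == 0 else "click"
    traj.steps.append(("page observation", last_action))
    return traj

def judge(traj):
    # Module 3 (stand-in): the real judge reads the interaction record
    # together with traj.task.eval_hints to assess success.
    return traj.steps[-1][1] == "stop"

def synthesize(n_tasks):
    # Keep only trajectories the judge marks successful (the SFT filter).
    return [t for t in (execute(task) for task in propose_tasks(n_tasks)) if judge(t)]

print(len(synthesize(4)))  # toy judge passes tasks 0 and 2, so prints 2
```

The filtering step in `synthesize` mirrors the sentence below: only judged-successful trajectories become training data.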
A3-Qwen3.5-9B, fine-tuned on just 2,322 filtered trajectories from A3-Synth, achieves 41.5% on WebArena, matching the 3× larger Qwen3.5-27B, approaching Gemini 3.1 Flash Lite (42.3%), and nearly doubling the previous best open-weight SFT result (Go-Browse, 21.7%).
The model was trained only on WebArena environments, yet capabilities transfer broadly. All four benchmarks below are completely out-of-distribution.
# Install
pip install agent-as-annotators
# Serve with vLLM
vllm serve McGill-NLP/A3-Qwen3.5-9B \
--tensor-parallel-size 2 \
--max-model-len 65536 \
--enforce-eager \
--dtype bfloat16
# Evaluate on WebArena
a3-eval \
--benchmark webarena_test \
--model A3-qwen3.5-9b
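Once the server is running, the model can also be queried directly over vLLM's OpenAI-compatible API (port 8000 by default). A minimal sketch with only the standard library; the prompt is a placeholder:

```python
# Build a chat-completion request against the local vLLM server started above.
# The endpoint assumes vLLM's default host/port; the prompt is a placeholder.
import json
import urllib.request

payload = {
    "model": "McGill-NLP/A3-Qwen3.5-9B",
    "messages": [{"role": "user", "content": "Navigate to the orders page."}],
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment to send the request once the server is up:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```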
# Download the A3-Synth training data from the Hugging Face Hub
from huggingface_hub import snapshot_download
snapshot_download(
"McGill-NLP/A3-Synth",
repo_type="dataset",
local_dir="./data"
)
Paper: full paper with framework details, ablations, and analysis
Code: training, evaluation, and data generation pipeline
A3-Synth Dataset: 16,353 training examples from 2,322 successful trajectories
A3-Qwen3.5-9B: best model; 41.5% on WebArena, matching Qwen3.5-27B
A3-Qwen3.5-4B: compact variant; 35.2% on WebArena
PyPI Package: pip install agent-as-annotators