Preprint 2026

Structured Distillation of Web Agent Capabilities Enables Generalization

Agent-as-Annotators replaces human annotation roles with LLM modules to synthesize web agent training data. A 9B model trained on 2,322 trajectories matches Qwen3.5-27B and nearly doubles the previous best open-weight result.

Xing Han Lù · Siva Reddy

McGill University & Mila

41.5% WebArena SR
+18.2 WorkArena L1 gain
2,322 Training trajectories

Frontier models are powerful but impractical

Large language models can now complete complex web tasks, from filling forms and querying databases to managing content across applications. On WebArena, frontier models exceed a 50% success rate. But they require expensive API access, transmit user data to third-party servers, and cannot be run locally.

Small open-weight models (~9B parameters) are an attractive alternative, but trail frontier models by over 22 percentage points on WebArena. How do we close this gap?

Key insight: We can use a frontier model as a teacher to generate training trajectories for a smaller student, but the synthesis pipeline needs structure. We draw inspiration from how humans create web agent benchmarks.

From human annotation to LLM modules

When creating WebArena's evaluation tasks, human contributors played three distinct roles. Agent-as-Annotators replaces each with an LLM module:

Human Roles → LLM Modules

Role 1: Task Designer
Explores the environment, adopts a perspective, and creates tasks with evaluation criteria.
→ Modules 1 & 2: Persona + Task Generator
Generates diverse user personas, then synthesizes task intents with evaluation hints grounded in real environment state.

Role 2: Annotator
Receives a task intent and executes it, producing a step-by-step interaction trajectory.
→ Module 3: Agent
Receives only the task intent and interacts with a freshly reset environment: no hints, no exploration data.

Role 3: Supervisor
Reviews the results to verify quality and task completion.
→ Module 4: Judge
Evaluates each trajectory using the interaction record and evaluation hints for reliable success assessment.

Only trajectories judged successful are used to train the student model.
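The four-module loop above can be sketched in a few lines. Everything below is an illustrative stub under assumed names (`generate_personas`, `generate_task`, `run_agent`, `judge` and the hint/step contents are invented for this sketch; in the actual pipeline each module prompts a frontier LLM):

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    intent: str
    steps: list = field(default_factory=list)
    success: bool = False

# Module 1: persona generator (stubbed with a fixed list)
def generate_personas():
    return ["store administrator", "first-time shopper", "forum moderator"]

# Module 2: task generator -- an intent plus evaluation hints
# grounded in environment state (hints here are made up)
def generate_task(persona):
    intent = f"As a {persona}, find last month's best-selling product"
    hints = {"expected_page": "/admin/reports/bestsellers"}
    return intent, hints

# Module 3: agent -- sees only the intent and acts in a fresh environment
def run_agent(intent):
    return Trajectory(intent=intent, steps=["goto", "click", "read"])

# Module 4: judge -- scores the trajectory from its record and the hints
def judge(trajectory, hints):
    return "read" in trajectory.steps  # stand-in for an LLM judgment

def synthesize():
    kept = []
    for persona in generate_personas():
        intent, hints = generate_task(persona)
        trajectory = run_agent(intent)
        trajectory.success = judge(trajectory, hints)
        if trajectory.success:
            # only judged-successful trajectories train the student
            kept.append(trajectory)
    return kept
```

The key structural point the sketch preserves: the agent never sees the hints or the task generator's exploration data, and the judge's verdict is the sole filter on what reaches training.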

A 9B model that matches 27B

A3-Qwen3.5-9B, fine-tuned on just 2,322 filtered trajectories from A3-Synth, achieves 41.5% on WebArena, matching the 3× larger Qwen3.5-27B, approaching Gemini 3.1 Flash Lite (42.3%), and nearly doubling the previous best open-weight SFT result (Go-Browse, 21.7%).

WebArena (success rate %)

Gemini 3 Pro (teacher)        51.2%
Gemini 3.1 Flash Lite         42.3%
A3-Qwen3.5-9B (ours)          41.5%
Qwen3.5-27B (3× larger)       41.5%
Qwen3.5-9B (base)             31.0%
Go-Browse (prev. best SFT)    21.7%

Transfer to unseen benchmarks

The model was trained only on WebArena environments, yet its capabilities transfer broadly: all four benchmarks below are entirely out-of-distribution.

Out-of-distribution results (success rate %); gains in parentheses are over the Qwen3.5-9B base:

Benchmark        Best prev. open-weight (up to 72B)   A3-Qwen3.5-9B (ours)   Qwen3.5-27B (3× larger)   Gemini 3 Pro (teacher)
VisualWebArena   Llama-3-70B: 16.7%                   33.9% (+5.4)           37.4%                     49.0%
WorkArena        GPT-oss-20B: 38.5%                   51.5% (+18.2)          57.0%                     79.7%
WorkArena++      AgentTrek: 3.0%                      9.7% (+7.5)            18.9%                     41.6%
MiniWoB          OrbyAgent-72B: 64.2%                 69.0% (+5.8)           70.9%                     74.7%

Try it yourself

Serve the model

```bash
# Install
pip install agent-as-annotators

# Serve with vLLM
vllm serve McGill-NLP/A3-Qwen3.5-9B \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --enforce-eager \
  --dtype bfloat16
```

Run evaluation

```bash
# Evaluate on WebArena
a3-eval \
  --benchmark webarena_test \
  --model A3-qwen3.5-9b
```
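The success rates reported above are the usual fraction of tasks the evaluator marks as solved. A minimal sketch of that aggregation (the result-record format is an assumption for illustration, not the actual `a3-eval` output schema):

```python
def success_rate(results):
    """Percentage of tasks marked successful."""
    if not results:
        return 0.0
    return 100.0 * sum(r["success"] for r in results) / len(results)

# WebArena has 812 tasks; 337 successes comes out to roughly 41.5%
results = [{"success": True}] * 337 + [{"success": False}] * 475
print(round(success_rate(results), 1))
```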

Download training data

```python
from huggingface_hub import snapshot_download

snapshot_download(
    "McGill-NLP/A3-Synth",
    repo_type="dataset",
    local_dir="./data",
)
```
