WebPII: Benchmarking Visual PII Detection for Computer-Use Agents

Anonymous Authors

Abstract

Computer-use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted inference exposes user screenshots. Detecting personally identifiable information (PII) in web screenshots is critical for privacy-preserving deployment, but no public benchmark exists for this task. We introduce WebPII, a fine-grained synthetic benchmark of 44,865 annotated e-commerce UI images designed with three key properties: an extended PII taxonomy that includes transaction-level identifiers enabling reidentification, anticipatory detection for partially filled forms where users are actively entering data, and scalable generation through VLM-based UI reproduction. Experiments validate that these design choices improve layout-invariant detection across diverse interfaces and generalization to held-out page types. To demonstrate practical utility, we train WebRedact, which more than doubles the accuracy of the best text-extraction baseline (0.753 vs. 0.357 mAP@50) at real-time CPU latency (20 ms). We release the dataset and model to support privacy-preserving computer-use research.

Dataset Composition and Statistics

The dataset comprises 44,865 images with 993,461 annotations, spanning 10 e-commerce websites and 19 page types.
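
To make these counts concrete, here is a minimal loading sketch, assuming the annotations ship as a single COCO-style JSON file (the file name webpii_annotations.json and the field names are assumptions, not the released format):

import json

# Assumed COCO-style annotation file; the actual release schema may differ.
with open("webpii_annotations.json") as f:
    coco = json.load(f)

print(f"images:      {len(coco['images']):,}")       # expected 44,865
print(f"annotations: {len(coco['annotations']):,}")  # expected 993,461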

Form-Fill Variants

Distribution of images across form-fill variants, which enables anticipatory detection training; a selection sketch follows the list.

  • Empty (13.4%): 6,012 images
  • Fully-filled (22.7%): 10,200 images
  • Partial-fill (63.9%): 28,653 images
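
Anticipatory detection hinges on the partial-fill subset. A minimal selection sketch, continuing with the coco dict loaded above and assuming each image record carries a fill_state field with values "empty", "partial", or "full" (the field name and values are assumptions):

# Assumed per-image metadata field; the released schema may name this differently.
partial = [img for img in coco["images"] if img.get("fill_state") == "partial"]
print(f"partial-fill images: {len(partial):,}")  # expected 28,653 (63.9%)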

Annotation Density

Median of 19 boxes per image (mean 22.1), with per-image counts ranging from 0 to 145; a short counting sketch follows the class breakdown.

  • PII classes (52.4%): Address (25.7%), contact info, names
  • Non-PII classes (47.6%): Order info (23.0%), product text (18.8%)
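
The density figures can be checked with a Counter over image ids, again assuming the COCO-style schema from the loading sketch above (images with no annotations must be counted as zeros to match the 0-to-145 range):

from collections import Counter
from statistics import median

# Boxes per image, including images with zero annotations.
per_image = Counter(ann["image_id"] for ann in coco["annotations"])
counts = [per_image.get(img["id"], 0) for img in coco["images"]]
print(f"median: {median(counts)}, mean: {sum(counts) / len(counts):.1f}")  # expected 19, 22.1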

HTML Element Types

Most annotations target rendered text, where PII appears on confirmation and review pages.

  • Rendered text (78.1%): 775,772 annotations
  • Input fields (13.6%): 135,136 annotations
  • Images (8.3%): 82,553 annotations

Baseline Results

Visual detection substantially outperforms text-based approaches on held-out Amazon layouts.

Method                     mAP@50   Latency    Real-time?
OCR + Presidio             0.183    1,300 ms   No
LayoutLMv3 + GPT-4o-mini   0.357    2,900 ms   No
WebRedact (ours)           0.753    20 ms      Yes (30 FPS)
WebRedact-large (ours)     0.842    312 ms     Near real-time (3 FPS)

WebRedact achieves 2.1× the accuracy of the strongest text-based baseline (LayoutLMv3 + GPT-4o-mini; 0.753 vs. 0.357 mAP@50) with 145× faster inference, enabling real-time privacy protection at 30 FPS. WebRedact-large further improves accuracy to 0.842 mAP@50 at a near-real-time 3 FPS.
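
As an illustration of the real-time use case, here is a minimal redaction sketch, assuming WebRedact is exported as an Ultralytics-style YOLO checkpoint (the checkpoint name, PII class labels, and runtime choice are assumptions, not the released interface):

from PIL import Image, ImageDraw
from ultralytics import YOLO  # assumed packaging; the released model may use a different runtime

# Hypothetical checkpoint and label names, for illustration only.
model = YOLO("webredact.pt")
PII_CLASSES = {"address", "contact_info", "name"}

def redact(src: str, dst: str) -> None:
    """Detect PII boxes in a screenshot and paint them out."""
    img = Image.open(src).convert("RGB")
    draw = ImageDraw.Draw(img)
    result = model(img, verbose=False)[0]
    for box in result.boxes:
        if result.names[int(box.cls)] in PII_CLASSES:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            draw.rectangle((x1, y1, x2, y2), fill="black")
    img.save(dst)

redact("checkout.png", "checkout_redacted.png")

At the reported 20 ms per frame, a loop like this keeps pace with a 30 FPS screen-capture stream.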

BibTeX

@inproceedings{anonymous2026webpii,
  title={WebPII: Benchmarking Visual PII Detection for Computer-Use Agents},
  author={Anonymous Authors},
  booktitle={International Conference on Machine Learning},
  year={2026}
}