WebPII: Benchmarking Visual PII Detection for Computer-Use Agents

Anonymous Authors

Abstract

Computer-use agents create new privacy risks: training data collected from real websites inevitably contains sensitive information, and cloud-hosted inference exposes user screenshots. Detecting personally identifiable information (PII) in web screenshots is critical for privacy-preserving deployment, but no public benchmark exists for this task. We introduce WebPII, a fine-grained synthetic benchmark of 44,865 annotated e-commerce UI images designed with three key properties: an extended PII taxonomy that includes transaction-level identifiers enabling reidentification, anticipatory detection for partially filled forms where users are actively entering data, and scalable generation through VLM-based UI reproduction. Experiments validate that these design choices improve layout-invariant detection across diverse interfaces and generalization to held-out page types. To demonstrate practical utility, we train WebRedact, which more than doubles the accuracy of the best text-extraction baseline (0.753 vs. 0.357 mAP@50) at real-time CPU latency (20 ms). We release the dataset and model to support privacy-preserving computer-use research.

Dataset Composition and Statistics

The dataset comprises 44,865 images with 993,461 annotations, spanning 10 e-commerce websites and 19 page types.
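
To make these counts concrete, here is a minimal loading sketch, assuming the annotations ship as a single COCO-style JSON file (the file name webpii_annotations.json and the field names are assumptions, not the released format):

import json

# Assumed COCO-style annotation file; the actual release schema may differ.
with open("webpii_annotations.json") as f:
    coco = json.load(f)

print(f"images:      {len(coco['images']):,}")       # expected 44,865
print(f"annotations: {len(coco['annotations']):,}")  # expected 993,461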

Form-Fill Variants

Distribution of images across form-fill variants, which enables anticipatory detection training; a selection sketch follows the list.

  • Empty (13.4%): 6,012 images
  • Fully-filled (22.7%): 10,200 images
  • Partial-fill (63.9%): 28,653 images
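
Anticipatory detection hinges on the partial-fill subset. A minimal selection sketch, continuing with the coco dict loaded above and assuming each image record carries a fill_state field with values "empty", "partial", or "full" (the field name and values are assumptions):

# Assumed per-image metadata field; the released schema may name this differently.
partial = [img for img in coco["images"] if img.get("fill_state") == "partial"]
print(f"partial-fill images: {len(partial):,}")  # expected 28,653 (63.9%)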

Annotation Density

Median of 19 boxes per image (mean 22.1), with per-image counts ranging from 0 to 145; a short counting sketch follows the class breakdown.

  • PII classes (52.4%): Address (25.7%), contact info, names
  • Non-PII classes (47.6%): Order info (23.0%), product text (18.8%)
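
The density figures can be checked with a Counter over image ids, again assuming the COCO-style schema from the loading sketch above (images with no annotations must be counted as zeros to match the 0-to-145 range):

from collections import Counter
from statistics import median

# Boxes per image, including images with zero annotations.
per_image = Counter(ann["image_id"] for ann in coco["annotations"])
counts = [per_image.get(img["id"], 0) for img in coco["images"]]
print(f"median: {median(counts)}, mean: {sum(counts) / len(counts):.1f}")  # expected 19, 22.1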

HTML Element Types

Most annotations target rendered text, where PII appears on confirmation and review pages.

  • Rendered text (78.1%): 775,772 annotations
  • Input fields (13.6%): 135,136 annotations
  • Images (8.3%): 82,553 annotations

Baseline Results

Visual detection substantially outperforms text-based approaches on held-out Amazon layouts.

Method                     mAP@50   Latency    Real-time?
OCR + Presidio             0.183    1,300 ms   No
LayoutLMv3 + GPT-4o-mini   0.357    2,900 ms   No
WebRedact (ours)           0.753    20 ms      Yes (30 FPS)
WebRedact-large (ours)     0.842    312 ms     Near real-time (3 FPS)

WebRedact achieves 2.1× the accuracy of the strongest text-based baseline (LayoutLMv3 + GPT-4o-mini; 0.753 vs. 0.357 mAP@50) with 145× faster inference, enabling real-time privacy protection at 30 FPS. WebRedact-large further improves accuracy to 0.842 mAP@50 at a near-real-time 3 FPS.
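
As an illustration of the real-time use case, here is a minimal redaction sketch, assuming WebRedact is exported as an Ultralytics-style YOLO checkpoint (the checkpoint name, PII class labels, and runtime choice are assumptions, not the released interface):

from PIL import Image, ImageDraw
from ultralytics import YOLO  # assumed packaging; the released model may use a different runtime

# Hypothetical checkpoint and label names, for illustration only.
model = YOLO("webredact.pt")
PII_CLASSES = {"address", "contact_info", "name"}

def redact(src: str, dst: str) -> None:
    """Detect PII boxes in a screenshot and paint them out."""
    img = Image.open(src).convert("RGB")
    draw = ImageDraw.Draw(img)
    result = model(img, verbose=False)[0]
    for box in result.boxes:
        if result.names[int(box.cls)] in PII_CLASSES:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            draw.rectangle((x1, y1, x2, y2), fill="black")
    img.save(dst)

redact("checkout.png", "checkout_redacted.png")

At the reported 20 ms per frame, a loop like this keeps pace with a 30 FPS screen-capture stream.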

BibTeX

@inproceedings{anonymous2026webpii,
  title={WebPII: Benchmarking Visual PII Detection for Computer-Use Agents},
  author={Anonymous Authors},
  booktitle={International Conference on Machine Learning},
  year={2026}
}