We benchmarked our PII detection against Microsoft Presidio — here's what we learned

Most PII detection benchmarks ask the wrong question. They count how many spans a system flagged and treat that number as quality.

It's not. A pipeline that flags every URL, every country mention, every department name produces 70+ false positives the reviewer has to dismiss by hand on a typical contract.

The right metric is how many flagged regions the reviewer would keep versus how many the reviewer would dismiss.

We built a reproducible benchmark against Microsoft Presidio across 14 PDF documents in 7 European languages (en, fr, de, es, it, nl, pt), two documents per locale: one contract and one financial statement. The script and the documents are public: github.com/promptshield-Inc/pii-detection-benchmarks.

The headline numbers

Across all 14 documents combined:

At first glance that's promptShield finding ~62% fewer spans. The naive read: Presidio finds more, so it's better.

But that reading collapses the moment you split by document class.

The split that matters: contracts vs financial statements

On contracts — where the data is real PII about contracting parties — both systems emit roughly the same set of regions.

promptShield's 209 high-signal spans across the 7 contracts in our corpus match Presidio's catches one-for-one on the entities that actually matter: names, addresses, phones, emails, business IDs.

On financial statements, the gap explodes. Take the French etats-financiers_beaumont document alone: Presidio emits 69 LOCATION spans, most of which are mentions of France, Paris, or other country/city names appearing in jurisdictional prose (registered office, governing law, audit firm address).

promptShield emits 22 regions — every one of which is a real entity.

The 414-span gap between the two systems isn't capability. It's noise that Presidio's default install emits and promptShield filters.

What we filter that Presidio doesn't

After building the benchmark we found six clean categories of noise where default Presidio + spaCy small models over-emit.

1. URLs in document chrome

Presidio's UrlRecognizer flags every URL — privacy policy link in the footer, support page URL, contact form. ~8–10 spans per contract that no reviewer wants to redact.

2. Jurisdictional country mentions

"This Agreement is governed by the laws of France" — Presidio flags France as LOCATION. France in that sentence is contract chrome, not PII about the data subject. We built a multi-language filter_jurisdiction_boilerplate that detects 60+ jurisdictional phrases across 7 languages and drops any LOCATION/ORG span that falls inside the matched window.

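As a rough illustration of the jurisdiction filter described above, here is a minimal sketch. It is not the shipped implementation: the `Span` type, the three sample phrases, the `WINDOW_AFTER` size, and the `span_of` demo helper are all assumptions made for the example. Only the function name and the idea of dropping LOCATION/ORG spans inside a matched window come from the post.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # e.g. "LOCATION", "ORG", "PERSON"


# Illustrative sample only; the real list is described as 60+ phrases.
JURISDICTION_PHRASES = [
    r"governed by the laws of",
    r"régi par le droit",
    r"unterliegt dem Recht",
]
PHRASE_RE = re.compile("|".join(JURISDICTION_PHRASES), re.IGNORECASE)
WINDOW_AFTER = 40  # assumed: characters after the phrase that count as boilerplate


def filter_jurisdiction_boilerplate(text: str, spans: list[Span]) -> list[Span]:
    # Each window runs from the start of a matched phrase to a fixed distance past it.
    windows = [(m.start(), m.end() + WINDOW_AFTER) for m in PHRASE_RE.finditer(text)]
    kept = []
    for span in spans:
        inside = any(lo <= span.start and span.end <= hi for lo, hi in windows)
        if span.label in {"LOCATION", "ORG"} and inside:
            continue  # jurisdictional chrome, not PII about a data subject
        kept.append(span)
    return kept


def span_of(text: str, needle: str, label: str) -> Span:
    """Demo helper (hypothetical): locate the first occurrence of `needle`."""
    start = text.index(needle)
    return Span(start, start + len(needle), label)


text = "Marie Curie lives in Lyon. This Agreement is governed by the laws of France."
spans = [
    span_of(text, "Marie Curie", "PERSON"),
    span_of(text, "Lyon", "LOCATION"),
    span_of(text, "France", "LOCATION"),
]
kept = filter_jurisdiction_boilerplate(text, spans)
print([text[s.start:s.end] for s in kept])  # ['Marie Curie', 'Lyon']
```

A fixed character window is the simplest choice; a window that stops at the end of the sentence would be less likely to swallow a genuine location that happens to follow the clause.
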
3. PERSON fragments

BERT and spaCy NER models routinely emit the same person as overlapping spans: Pierre, Dubois, and Pierre Dubois all become separate PERSON entities. In our Italian contract: Presidio emits 26 PERSON spans for 10 distinct named parties. promptShield's filter_span_coalescence drops contained spans and keeps the longest form per page.
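The coalescence step can be sketched as follows. This is a simplified reading of "drops contained spans and keeps the longest form per page": it assumes containment is decided by character offsets within the same page and label, and the `Span` fields are hypothetical.

```python
from typing import NamedTuple


class Span(NamedTuple):
    page: int
    start: int   # character offset on the page, inclusive
    end: int     # character offset on the page, exclusive
    label: str
    text: str


def filter_span_coalescence(spans: list[Span]) -> list[Span]:
    kept: list[Span] = []
    # Visit longest spans first so a container is always seen before its fragments.
    for span in sorted(spans, key=lambda s: s.end - s.start, reverse=True):
        contained = any(
            k.page == span.page
            and k.label == span.label
            and k.start <= span.start
            and span.end <= k.end
            for k in kept
        )
        if not contained:
            kept.append(span)
    # Restore reading order for the reviewer.
    return sorted(kept, key=lambda s: (s.page, s.start))


spans = [
    Span(1, 0, 6, "PERSON", "Pierre"),
    Span(1, 7, 13, "PERSON", "Dubois"),
    Span(1, 0, 13, "PERSON", "Pierre Dubois"),
    Span(2, 0, 6, "PERSON", "Dubois"),
]
result = filter_span_coalescence(spans)
print([(s.page, s.text) for s in result])  # [(1, 'Pierre Dubois'), (2, 'Dubois')]
```

The standalone "Dubois" on page 2 survives because this sketch only compares offsets within a page; matching fragments by text across a page would be a separate design choice.
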

4. Generic-noun ORG

Marketing, Vorstand, Direction Générale, Board of Directors, Comitato di Direzione — Presidio's NER tags these as ORG. They're not organisations the reviewer wants to redact. We ship a 60–90-entry per-language stopword list (generic_orgs_<lang>.txt) that filters them.
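A stopword filter of this kind might look like the sketch below. The file format (one entry per line, `#` for comments), the `(label, text)` span shape, and both function names are assumptions for illustration; only the generic_orgs_<lang>.txt naming comes from the post.

```python
from typing import Iterable


def load_stoplist(lines: Iterable[str]) -> set[str]:
    """Build a case-insensitive stoplist from the lines of a stopword file."""
    entries = (line.strip() for line in lines)
    return {e.casefold() for e in entries if e and not e.startswith("#")}


def filter_generic_orgs(
    spans: list[tuple[str, str]], stoplist: set[str]
) -> list[tuple[str, str]]:
    """Drop ORG spans whose whole text is a generic department or body name."""
    return [
        (label, text)
        for label, text in spans
        if not (label == "ORG" and text.strip().casefold() in stoplist)
    ]


# Stand-in for the contents of a per-language file such as generic_orgs_de.txt.
sample_lines = ["# sample entries", "Vorstand", "Marketing", ""]
stoplist = load_stoplist(sample_lines)

spans = [("ORG", "Vorstand"), ("ORG", "Beaumont SA"), ("PERSON", "Marketing")]
filtered = filter_generic_orgs(spans, stoplist)
print(filtered)  # [('ORG', 'Beaumont SA'), ('PERSON', 'Marketing')]
```

Matching the whole span text, rather than any token inside it, keeps names like "Beaumont Marketing SA" from being dropped by accident.
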

5. Role titles tagged as PERSON

Capitalised role nouns at line start (CEO, Directeur, Geschäftsführer) sometimes get mis-tagged as PERSON by BERT/spaCy. A 50-entry multilingual stopset and a single filter drop them.

6. Standalone country/city in non-personal context

Paris mentioned as the audit firm's address in the contract footer isn't PII about the data subject — the same word in "born in Paris" is. We check the 40-char window before each LOCATION span for personal-context phrases (born in, resident of, née à, geboren in, etc.) and drop the bare ones.
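The window check above can be sketched like this. The 40-character window and the four phrases are taken from the post; the `Span` type, the function name, and the regex are assumptions. The pattern also accepts the masculine "né à", and for simplicity the sketch applies to every LOCATION span rather than only bare country/city names.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str


CONTEXT_WINDOW = 40  # characters inspected before each LOCATION span
# Sample personal-context phrases; a real list would be longer.
PERSONAL_CONTEXT = re.compile(r"born in|resident of|née? à|geboren in", re.IGNORECASE)


def filter_bare_locations(text: str, spans: list[Span]) -> list[Span]:
    kept = []
    for span in spans:
        if span.label == "LOCATION":
            window = text[max(0, span.start - CONTEXT_WINDOW):span.start]
            if not PERSONAL_CONTEXT.search(window):
                continue  # no personal-context phrase nearby: treat as bare
        kept.append(span)
    return kept


text = "Auditor: Cabinet Martin, 12 rue de Rivoli, Paris. Mme Dubois, née à Paris en 1980."
first = text.index("Paris")
second = text.rindex("Paris")
spans = [Span(first, first + 5, "LOCATION"), Span(second, second + 5, "LOCATION")]

kept = filter_bare_locations(text, spans)
print([s.start == second for s in kept])  # [True]
```

Only the second "Paris" survives, because "née à" appears in the 40 characters before it.
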

The honest caveats

We are not claiming our pipeline is universally better. The caveats:

Default install only. A tuned Presidio install (custom recognizers, transformer NER backend, adjusted confidence thresholds) would close most of the precision gap. We measure what most Presidio users actually deploy in the first month — not what a Presidio expert could achieve.
No ground-truth labels. The "both / ours-only / presidio-only" metric is a precision proxy, not a strict F1 measurement. Hand-labelling 14 PDFs across 7 languages is ~40 hours of focused work by a native speaker; we haven't done that yet. PRs welcome.
Two document classes. Contracts + financial statements. Medical records, HR forms, immigration paperwork would produce different gaps — likely smaller, because those documents have richer PII signal.
Synthetic corpus. Real customer documents have OCR errors, multi-column layouts, and table-heavy financial appendices we don't measure here.
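For readers wondering how the "both / ours-only / presidio-only" proxy could be computed without labels, here is one possible sketch. It assumes two spans count as the same region whenever their character ranges overlap, and that "both" is counted from our side; the benchmark script may match spans differently.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def bucket_spans(ours: list[Span], presidio: list[Span]) -> dict[str, int]:
    both = sum(1 for o in ours if any(overlaps(o, p) for p in presidio))
    ours_only = len(ours) - both
    presidio_only = sum(1 for p in presidio if not any(overlaps(p, o) for o in ours))
    return {"both": both, "ours_only": ours_only, "presidio_only": presidio_only}


ours = [Span(0, 13), Span(20, 30)]
presidio = [Span(0, 6), Span(7, 13), Span(40, 46)]
buckets = bucket_spans(ours, presidio)
print(buckets)  # {'both': 1, 'ours_only': 1, 'presidio_only': 1}
```

Note that two Presidio fragments overlapping one of our regions count as a single "both" here, which is why this is a precision proxy and not a substitute for labelled F1.
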

When to use Presidio instead

Presidio is a great toolkit. It's the right choice when:

you're building a custom Python DLP pipeline and want library-level control,
you need to integrate Azure AI Language or AWS Comprehend under one PII abstraction, or
you're processing text streams (not bounded documents).

The benchmark above is specifically about default-install behaviour on bounded PDF documents, which is what desktop document anonymization actually means in practice.

We were inspired by Presidio. We don't claim to replace it.

We claim that for the specific shape of work promptShield does — desktop anonymization of contracts, financial statements, medical records, HR documents — our pipeline produces output that requires less review work, with verifiable evidence.

Run the benchmark

The whole point of this post is the receipt:

```shell
git clone https://github.com/promptshield-Inc/pii-detection-benchmarks
cd pii-detection-benchmarks
pip install -r requirements.txt
python benchmark.py
```

The Presidio script is fully self-contained — you don't need promptShield installed to verify the Presidio side. The published CSVs in results/ are the numbers we cited above.

If you find a methodology issue, open an issue or PR.