Measured 2026-05-27 · promptShield 1.0.14 · Presidio 2.2.362
promptShield vs Microsoft Presidio
Benchmarked head-to-head on 14 PDF documents across 7 languages. Same text extraction, same documents, no asymmetric tuning. Published script and raw CSVs — verify it yourself.
TL;DR
Presidio (default install)
666
spans emitted · high noise level
promptShield 1.0.14
252
spans emitted · same corpus · noise filtered
On contract documents (where the data is real PII), promptShield emits the same 209 high-signal spans as Presidio. The 414-span gap is on financial statements, where raw entity hits are 30–70% noise (audit firm names, registered offices, jurisdictional country mentions).
Feature comparison
| Feature | promptShield | Presidio |
|---|---|---|
| Multi-language regex recognizer (60+ types). Presidio: EN-primary by default | ◦ | |
| Multilingual entity model (Davlan). Presidio: requires custom HuggingFace entity recognizer | ◦ | |
| Optional LLM layer for contextual PII | | |
| US NPI / ABA / DEA checksum validation. Presidio: NPI and MBI included, ABA/DEA partial | ◦ | |
| AU TFN / ABN / ACN / Medicare checksum validation | | |
| BR CPF / CNPJ checksum validation | | |
| CH AHV checksum validation | | |
| MX CLABE checksum validation | | |
| NL BSN / RSIN checksum validation | | |
| French NIR (social security) validation | | |
| Jurisdiction-boilerplate filter ("governed by the laws of X" → X not flagged) | | |
| Intra-page PERSON span coalescence ("Pierre Dubois" kept, bare "Pierre" dropped) | | |
| Generic-department ORG filter | | |
| Role-title PERSON filter | | |
| Finished desktop application with review GUI | | |
| Reversible tokenization (encode / decode). Presidio: separate anonymizer layer | ◦ | |
| 100% offline (no cloud dependency) | | |
| Custom recognizers via API | | |
| Python library for custom integration | | |
| Azure AI Language integration | | |
✓ = included by default · ◦ = available via custom configuration · ✗ = not present
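To make "checksum validation" concrete, here is a minimal sketch of one of the checks listed in the table: the publicly documented Dutch BSN 11-test. It is an illustration only, not promptShield's or Presidio's implementation, and the function name `is_valid_bsn` is hypothetical.

```python
def is_valid_bsn(candidate: str) -> bool:
    """Return True if `candidate` passes the Dutch BSN 11-test ("elfproef").

    Illustrative sketch: expects exactly nine ASCII digits, no separators.
    """
    if len(candidate) != 9 or not (candidate.isascii() and candidate.isdigit()):
        return False
    digits = [int(ch) for ch in candidate]
    if not any(digits):
        # All zeros passes the arithmetic but is not a usable number.
        return False
    # Weights 9..2 for the first eight digits, -1 for the last digit.
    weights = [9, 8, 7, 6, 5, 4, 3, 2, -1]
    total = sum(w * d for w, d in zip(weights, digits))
    return total % 11 == 0


print(is_valid_bsn("111222333"))  # weighted sum is 66, divisible by 11
print(is_valid_bsn("123456789"))  # weighted sum is 147, not divisible by 11
```

A check like this is what lets a detector discard nine-digit strings that merely look like an identifier, which a plain regex cannot do.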
Noise categories Presidio (default install) emits and promptShield suppresses
URLs in contract footers
Privacy policy link, support page URL, generic legal contact. ~8–10 spans per contract, virtually none of which is PII the reviewer wants to redact.
UrlRecognizer
Standalone country / city mentions in jurisdictional prose
"governed by the laws of France", "registered office in Paris", "company incorporated in Germany". In a French financial statement, Presidio emits 73 LOCATION spans; almost all are contract chrome.
filter_jurisdiction_boilerplate + filter_standalone_country
PERSON fragments
The same name appears as three separate spans: "Pierre", "Dubois", "Pierre Dubois". On an Italian contract, Presidio emits 26 PERSON spans for 10 distinct named parties.
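The coalescence idea can be sketched as follows: keep multi-word PERSON mentions and drop single-word PERSON mentions that are a token of one of them. This is a simplified illustration under assumed inputs (a list of `(text, label)` tuples from one page); the function name is hypothetical and the real filter may use different matching rules.

```python
def coalesce_person_spans(spans):
    """Drop PERSON spans that are a single token of a longer PERSON span.

    `spans` is a list of (text, label) tuples taken from one page.
    Repeated full names are kept as-is; this sketch does not deduplicate.
    """
    # Multi-word PERSON mentions are treated as the full names.
    full_names = [t for t, label in spans if label == "PERSON" and " " in t]
    kept = []
    for text, label in spans:
        is_fragment = label == "PERSON" and any(
            text in full.split() for full in full_names
        )
        if not is_fragment:
            kept.append((text, label))
    return kept


page_spans = [
    ("Pierre", "PERSON"),
    ("Dubois", "PERSON"),
    ("Pierre Dubois", "PERSON"),
    ("Paris", "LOCATION"),
]
print(coalesce_person_spans(page_spans))
# [('Pierre Dubois', 'PERSON'), ('Paris', 'LOCATION')]
```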
filter_span_coalescence
Generic department names tagged as ORG
"Marketing", "Vorstand", "Direction Générale", "Board of Directors", "Comitato di Direzione". These aren't organisations the reviewer wants to redact.
filter_generic_org (90-entry stoplist × 7 languages)
Role titles tagged as PERSON
"CEO", "Directeur", "Geschäftsführer" capitalised at line start, sometimes mis-tagged as PERSON by the entity model.
filter_role_titles
When to choose Presidio instead
Presidio is the right choice when:
- You're building a custom Python-based DLP pipeline and want library-level control.
- You need to integrate cloud services (Azure AI Language, AWS Comprehend) under one PII abstraction.
- You want to fine-tune your own recognizers for proprietary entity types.
- You're processing text streams (not bounded documents) and need a service-shaped library.
Both Presidio (by Microsoft) and promptShield are MIT-spirit projects. Use the right tool for the job.
When to choose promptShield
- You're anonymizing bounded PDF/DOCX/XLSX documents on the desktop.
- You need country-specific checksums out of the box (TFN, CPF, NPI, IBAN, etc.) without writing them yourself.
- You want a finished GUI workflow (review, redact, tokenize, export) rather than a library to integrate.
- Your customers can't send documents to a cloud service for compliance reasons.
Reproduce it yourself
Every number on this page comes from the script published in the public repo. No number is sourced from an internal, non-verifiable run.
git clone https://github.com/promptshield-Inc/pii-detection-benchmarks
cd pii-detection-benchmarks
pip install -r requirements.txt
python benchmark.py
Outputs: results/presidio_counts_<date>.csv + results/presidio_entities_<date>.csv. The Presidio script is fully standalone: you don't need promptShield installed to verify the Presidio side.
Honest caveats
- Default install only. A tuned Presidio install (custom recognizers + transformer entity backend + tuned confidence thresholds) would close most of the precision gap. We measure what most Presidio users deploy in the first month.
- Synthetic corpus. Real customer documents have richer noise (OCR errors, scanned originals, multi-column layouts) we don't measure here.
- No ground-truth labels. The "both / ours-only / presidio-only" numbers are a precision proxy, not a strict F1 measurement. Hand-labelling 14 PDFs across 7 languages is ~40 hours of work; we haven't done that yet.
- Two document classes. Contracts + financial statements. Medical records, HR forms, immigration paperwork would produce different gaps.
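For readers unfamiliar with the "both / ours-only / presidio-only" proxy mentioned in the caveats, the sketch below shows one way such buckets could be computed from two tools' spans. The matching rule (any character overlap counts as the same hit, labels ignored) is an assumption for illustration; the published script may match spans differently.

```python
def bucket_spans(ours, presidio):
    """Split spans into both / ours-only / presidio-only buckets.

    Each span is a (start, end, label) tuple with character offsets.
    Assumption: two spans are the same hit if their ranges overlap at all.
    "both" is counted from the `ours` side.
    """
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    both = [s for s in ours if any(overlaps(s, p) for p in presidio)]
    ours_only = [s for s in ours if not any(overlaps(s, p) for p in presidio)]
    presidio_only = [p for p in presidio if not any(overlaps(p, s) for s in ours)]
    return both, ours_only, presidio_only
```

Without ground-truth labels these buckets say where the tools agree, not which one is right, which is why the page calls the numbers a proxy.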