PII protection looks completely different when AI is in the loop

For three decades, protecting personal data meant controlling where data was stored and who could reach the database. AI tools dismantled that model in about eighteen months.

This isn't a criticism of AI. It's a description of a structural shift — and understanding it is the difference between data protection that actually works and data protection that looks right on a compliance audit but fails in practice the moment a knowledge worker opens a chat window.

The old model — and why it worked

Traditional PII protection was designed for structured data: CRM records, EHR systems, HR databases, financial ledgers. The threat was unauthorized access — an attacker or unauthorized employee reaching data they weren't supposed to reach.

The response was correspondingly structured:

Tokenization and field-level encryption in databases: replace SSNs and account numbers with opaque tokens at rest. Only authorized services can resolve the token back to the original.
Role-based access controls: limit which employees can query which tables. Audit every read.
Network DLP (Data Loss Prevention): scan outbound traffic for structured PII patterns — SSN format, credit card numbers, IBAN patterns — and block exfiltration before it leaves the corporate perimeter.
Data masking in non-production environments: test with realistic-looking but synthetically generated data that contains no real identities.
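
To make the first two controls concrete, here is a minimal Python sketch of tokenization at rest combined with a role check on resolution. It is an illustration only: the `TokenVault` class, the `tok_` prefix, and the `billing-service` role are hypothetical names, and a real token store would be a hardened, persistent service rather than an in-memory dictionary.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory stand-in for a tokenization service."""

    def __init__(self, authorized_roles):
        self._authorized_roles = set(authorized_roles)
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so one value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # opaque, carries no information
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str, role: str) -> str:
        # Only callers holding an authorized role may resolve a token.
        if role not in self._authorized_roles:
            raise PermissionError(f"role {role!r} may not resolve tokens")
        return self._token_to_value[token]


vault = TokenVault(authorized_roles={"billing-service"})
stored = vault.tokenize("078-05-1120")  # sample SSN-shaped value
resolved = vault.detokenize(stored, role="billing-service")
```

The database only ever holds the value of `stored`. The plaintext reappears only when an authorized caller resolves the token, and nothing in this design governs what that caller does with it afterwards.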

This worked because data lived in bounded, known locations. You could enumerate the systems that held PII, wrap storage and network layers in controls, and those controls held against the threat model they were built for.

What AI tools changed

The threat surface moved — not because AI tools are insecure, but because they introduced a new, frictionless path from controlled storage to external processing.

Consider the workflow: a knowledge worker opens SharePoint, downloads a contract PDF, opens ChatGPT, and pastes the content. They were authorized to access the document, so no access control was violated. They didn't exfiltrate data in the technical sense: they used an approved device on the corporate network. But the document, including every name, address, and financial figure in it, just transited to commercial cloud infrastructure where the organization has no visibility, no governing contract, and no audit log.

The old tooling has no handle on this:

Network DLP detects structured patterns in file uploads and outbound streams, but most AI interactions are HTTPS POST requests sent from a web UI. Unless the proxy decrypts TLS, it sees the destination domain, not the content of the message body.
Database tokenization protects PII at rest. By the time a user has downloaded a PDF and is reading it, the data has already been extracted from the token store, reconstructed, and rendered. The token system did exactly what it was supposed to do — it just wasn't designed for the next step.
Access controls were satisfied the moment the authorized user opened the file. What happens downstream of that access is outside their scope by design.

The old model assumed that controlling access to data meant controlling what happened to data. AI tools broke that assumption cleanly. Authorized access plus a chat window equals data leaving the organization — quietly, legally (from an access-control standpoint), and at volume.

Why cloud-side filtering falls short

One response to this gap is to route all AI traffic through a corporate proxy that intercepts requests, scans for PII, redacts it, and forwards the sanitized version. This approach exists and is better than nothing. But it has structural limits:

It requires SSL/TLS inspection of HTTPS traffic, which is technically invasive, operationally expensive, and increasingly challenged by browser security models and by employee pushback on privacy grounds.
It requires configuration per AI tool. New AI tools appear faster than a security team can write rules for them, so each one is a gap from the moment a user discovers it.
It catches structured PII patterns (SSN format, IBAN) but struggles with free-form names, addresses, and context-dependent identifiers — which are the majority of PII in professional documents.
It can't catch copy-paste from applications outside the proxy scope — mobile devices, personal machines, offline-then-paste workflows.
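
The third limit is easy to reproduce. The sketch below scans a sample message the way a pattern-based filter would, assuming two simplified, illustrative patterns (real DLP rule sets are far larger and usually add checksum validation). The `PATTERNS` table, the `scan` helper, and the sample message are all made up for illustration.

```python
import re

# Simplified illustrative patterns, not a production rule set.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "IBAN": re.compile(
        r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
    ),
}


def scan(text):
    """Return (label, matched_text) for every structured pattern hit."""
    return [
        (label, match.group())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]


message = (
    "Please summarise the dispute between Alice Martin of "
    "14 rue des Acacias, Lyon and her bank. "
    "Her account is FR76 3000 6000 0112 3456 7890 189."
)

findings = scan(message)
```

The IBAN is flagged because it has a recognizable shape. The name and the street address pass through untouched, because nothing about their form distinguishes them from ordinary words.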

Cloud-side filtering treats AI as a channel to be supervised rather than a behavior to be restructured. The underlying action — a user extracting sensitive content from a controlled system and handing it to an external service — still happens. The filtering adds a speed bump in the middle. It doesn't address the root.

The comparison in plain terms

Dimension | Traditional DLP | Client-side anonymization

The right boundary: before the AI, on the machine

The intervention has to happen before the data leaves the user's device — not at the network perimeter, not inside the AI provider's infrastructure after the fact.

This requires a different mental model. Instead of asking "how do we supervise what users send to AI tools?", the question becomes: "how do we give users a version of their documents that is structurally safe to send to any AI tool?"

The key insight is that AI doesn't need identities to do its job. It needs structure, language, and context. "Alice Martin, born 14 March 1987, residing at 14 rue des Acacias, Lyon, client reference A-2091" tells a large language model nothing additional about the contractual clause it's being asked to analyze. Strip the identity, keep the clause. The AI's output is equally useful — and the original data never left the machine.

How promptShield implements this

promptShield runs entirely offline on the user's device — no cloud inference, no network call during anonymization. The detection pipeline runs three layers against each document:

Regex: structured patterns such as IBAN, SSN, email, phone, passport number and other document ID formats across 7 European languages. These run first, with high precision, and don't require a model load.
NER (Named Entity Recognition): spaCy and transformer models running locally detect names, addresses, organisations, and dates in context. Recurring mentions of the same entity are linked across pages, so a name anonymized on page 1 is also replaced on page 4.
Optional LLM pass: for ambiguous cases, a locally running GGUF model resolves borderline detections without sending data anywhere.
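
One way such a layered pipeline could be arranged is sketched below. This is not promptShield's code: the `Span` type, the `detect` function, and the precedence rule (earlier layers win when spans overlap) are assumptions for illustration. The NER layer is replaced by a fixed name list so the example runs without downloading a model, and the optional LLM pass is omitted.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    text: str


Detector = Callable[[str], List[Span]]

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def regex_layer(text: str) -> List[Span]:
    """Layer 1: structured patterns (only email here, for brevity)."""
    return [Span(m.start(), m.end(), "EMAIL", m.group()) for m in EMAIL.finditer(text)]


def make_name_layer(known_names: List[str]) -> Detector:
    """Layer 2 stand-in: a fixed name list in place of a local NER model."""
    def layer(text: str) -> List[Span]:
        spans = []
        for name in known_names:
            for m in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
                spans.append(Span(m.start(), m.end(), "PERSON", m.group()))
        return spans
    return layer


def detect(text: str, layers: List[Detector]) -> List[Span]:
    """Run layers in order; a span is dropped if it overlaps an accepted one."""
    accepted: List[Span] = []
    for layer in layers:
        for span in layer(text):
            overlaps = any(span.start < a.end and a.start < span.end for a in accepted)
            if not overlaps:
                accepted.append(span)
    return sorted(accepted, key=lambda s: s.start)


text = "Contact Alice Martin at alice.martin@example.com for the signed copy."
found = detect(text, [regex_layer, make_name_layer(["Alice Martin", "Martin"])])
labels = [(s.label, s.text) for s in found]
```

Running the high-precision layer first means the looser name matcher cannot carve a partial "martin" out of the email address: that candidate overlaps an already accepted span and is discarded.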

Each detected entity is replaced with a stable, typed token: [PERSON_1], [ADDRESS_2], [IBAN_1]. The mapping between token and original value is stored in a local SQLite database — the token registry — that never leaves the machine.
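
A registry along these lines could be sketched with Python's built-in `sqlite3` module. The schema, the `TokenRegistry` class, and the `anonymize` helper are hypothetical, not promptShield's actual storage format. An in-memory database stands in for the local file, and detection is assumed to have already produced the `(type, value)` pairs.

```python
import sqlite3


class TokenRegistry:
    """Hypothetical local registry mapping typed tokens to original values."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            """
            CREATE TABLE IF NOT EXISTS tokens (
                token       TEXT PRIMARY KEY,
                entity_type TEXT NOT NULL,
                original    TEXT NOT NULL,
                UNIQUE (entity_type, original)
            )
            """
        )

    def token_for(self, entity_type: str, original: str) -> str:
        # Stable: a value that was seen before gets its existing token back.
        row = self.db.execute(
            "SELECT token FROM tokens WHERE entity_type = ? AND original = ?",
            (entity_type, original),
        ).fetchone()
        if row is not None:
            return row[0]
        # Typed: numbering restarts for each entity type.
        (count,) = self.db.execute(
            "SELECT COUNT(*) FROM tokens WHERE entity_type = ?", (entity_type,)
        ).fetchone()
        token = f"[{entity_type}_{count + 1}]"
        self.db.execute(
            "INSERT INTO tokens (token, entity_type, original) VALUES (?, ?, ?)",
            (token, entity_type, original),
        )
        self.db.commit()
        return token


def anonymize(text, detections, registry):
    # Replace longer values first so a short value cannot split a longer one.
    for entity_type, value in sorted(detections, key=lambda d: len(d[1]), reverse=True):
        text = text.replace(value, registry.token_for(entity_type, value))
    return text


registry = TokenRegistry()
safe_1 = anonymize("Alice Martin signed the lease.", [("PERSON", "Alice Martin")], registry)
safe_4 = anonymize(
    "Payment from Bob Keller is guaranteed by Alice Martin.",
    [("PERSON", "Bob Keller"), ("PERSON", "Alice Martin")],
    registry,
)
```

Because the lookup happens before a new token is minted, the same person keeps the same token across pages and across documents that share a registry.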

The user pastes the anonymized document into any AI tool. The AI produces its analysis against the tokens. When the user wants to apply the AI's output back to the original document, promptShield substitutes the real values. The analytical workflow is unchanged. The exposure is structurally eliminated.
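
The restore step can be illustrated in a few lines. This sketch assumes tokens follow the `[TYPE_N]` shape shown earlier and uses a plain dictionary in place of the local registry; `restore` and the sample mapping are hypothetical names, not promptShield's API.

```python
import re

# Matches typed tokens such as [PERSON_1] or [ADDRESS_2].
TOKEN = re.compile(r"\[[A-Z]+_\d+\]")


def restore(text: str, mapping: dict) -> str:
    """Swap known tokens back to their originals; leave unknown tokens as-is."""
    return TOKEN.sub(lambda m: mapping.get(m.group(), m.group()), text)


# Stand-in for the local token registry.
mapping = {
    "[PERSON_1]": "Alice Martin",
    "[ADDRESS_1]": "14 rue des Acacias, Lyon",
}

ai_output = (
    "Notices to [PERSON_1] must be sent to [ADDRESS_1]. "
    "[PERSON_7] is not defined in the contract."
)

restored = restore(ai_output, mapping)
```

Leaving unknown tokens untouched is a deliberate choice in this sketch: if the AI invents or mangles a token, the gap stays visible to the user instead of being silently filled.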

What changes, and what doesn't

Client-side anonymization addresses one specific gap: the frictionless path from a controlled document to an external AI service. It doesn't replace access controls, doesn't replace audit logging, and doesn't remove the need to train employees on appropriate AI use.

What it does change is the risk model for the AI use case specifically. Once you anonymize before pasting, the question "what happens if an employee uses ChatGPT on a client document?" has a different answer. The AI receives structured content — clause logic, financial ratios, contract architecture — with no identifiers attached. There is nothing to expose.

The old perimeter was the document store. The new perimeter has to be the document itself — specifically, the moment it transitions from a bounded system to free text in a user's hands.

That boundary is the one traditional DLP was never designed to hold. It's the one that matters now.