PII handling patterns for analytics platforms.

A practical pattern library for tagging, masking, segregating and auditing PII across modern data stacks. The defaults we set, and why.

Reference architecture · September 2025 · 8 min read

PII handling is a discipline question that pretends to be a tooling question. The tools matter, but they only matter once a team has agreed on what is sensitive, who is allowed to see it, in what form, and under what audit obligation. The patterns below are the defaults we set on day one of a new platform. They are conservative on purpose; loosening them should be an explicit decision, not a default.

Pattern 1: Classify at ingestion, not at consumption

Every column landing in the platform gets a classification tag the moment it lands. The tag is part of the schema, propagates through lineage, and gates everything downstream. The taxonomy we use most often:

Public: published or freely shareable.
Internal: not sensitive but not for external eyes.
Confidential: business-sensitive (financials, strategy).
Restricted-PII: identifies a person directly (name, email, SSN, account).
Restricted-PHI: identifies a person in a health context.
Restricted-CUI: controlled unclassified information (public-sector).

Classification at consumption is too late. By the time a dashboard queries the column, the sensitivity has already propagated through joins, aggregations and possibly downstream sinks you do not control.
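A minimal sketch of how ingestion-time classification might be enforced, assuming a schema registry that refuses any column without a tag. The names `Classification`, `Column`, `register_schema` and `UnclassifiedColumnError` are illustrative, not part of any particular catalog product; only the tag values come from the taxonomy above.

```python
from dataclasses import dataclass
from enum import Enum


class Classification(Enum):
    """The taxonomy listed above, expressed as schema-level tags."""
    PUBLIC = "Public"
    INTERNAL = "Internal"
    CONFIDENTIAL = "Confidential"
    RESTRICTED_PII = "Restricted-PII"
    RESTRICTED_PHI = "Restricted-PHI"
    RESTRICTED_CUI = "Restricted-CUI"


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    classification: Classification


class UnclassifiedColumnError(ValueError):
    """Raised when a column tries to land without a classification tag."""


def register_schema(
    table: str,
    columns: dict[str, str],
    tags: dict[str, Classification],
) -> list[Column]:
    """Hypothetical ingestion gate: reject the table if any column is untagged."""
    missing = sorted(set(columns) - set(tags))
    if missing:
        raise UnclassifiedColumnError(f"{table}: no classification for {missing}")
    return [Column(name, dtype, tags[name]) for name, dtype in columns.items()]
```

Failing the load, rather than defaulting an untagged column to Internal, is a design choice in this sketch: it keeps the conservative default and makes a missing tag visible at the point where it is cheapest to fix.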

Pattern 2: Two stores, not one

We split storage into a raw zone and an analytic zone. The raw zone holds the data as it arrived, with PII intact, and is access-gated to a small number of identities (usually fewer than ten people in even a large organization). The analytic zone, where the rest of the organization works, only ever contains data that has been transformed to the appropriate level of de-identification.

The transformation between the two is itself the audit surface. Every row in the analytic zone has a documented provenance: which raw rows produced it, which transformations were applied, which classification tags were respected.
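One way to make that provenance concrete is to attach it to each analytic row as it is produced. The sketch below is an assumption about shape, not a prescribed format: `Provenance`, `AnalyticRow`, `to_analytic` and the `row_id` field are hypothetical names, and the only transformation shown is dropping restricted columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    raw_row_ids: tuple[str, ...]      # which raw rows produced this row
    transformations: tuple[str, ...]  # which steps were applied, in order
    tags_respected: tuple[str, ...]   # which classification tags were handled


@dataclass(frozen=True)
class AnalyticRow:
    values: dict
    provenance: Provenance


def to_analytic(raw_row: dict, restricted: dict[str, str], step: str) -> AnalyticRow:
    """Drop restricted columns and record how the analytic row was derived.

    `restricted` maps column name to its classification tag (assumed input).
    """
    values = {
        k: v for k, v in raw_row.items()
        if k not in restricted and k != "row_id"
    }
    provenance = Provenance(
        raw_row_ids=(raw_row["row_id"],),
        transformations=(step,),
        tags_respected=tuple(sorted(set(restricted.values()))),
    )
    return AnalyticRow(values, provenance)
```

In a real platform this record would more likely live in a lineage store keyed by row or batch than inline on every row; the point of the sketch is only that the three questions in the paragraph above have a recorded answer.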

Pattern 3: Mask by default, reveal by exception

Columns flagged as restricted are masked or tokenized at the table level. The base view that analysts query returns *** or a deterministic token. A small number of named roles can request a row-level reveal through a separately audited interface. Revealed rows are logged.

This sounds restrictive and, in practice, is not. Ninety-five percent of analytics work (segmentation, cohort sizing, performance tracking) can be done against tokens. The five percent that requires cleartext is the five percent the audit log should reflect.
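The mask-by-default and reveal-by-exception split can be sketched as two functions with very different privileges. Everything here is illustrative: the role names in `REVEAL_ROLES`, the in-memory `audit_log` and the function names are assumptions standing in for a real access-control system and an append-only audit store.

```python
from datetime import datetime, timezone

MASK = "***"
REVEAL_ROLES = {"privacy-officer", "fraud-investigator"}  # hypothetical role names
audit_log: list[dict] = []  # stand-in for a separately stored audit trail


def masked_view(row: dict, restricted: set[str]) -> dict:
    """What the base analyst view returns: restricted columns are masked."""
    return {k: (MASK if k in restricted else v) for k, v in row.items()}


def reveal(row: dict, restricted: set[str], user: str, role: str, reason: str) -> dict:
    """Row-level reveal for named roles only; every reveal is logged first."""
    if role not in REVEAL_ROLES:
        raise PermissionError(f"role {role!r} may not reveal restricted columns")
    audit_log.append({
        "user": user,
        "role": role,
        "reason": reason,
        "columns": sorted(c for c in row if c in restricted),
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return dict(row)
```

Writing the audit entry before returning the cleartext means a reveal cannot succeed without leaving a record, which is the property the pattern depends on.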

Pattern 4: Tokenize, do not hash

Plain hashing of PII is a frequent and serious mistake. Unkeyed hashes are brute-forceable for short, low-entropy inputs (emails, phone numbers, account IDs); even a modestly resourced adversary with a wordlist can recover them. Use a keyed tokenization service instead. Tokens are deterministic, so joins still work, but cannot be reversed without the key. The key lives in a secrets store, rotates, and is never embedded in transformation code.
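As a rough illustration of the difference, the sketch below derives deterministic tokens with HMAC-SHA256 from Python's standard library. This is one possible construction, assumed here for the example; a production tokenization service may instead use a vault that maps tokens back to cleartext. The function name `tokenize` and the normalization rule (trim and lowercase) are assumptions.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes) -> str:
    """Keyed, deterministic token for a PII value.

    `key` is assumed to be fetched from a secrets store at runtime by the
    caller; it should never appear as a literal in transformation code.
    """
    # Normalize so that trivially different spellings join to the same token.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
```

The same input and key always yield the same token, so joins across tables work, while an attacker without the key cannot run a wordlist against the tokens. Note that rotating the key changes every token, so rotation in this scheme implies re-tokenizing existing data or keeping a key version alongside each token.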

Pattern 5: Lineage that includes PII flow

Standard lineage shows column-to-column provenance. PII-aware lineage adds the classification tag at every hop and flags any transformation where a column changes class without an explicit de-identification step. That flag is the failsafe for accidental re-introduction of identifiers, the most common way PII leaks into an analytic zone.
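The failsafe described above reduces to a simple check over lineage edges. A minimal sketch, assuming each hop records the classification tag on both sides and whether an explicit de-identification step was applied; `Hop` and `flag_unsafe_hops` are hypothetical names, not the API of any lineage tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    source: str
    source_tag: str
    target: str
    target_tag: str
    deidentified: bool = False  # True only if an explicit de-identification step ran


def flag_unsafe_hops(hops: list[Hop]) -> list[Hop]:
    """Return every hop where the class changed without explicit de-identification."""
    return [
        h for h in hops
        if h.source_tag != h.target_tag and not h.deidentified
    ]
```

For example, a hop from `raw.users.email` tagged Restricted-PII to `analytics.users.contact` tagged Internal with `deidentified=False` would be flagged, while the same hop routed through a tokenization step would pass.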

Pattern 6: Retention and right-to-be-forgotten as code

For GDPR, CCPA and analogous regimes, the platform needs to be able to honor a deletion request across every layer in finite time. We implement this as:

A canonical identity-to-subject mapping table in the raw zone.
A scheduled job that, given a subject ID, locates every downstream artifact derived from that identity and either deletes or re-tokenizes it.
A documented worst-case latency from request to completion (usually 30 days; we have shipped 72 hours).
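The three pieces above can be sketched together as follows. This is a simplified model under stated assumptions: the mapping table is a dict from identity token to subject ID, each downstream artifact is a list of rows carrying a hypothetical `identity_token` field, and only the delete branch is shown (re-tokenization is omitted). The 30-day default mirrors the usual worst case quoted above.

```python
from datetime import date, timedelta


def forget_subject(
    subject_id: str,
    identity_map: dict[str, str],        # identity token -> subject ID
    artifacts: dict[str, list[dict]],    # artifact name -> rows
    requested_on: date,
    max_days: int = 30,
) -> dict:
    """Delete every row derived from the subject's identities, in place."""
    tokens = {tok for tok, subj in identity_map.items() if subj == subject_id}
    deleted: dict[str, int] = {}
    for name, rows in artifacts.items():
        kept = [r for r in rows if r.get("identity_token") not in tokens]
        deleted[name] = len(rows) - len(kept)
        rows[:] = kept  # mutate in place so callers see the deletion
    # Remove the mapping entries last, once nothing downstream refers to them.
    for tok in tokens:
        del identity_map[tok]
    return {
        "subject_id": subject_id,
        "deleted": deleted,
        "due_by": requested_on + timedelta(days=max_days),
    }
```

Returning a per-artifact count and a due date gives the job a report that can be checked against the documented latency, rather than a silent best effort.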

Pattern 7: Access reviewed, not granted

Every access grant to a restricted-class table has a documented justification, a documented owner and an expiration date. Quarterly access reviews are automated: the owner of each table receives a list of everyone with access, why they were granted it, and the last time they used it. Unused access auto-revokes.
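A sketch of the review step, assuming each grant carries the fields named above plus a last-used date. The `Grant` record, the `review` function and the 90-day unused threshold are assumptions for illustration; the source only says that unused access auto-revokes.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Grant:
    user: str
    table: str
    justification: str
    owner: str
    expires: date
    last_used: Optional[date]  # None means the access was never used


def review(
    grants: list[Grant],
    today: date,
    unused_after: timedelta = timedelta(days=90),  # assumed threshold
) -> tuple[list[Grant], list[Grant]]:
    """Split grants into (keep, revoke) based on expiry and recent use."""
    keep: list[Grant] = []
    revoke: list[Grant] = []
    for g in grants:
        expired = g.expires < today
        unused = g.last_used is None or today - g.last_used > unused_after
        (revoke if expired or unused else keep).append(g)
    return keep, revoke
```

The `keep` list, grouped by `owner`, is what each table owner would receive for sign-off; the `revoke` list can be applied without waiting for a human.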

The mistake we see most often

PII discipline gets retrofitted after a near miss, an audit finding, a leak or a regulator's question. By that point the analytic zone is a tangle of joins that have to be unwound one at a time. The cost of building these patterns in from day one is small; the cost of bolting them on later is enormous.

None of these patterns are exotic. They are the defaults a mature platform should ship with, and the defaults that, when they are not there, become the most expensive remediation project in the organization’s data history.
