Protecting privacy at scale: PII de-identification using GCP DLP API
For one of our clients, we found ourselves dealing with a sensitive and increasingly common challenge: handling personal data responsibly. Specifically, we were working with large volumes of CSV files containing personally identifiable information (PII) that needed to be anonymized before we could analyze them with LLMs.
To ensure we met strict privacy requirements while maintaining workflow efficiency, we turned to Google Cloud's Data Loss Prevention (DLP) API. It provided us with an automated, scalable way to inspect and de-identify sensitive data directly in the cloud.
A high-level overview of our system
We designed a data pipeline that integrates smoothly with our existing infrastructure while safeguarding user privacy:
- Upload & quarantine
  - Users upload CSV files via the frontend.
  - These files are temporarily stored in a dedicated quarantine bucket on Google Cloud Storage.
- Triggering de-identification
  - A Cloud Function is triggered upon file upload.
  - This function uses the GCP DLP API to create a DLP job, which scans the file for sensitive information and applies a set of de-identification transformations (a sketch of this function follows the list).
- Safe output for further processing
  - Once the DLP job completes, the anonymized file is moved to a safe bucket.
  - Our internal API is authorized to access this bucket, so we can continue processing the file for downstream tasks such as LLM-based analysis.
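To make the triggering step concrete, here is a minimal sketch of such a Cloud Function, assuming a first-generation GCS-triggered function and the google-cloud-dlp Python client. The project ID, bucket name, template paths, and the `on_upload` handler name are illustrative placeholders, not our actual configuration:

```python
from google.cloud import dlp_v2

PROJECT = "my-project"                # hypothetical project ID
SAFE_BUCKET = "gs://my-safe-bucket/"  # hypothetical output bucket
PARENT = f"projects/{PROJECT}/locations/global"
# Hypothetical template names; their contents are covered in the next section.
INSPECT_TEMPLATE = f"{PARENT}/inspectTemplates/pii-inspect"
DEID_TEMPLATE = f"{PARENT}/deidentifyTemplates/pii-deid"

dlp = dlp_v2.DlpServiceClient()


def on_upload(event, context):
    """Triggered by a 'finalize' event on the quarantine bucket."""
    gcs_uri = f"gs://{event['bucket']}/{event['name']}"

    job = dlp.create_dlp_job(
        request={
            "parent": PARENT,
            "inspect_job": {
                "inspect_template_name": INSPECT_TEMPLATE,
                "storage_config": {
                    "cloud_storage_options": {"file_set": {"url": gcs_uri}},
                },
                # The 'deidentify' action writes an anonymized copy of the
                # file to the safe bucket once inspection completes.
                "actions": [
                    {
                        "deidentify": {
                            "cloud_storage_output": SAFE_BUCKET,
                            "transformation_config": {
                                "deidentify_template": DEID_TEMPLATE,
                            },
                        }
                    }
                ],
            },
        }
    )
    print(f"Created DLP job {job.name} for {gcs_uri}")
```

Note that the DLP job runs asynchronously, so downstream steps need to either poll the job status or react to a Pub/Sub notification action on job completion; that wiring is omitted here.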
DLP API templates: a configurable foundation
The DLP API relies on two key templates:
- Inspection template: Defines which types of PII to search for (e.g., names, email addresses, phone numbers). Google offers a wide range of built-in detectors (infoTypes), and you can add custom detectors based on regex patterns or dictionaries.
- De-identification template: Specifies how identified PII should be anonymized. Options include masking, tokenization, and replacement. Both templates are sketched in the example below.
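As an illustration of this split, the sketch below creates both templates with the Python client. The template IDs and the CUSTOMER_ID regex detector are invented for the example; PERSON_NAME, EMAIL_ADDRESS, and PHONE_NUMBER are built-in infoTypes:

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # hypothetical project

# Inspection template: defines WHAT to look for. Built-in infoTypes can be
# combined with custom detectors, such as this invented customer-ID regex.
inspect_template = dlp.create_inspect_template(
    request={
        "parent": parent,
        "template_id": "pii-inspect",
        "inspect_template": {
            "inspect_config": {
                "info_types": [
                    {"name": "PERSON_NAME"},
                    {"name": "EMAIL_ADDRESS"},
                    {"name": "PHONE_NUMBER"},
                ],
                "custom_info_types": [
                    {
                        "info_type": {"name": "CUSTOMER_ID"},
                        "regex": {"pattern": r"CUST-\d{8}"},
                    }
                ],
                "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
            }
        },
    }
)

# De-identification template: defines HOW each finding is transformed.
deid_template = dlp.create_deidentify_template(
    request={
        "parent": parent,
        "template_id": "pii-deid",
        "deidentify_template": {
            "deidentify_config": {
                "info_type_transformations": {
                    "transformations": [
                        {
                            # Replace names/emails with their infoType label,
                            # e.g. "jane@example.com" -> "[EMAIL_ADDRESS]".
                            "info_types": [
                                {"name": "PERSON_NAME"},
                                {"name": "EMAIL_ADDRESS"},
                            ],
                            "primitive_transformation": {
                                "replace_with_info_type_config": {}
                            },
                        },
                        {
                            # Mask phone numbers and customer IDs with '#'.
                            "info_types": [
                                {"name": "PHONE_NUMBER"},
                                {"name": "CUSTOMER_ID"},
                            ],
                            "primitive_transformation": {
                                "character_mask_config": {
                                    "masking_character": "#"
                                }
                            },
                        },
                    ]
                }
            }
        },
    }
)
print(inspect_template.name, deid_template.name)
```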
This separation of configuration and execution allowed us to iterate quickly on our data privacy strategy without modifying the core codebase.
A major benefit we discovered is that these templates can be fully defined using infrastructure as code tools like Terraform. This allows us to version, review, and deploy changes to our privacy configuration just like the rest of the application infrastructure.
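We won't reproduce our full configuration here, but as a rough sketch of what that looks like: the google Terraform provider exposes the templates as resources. The project and resource names below are placeholders, and the de-identification template follows the same pattern via google_data_loss_prevention_deidentify_template:

```hcl
resource "google_data_loss_prevention_inspect_template" "pii_inspect" {
  parent       = "projects/my-project"
  display_name = "pii-inspect"
  description  = "PII detectors for the CSV anonymization pipeline"

  inspect_config {
    info_types {
      name = "PERSON_NAME"
    }
    info_types {
      name = "EMAIL_ADDRESS"
    }
    info_types {
      name = "PHONE_NUMBER"
    }
    min_likelihood = "POSSIBLE"
  }
}
```

Because the templates live in Terraform, a change to our privacy rules is a reviewable pull request rather than a manual console edit.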
Lessons learned
- File format support is limited: The DLP API currently supports only a narrow range of file types, notably plain text and CSV. For our use case, this was sufficient, but it’s important to note that files like PDFs, Excel spreadsheets, or binary formats are not supported directly.
- Cost and latency trade-offs: Introducing a DLP job between uploading a file and processing it naturally adds latency and compute cost to the pipeline. While this is an acceptable trade-off for the privacy guarantees we gain, it's an important factor to consider when designing workflows that rely on timely data availability.
Final thoughts
By integrating GCP’s DLP API into our data processing pipeline, we were able to automate the anonymization of sensitive data in a scalable, reliable way. It allowed us to process files with the confidence that we were respecting user privacy every step of the way.
As data privacy regulations tighten globally and the volume of data continues to grow, tools like GCP DLP are becoming essential building blocks for responsible software engineering.