Article
pii, redaction, presidio, privacy, gdpr, hipaa, python, nlp
Redact PII from Text with Microsoft Presidio
Use Microsoft Presidio to detect and redact personally identifiable information (PII) in text. This Python-based tool helps you build privacy-preserving applications, and can support compliance work under regulations such as GDPR and HIPAA, by replacing sensitive data with placeholders, masks, or hashes.
beginner, 15 min, 4 steps
The play
- Install Presidio Libraries: First, install the two core Microsoft Presidio packages with pip (`pip install presidio-analyzer presidio-anonymizer`). `presidio-analyzer` detects PII, and `presidio-anonymizer` redacts or replaces it. The default analyzer configuration also relies on a spaCy English model (the Presidio documentation uses `en_core_web_lg`), which can be fetched with `python -m spacy download en_core_web_lg`.
- Detect PII Entities: Use the `AnalyzerEngine` to find PII in a string. It loads a set of default recognizers for common entities such as names, phone numbers, and credit card numbers. The `analyze` method returns a list of results, each giving the entity type, the start and end character offsets, and a confidence score.
- Anonymize Detected PII: Use the `AnonymizerEngine` to redact the entities found by the analyzer. Pass the original text and the analyzer results to the anonymizer's `anonymize` method. By default, it replaces each entity with its type in angle brackets (e.g., 'John Doe' becomes '<PERSON>').
- Customize Redaction Rules: Use `OperatorConfig` to control how each entity type is redacted. For example, you can 'replace' a person's name with a fixed placeholder and 'hash' a phone number, which gives finer-grained control over the output.
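To make the detect-then-anonymize flow concrete, here is a minimal plain-Python sketch of the default behaviour described in the steps above: each detected span is swapped for its entity type in angle brackets. It does not use Presidio. The `Span` class and `redact_spans` function are hypothetical stand-ins for an analyzer result and the default replacement, and the offsets and scores are written by hand for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for one analyzer result (not a Presidio class)."""
    entity_type: str
    start: int   # index of the first character of the entity
    end: int     # index one past the last character of the entity
    score: float  # confidence value; illustrative only


def redact_spans(text: str, spans: list[Span]) -> str:
    """Replace each span with <ENTITY_TYPE>, working right to left."""
    # Processing from the end keeps earlier offsets valid while the
    # text changes length.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


sample = "Jane Doe called from 212-555-1234."
# Hand-written offsets; in practice these would come from the analyzer.
spans = [
    Span("PERSON", 0, 8, 0.85),
    Span("PHONE_NUMBER", 21, 33, 0.75),
]
redacted = redact_spans(sample, spans)
print(redacted)
```

The right-to-left order matters because a placeholder is rarely the same length as the text it replaces, so replacing from the start would shift every later offset.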
Starter code
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
# 1. Define the text with PII
text_with_pii = "A message from Jane Doe to customer support. My phone number is 212-555-1234 and my email is jane.d@email.com."
# 2. Initialize the AnalyzerEngine
analyzer = AnalyzerEngine()
# 3. Call analyzer to find PII entities
analyzer_results = analyzer.analyze(text=text_with_pii, language='en')
# 4. Initialize the AnonymizerEngine
anonymizer = AnonymizerEngine()
# 5. Define custom redaction rules (operators), keyed by entity type
# - Replace PERSON with a static value
# - Mask the first 15 characters of EMAIL_ADDRESS with asterisks
# - Hash PHONE_NUMBER using SHA-256
custom_operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "[REDACTED_CUSTOMER]"}),
    "EMAIL_ADDRESS": OperatorConfig("mask", {"type": "mask", "masking_char": "*", "chars_to_mask": 15, "from_end": False}),
    "PHONE_NUMBER": OperatorConfig("hash", {"hash_type": "sha256"}),
}
# 6. Anonymize the text using the custom rules
anonymized_output = anonymizer.anonymize(
    text=text_with_pii,
    analyzer_results=analyzer_results,
    operators=custom_operators,
)
# --- Print results ---
print(f"Original Text:\n{text_with_pii}\n")
print(f"PII Entities Found:\n{analyzer_results}\n")
print(f"Anonymized Text:\n{anonymized_output.text}")
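The mask rule in the starter code sets `chars_to_mask` to 15, while the sample address `jane.d@email.com` has 16 characters, so one character stays visible. The plain-Python sketch below approximates that masking behaviour so the effect of `chars_to_mask` and `from_end` is easy to see. `mask_chars` is a hypothetical helper, not Presidio's implementation, and it assumes the detected span covers the whole address.

```python
def mask_chars(value: str, masking_char: str, chars_to_mask: int, from_end: bool) -> str:
    """Approximate a 'mask' rule: overwrite up to chars_to_mask characters."""
    # Never mask more characters than the value contains.
    count = max(0, min(chars_to_mask, len(value)))
    if from_end:
        # Keep the leading characters, overwrite the trailing ones.
        return value[:len(value) - count] + masking_char * count
    # Overwrite the leading characters, keep the rest.
    return masking_char * count + value[count:]


email = "jane.d@email.com"  # 16 characters
masked_from_start = mask_chars(email, "*", 15, from_end=False)
masked_from_end = mask_chars(email, "*", 15, from_end=True)
print(masked_from_start)
print(masked_from_end)
```

If the whole value should be hidden regardless of length, a `chars_to_mask` value at least as large as the longest expected entity is one option; a 'replace' rule is another.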