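Presidio is an open-source data protection project maintained by Microsoft. It consists of three main modules:
- Presidio analyzer: scans text data for sensitive information.
- Presidio anonymizer: de-identifies the sensitive entities that have been detected.
- Presidio image redactor: uses OCR to detect sensitive information in images and redacts it.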
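The design of the anonymizer module is shown in the diagram below:
The anonymizer contains two kinds of operators: Anonymizers and Deanonymizers. Anonymizers de-identify sensitive entities. Deanonymizers restore de-identified data, for example by decrypting sensitive data that was encrypted.
For how to create and use an anonymizer, see the introduction in the article "[Data Protection] Microsoft's Open-Source Data Protection Project Presidio: From Getting Started to Mastery". This article covers the anonymizer's built-in operators and how overlapping sensitive entities are resolved.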
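Anonymization
replace
Replaces the sensitive entity with a value of your choice.
Parameters:
- new_value: the value that replaces the sensitive data. If this parameter is omitted, the entity is replaced with its entity type name, for example <PERSON>. The example below passes "XXX":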
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

text = "My name is Bond, James Bond"
engine = AnonymizerEngine()
result = engine.anonymize(
    text=text,
    # start/end are character offsets into text; end is exclusive.
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    # Replace every PERSON entity with the literal string "XXX".
    operators={"PERSON": OperatorConfig("replace", {"new_value": "XXX"})},
)
print("Before:", text)
print("After:", result.text)
Output:
Before: My name is Bond, James Bond
After: My name is XXX, XXX
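To make the role of the start and end offsets concrete, here is a minimal plain-Python sketch of span replacement. It is not Presidio's implementation: the function name `replace_spans` and the tuple layout are made up for illustration. It applies the spans from right to left so that the offsets of spans not yet processed stay valid, and it falls back to the entity type name in angle brackets when no new value is given.

```python
from typing import List, Optional, Tuple

# (entity_type, start, end); end is exclusive. Hypothetical layout for this sketch.
Span = Tuple[str, int, int]


def replace_spans(text: str, spans: List[Span], new_value: Optional[str] = None) -> str:
    """Replace each span with new_value, or with <ENTITY_TYPE> if new_value is None."""
    # Process the rightmost span first so that a change in length
    # never shifts the offsets of spans that are still to be replaced.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        replacement = new_value if new_value is not None else f"<{entity_type}>"
        text = text[:start] + replacement + text[end:]
    return text


text = "My name is Bond, James Bond"
spans = [("PERSON", 11, 15), ("PERSON", 17, 27)]
print(replace_spans(text, spans, "XXX"))  # My name is XXX, XXX
print(replace_spans(text, spans))         # My name is <PERSON>, <PERSON>
```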
redact
Removes the sensitive data from the text.
Parameters: none
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

text = "My name is Bond, James Bond"
engine = AnonymizerEngine()
result = engine.anonymize(
    text=text,
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    # redact takes no parameters: the entity text is simply removed.
    operators={"PERSON": OperatorConfig("redact")},
)
print("Before:", text)
print("After:", result.text)
Output:
Before: My name is Bond, James Bond
After: My name is ,
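mask
Replaces characters of the sensitive data with a given masking character.
Parameters:
- chars_to_mask: the number of characters to mask;
- masking_char: the character used as the replacement;
- from_end: whether masking starts from the end of the sensitive text.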
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

text = "My name is Bond, James Bond, phone num is 212-555-5555"
engine = AnonymizerEngine()
result = engine.anonymize(
    text=text,
    analyzer_results=[
        RecognizerResult(entity_type="PHONE_NUMBER", start=42, end=54, score=0.8),
    ],
    # Mask all 12 characters of the phone number, starting from its end.
    operators={
        "PHONE_NUMBER": OperatorConfig(
            "mask",
            {"chars_to_mask": 12, "masking_char": "*", "from_end": True},
        )
    },
)
print("Before:", text)
print("After:", result.text)
Output:
Before: My name is Bond, James Bond, phone num is 212-555-5555
After: My name is Bond, James Bond, phone num is ************
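The example above masks the whole entity, which hides how the three parameters interact. The following plain-Python sketch shows partial masking. It is only an illustration of the idea, not Presidio's code: `mask_text` is a hypothetical helper, and clamping chars_to_mask to the entity length is an assumption of this sketch.

```python
def mask_text(value: str, masking_char: str, chars_to_mask: int, from_end: bool) -> str:
    """Mask up to chars_to_mask characters of value, from the start or from the end."""
    # Assumption of this sketch: never mask more characters than the value has.
    n = max(0, min(chars_to_mask, len(value)))
    if n == 0:
        return value
    if from_end:
        return value[:-n] + masking_char * n
    return masking_char * n + value[n:]


phone = "212-555-5555"
print(mask_text(phone, "*", 4, True))    # 212-555-****
print(mask_text(phone, "*", 3, False))   # ***-555-5555
print(mask_text(phone, "*", 12, True))   # ************
```

Masking only the last four characters keeps the area code readable, which is often enough to leave the text useful while hiding the identifying part.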
hash
Replaces the sensitive data with its hash value.
Parameters:
- hash_type: the hash algorithm to use: sha256, sha512, or md5. The default is sha256.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

text = "My name is Bond, James Bond"
engine = AnonymizerEngine()
result = engine.anonymize(
    text=text,
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    # Replace each PERSON entity with the MD5 digest of its text.
    operators={"PERSON": OperatorConfig("hash", {"hash_type": "md5"})},
)
print("Before:", text)
print("After:", result.text)
Output:
Before: My name is Bond, James Bond
After: My name is 7a3b691731db2969498907b960183633, 0a363424ef5ceaa17d33dfe4c545d7f3
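What the hash operator does to a single entity can be sketched with Python's standard `hashlib`. This is a minimal illustration, assuming the entity text is hashed as UTF-8 without a salt; whether it reproduces the digests shown above depends on the Presidio version, so no digest values are claimed here. `hash_entity` is a hypothetical helper name.

```python
import hashlib


def hash_entity(value: str, hash_type: str = "sha256") -> str:
    """Return the hex digest of value for one of the three algorithms named above."""
    if hash_type not in ("sha256", "sha512", "md5"):
        raise ValueError(f"unsupported hash_type: {hash_type}")
    # Assumption of this sketch: plain UTF-8 bytes, no salt.
    return hashlib.new(hash_type, value.encode("utf-8")).hexdigest()


for name in ("Bond", "James Bond"):
    print(name, "->", hash_entity(name, "md5"))

# Digest length in hex characters depends only on the algorithm.
print(len(hash_entity("Bond", "md5")))     # 32
print(len(hash_entity("Bond")))            # 64
print(len(hash_entity("Bond", "sha512")))  # 128
```

Because the same input always produces the same digest, identical entities remain linkable after hashing, which is useful for joins but also means short or guessable values can be recovered by brute force.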
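encrypt
Encrypts the sensitive data with the given key and replaces the original text with the ciphertext.
Parameters:
- key: a string used as the encryption key. Presidio's documentation specifies a key length of 128, 192, or 256 bits.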
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

text = "My name is Bond, James Bond"
engine = AnonymizerEngine()
result = engine.anonymize(
    text=text,
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    # The key is 16 ASCII characters, that is, 128 bits.
    operators={"PERSON": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C&F)J"})},
)
print("Before:", text)
print("After:", result.text)
Output:
Before: My name is Bond, James Bond
After: My name is QtVyESAGf1LKgHxIDF8XQK5j8l9Bue3+nb/aZP0Croo=, qMkcFpDb9NZpV1BM2NQtf6CiCXD+r5gi27SoAeMqDE4=
The ciphertext changes from run to run, so your output will differ from the one shown here.
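Presidio's documentation describes the encrypt key as a string of 128, 192, or 256 bits. A quick way to check a candidate key before handing it to the operator is sketched below. The helper names are hypothetical, and measuring the length on the UTF-8 encoding of the string is an assumption of this sketch.

```python
VALID_KEY_BITS = (128, 192, 256)


def key_bits(key: str) -> int:
    """Return the key size in bits, assuming the key is encoded as UTF-8."""
    return len(key.encode("utf-8")) * 8


def is_valid_key(key: str) -> bool:
    """True if the key has one of the three accepted sizes."""
    return key_bits(key) in VALID_KEY_BITS


print(key_bits("WmZq4t7w!z%C&F)J"), is_valid_key("WmZq4t7w!z%C&F)J"))  # 128 True
print(key_bits("too-short"), is_valid_key("too-short"))                # 72 False
```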
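custom
Applies a user-defined anonymization function that is passed in as a parameter.
Parameters:
- lambda: a lambda or other function; it must return a string.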
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig


def fake(x):
    # x is the original entity text; the return value replaces it.
    return "fake"


text = "My name is Bond, James Bond"
engine = AnonymizerEngine()
result = engine.anonymize(
    text=text,
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("custom", {"lambda": fake})},
)
print("Before:", text)
print("After:", result.text)
Output:
Before: My name is Bond, James Bond
After: My name is fake, fake
keep
Leaves the sensitive data unchanged.
Parameters: none
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

text = "My name is Bond, James Bond"
engine = AnonymizerEngine()
result = engine.anonymize(
    text=text,
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    # keep takes no parameters: the entity text passes through untouched.
    operators={"PERSON": OperatorConfig("keep")},
)
print("Before:", text)
print("After:", result.text)
Output:
Before: My name is Bond, James Bond
After: My name is Bond, James Bond
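Deanonymization
decrypt
Decrypts sensitive data that was previously encrypted.
Parameters:
- key: a string used as the key; it must be the same key that was used for encryption.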
from presidio_anonymizer import AnonymizerEngine, DeanonymizeEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

text = "My name is Bond, James Bond"
engine = AnonymizerEngine()
result = engine.anonymize(
    text=text,
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    operators={"PERSON": OperatorConfig("encrypt", {"key": "WmZq4t7w!z%C&F)J"})},
)
print("Before:", text)
print("After:", result.text)

deanonymize_engine = DeanonymizeEngine()
deanonymized_result = deanonymize_engine.deanonymize(
    text=result.text,
    # result.items records where each encrypted entity sits in result.text.
    entities=result.items,
    # Decrypt every entity with the same key that was used for encryption.
    operators={"DEFAULT": OperatorConfig("decrypt", {"key": "WmZq4t7w!z%C&F)J"})},
)
print("Decrypted:", deanonymized_result.text)
Output:
Before: My name is Bond, James Bond
After: My name is LvF+UsY9reo9dB4uXiS1yzIsM7TcDo1Yozaglz/rzIM=, IVhgF46vR2cwjXAyMa82Y1Go6M9JKq8lal6KxYZVIs8=
Decrypted: My name is Bond, James Bond
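Overlapping sensitive data
During scanning, detected sensitive entities often overlap or cover one another. Presidio applies a different strategy to each case.
No overlap
When the sensitive entities do not overlap, the Presidio anonymizer applies the configured operator to each of them.
Input text:
My name is Inigo Montoya. You Killed my Father. Prepare to die. BTW my number is: 03-232323.
Suppose only "Inigo" is detected as NAME. The result is:
My name is <NAME> Montoya. You Killed my Father. Prepare to die. BTW my number is: 03-232323.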
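Full overlap
A conflict arises when the same string matches several recognizers, so which operator should be applied? Presidio compares the confidence scores of the candidate entity types and applies the operator configured for the type with the highest score. If the scores are equal, Presidio picks one of them arbitrarily.
Input text:
My name is Inigo Montoya. You Killed my Father. Prepare to die. BTW my number is: 03-232323.
Suppose the string "03-232323" is detected both as PHONE_NUMBER with a score of 0.7 and as SSN with a score of 0.6. The higher score wins:
My name is Inigo Montoya. You Killed my Father. Prepare to die. BTW my number is: <PHONE_NUMBER>.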
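Containment
One sensitive string is contained in another, longer sensitive string. Presidio applies the operator for the longer string and ignores the confidence scores of the two results.
Input text:
My name is Inigo Montoya. You Killed my Father. Prepare to die. BTW my number is: 03-232323.
Suppose "Inigo" is detected as FIRST_NAME and "Inigo Montoya" is detected as NAME. The operator for the longer string's type is applied:
My name is <NAME>. You Killed my Father. Prepare to die. BTW my number is: 03-232323.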
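Partial intersection
Two sensitive strings partially intersect. Presidio anonymizes each of them separately and replaces the original text with the concatenation of the two results.
Input text:
My name is Inigo Montoya. You Killed my Father. Prepare to die. BTW my number is: 03-232323.
Suppose the string "03-2323" is detected as PHONE_NUMBER and "232323" is detected as SSN:
My name is Inigo Montoya. You Killed my Father. Prepare to die. BTW my number is: <PHONE_NUMBER><SSN>.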
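The three conflict rules can be summarized in a short plain-Python sketch. It is an illustration of the rules as described in this article, not Presidio's internal code. The `Span` type and `resolve_overlaps` function are hypothetical, and ties between identical spans are broken by list order only because the sketch has to pick something.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int  # exclusive
    score: float


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Drop spans that lose a conflict; keep partially intersecting spans."""
    kept = []
    for i, cand in enumerate(spans):
        drop = False
        for j, other in enumerate(spans):
            if i == j:
                continue
            if (other.start, other.end) == (cand.start, cand.end):
                # Full overlap: higher score wins. Ties are arbitrary;
                # this sketch keeps whichever span came first in the list.
                if other.score > cand.score or (other.score == cand.score and j < i):
                    drop = True
            elif other.start <= cand.start and cand.end <= other.end:
                # Containment: the longer span wins regardless of score.
                drop = True
            # Partial intersection: neither branch fires, so both survive.
            if drop:
                break
        if not drop:
            kept.append(cand)
    return sorted(kept, key=lambda s: (s.start, s.end))


text = ("My name is Inigo Montoya. You Killed my Father. "
        "Prepare to die. BTW my number is: 03-232323.")
n = text.index("Inigo")
p = text.index("03-232323")

full = [Span("PHONE_NUMBER", p, p + 9, 0.7), Span("SSN", p, p + 9, 0.6)]
contained = [Span("FIRST_NAME", n, n + 5, 0.9), Span("NAME", n, n + 13, 0.5)]
partial = [Span("PHONE_NUMBER", p, p + 7, 0.7), Span("SSN", p + 3, p + 9, 0.6)]

print([s.entity_type for s in resolve_overlaps(full)])       # ['PHONE_NUMBER']
print([s.entity_type for s in resolve_overlaps(contained)])  # ['NAME']
print([s.entity_type for s in resolve_overlaps(partial)])    # ['PHONE_NUMBER', 'SSN']
```

Note that in the containment case the shorter FIRST_NAME span is dropped even though its score is higher, which matches the rule that length takes precedence over confidence.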
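Custom anonymization operators
Users can develop their own anonymization operators as needed and integrate them into the anonymizer module.
The built-in operators live in the /presidio_anonymizer/operators folder of the source tree, where you can read the implementation of each one.
To add an operator, create a custom class that inherits from Operator and override four methods: operate, validate, operator_name, and operator_type. The operate method performs the anonymization. The validate method checks that the configuration parameters passed in by the user have valid types. The operator_name method returns the name of the operator. The operator_type method returns the fixed value OperatorType.Anonymize.
Finally, add the name of the custom class to /presidio_anonymizer/operators/__init__.py.
The implementation of the replace operator can serve as a reference: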
from typing import Dict

from presidio_anonymizer.operators import Operator, OperatorType
from presidio_anonymizer.services.validators import validate_type


class Replace(Operator):
    """Receives new text to replace old PII text entity with."""

    NEW_VALUE = "new_value"

    def operate(self, text: str = None, params: Dict = None) -> str:
        """:return: new_value."""
        new_val = params.get(self.NEW_VALUE)
        # Fall back to the entity type name when no new value is given.
        if not new_val:
            return f"<{params.get('entity_type')}>"
        return new_val

    def validate(self, params: Dict = None) -> None:
        """Validate the new value is string."""
        validate_type(params.get(self.NEW_VALUE), self.NEW_VALUE, str)
        pass

    def operator_name(self) -> str:
        """Return operator name."""
        return "replace"

    def operator_type(self) -> OperatorType:
        """Return operator type."""
        return OperatorType.Anonymize
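Following the same four-method pattern, a custom operator might look like the sketch below. `Initials` is a hypothetical operator invented for illustration; it is not part of Presidio. The stand-in base classes in the `except` branch exist only so the snippet runs without Presidio installed, and registering the class with the engine is not shown.

```python
from typing import Dict

try:
    from presidio_anonymizer.operators import Operator, OperatorType
except ImportError:
    # Stand-ins so that this sketch runs without Presidio installed.
    from abc import ABC
    from enum import Enum

    class OperatorType(Enum):
        Anonymize = 1
        Deanonymize = 2

    class Operator(ABC):
        pass


class Initials(Operator):
    """Hypothetical operator: 'James Bond' becomes 'J.B.'."""

    def operate(self, text: str = None, params: Dict = None) -> str:
        """Return the upper-cased first letter of each word, each followed by a dot."""
        return "".join(word[0].upper() + "." for word in text.split())

    def validate(self, params: Dict = None) -> None:
        """This operator takes no parameters, so there is nothing to check."""
        pass

    def operator_name(self) -> str:
        """Return the name used to select this operator in an OperatorConfig."""
        return "initials"

    def operator_type(self) -> OperatorType:
        """This operator anonymizes; it does not deanonymize."""
        return OperatorType.Anonymize


op = Initials()
print(op.operator_name(), op.operate("James Bond", {}))  # initials J.B.
```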