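I have a script that anonymizes personal data: when a string contains words that start with a capital letter, it replaces them using another function (i.e. it anonymizes names).
I want to write a function in which a regular expression looks for the words given in a list. When the string contains one of the words in that list, that word should not be replaced. For example: Mijn naam is kim en ik heb een opleiding gevolgd aan de Universiteit van Amsterdam
Because Universiteit van Amsterdam is written with capital letters, it gets anonymized by the other function. I want to create an extra function, using a regular expression, that takes a list of words and leaves any of those words untouched when they occur in the string.
I already have a function that does the replacing, but I want the matched words to be ignored.
This is the anonymizeNames function: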
    import re

    def anonymizeNames(sentence):
        '''
        :param sentence: the input sentence
        :return: the sentence without names
        '''
        ##define x
        x = ""
        ##Check naam: indication
        names0Reg = "[Aa]chternaam:|[Vv]oornaam:|[Nn]aam:|[Nn]amen:"
        res = re.search(names0Reg, sentence)
        if res is not None:
            ##Achternaam:, voornaam: or naam: or namen: occurs; next Standardize
            sentence = re.sub('[Nn]amen:', 'naam:', sentence)
            sentence = re.sub('[Aa]chternaam:', 'naam:', sentence)
            sentence = re.sub('[Vv]oornaam:', 'naam:', sentence)
            sentence = re.sub('Naam:', 'naam:', sentence)
            ##Extract names
            names00Reg = "naam: [A-Za-z]+"
            x = re.findall(names00Reg, sentence)
            for y in x:
                ##remove naam:\s
                y = re.sub('naam: ', '', y)
                ##Check for tussenvoegsels
                if y in tussenVList:
                    ##Add next word
                    regTest = y + " " + "[A-Za-z]+"
                    x2 = re.search(regTest, sentence)
                    if x2 is not None:
                        ##Name found
                        y = x2.group()
                ##replace
                sentence = re.sub(y, strz, sentence)
        ##Always check sentences for names 1
        names1Reg = "[Ii]k [Bb]en ([A-Z]{1}[a-z ]{2,})+[\\.\\,]*"
        res = re.search(names1Reg, sentence)
        if res is not None:
            ##adjust result
            x = re.sub('[Ii]k [Bb]en ', '', res.group())
            x = re.sub('[\\,\\.]', '', x)
            ##use NLP to only keep names
        ##Always check sentences for names 2
        names2Reg = "[Mm]ijn [Nn]aam is ([A-Z]{1}[a-z\s-]{2,})+[\\.\\,]*"
        res = re.search(names2Reg, sentence)
        if res is not None:
            ##adjust result
            x = re.sub('[Mm]ijn [Nn]aam is ', '', res.group())
            x = re.sub('[\\,\\.]', '', x)
            ##use NLP to only keep names
        ##Check for single letter followed by dot and series of letters
        if x == "":
            regNameLet = "^[A-Z]{1}\\.[A-Za-z]{2,}|\s[A-Z]{1}\\.[A-Za-z]{2,}"
            res = re.search(regNameLet, sentence)
            if res is not None:
                ##replace word in sentence, first at start
                sentence = re.sub('^[A-Z]{1}\\.[A-Za-z]{2,}', strz, sentence)
                ##next in sentence with additional space
                strY = " " + strz
                sentence = re.sub('\s[A-Z]{1}\\.[A-Za-z]{2,}', strY, sentence)
        ##Check for occurrence of two subsequent uppercase words (might be a name)
        if x == "":
            res = re.findall("[A-Z]{1}[a-z]{2,}\s[A-Z]{1}[a-z]{2,}", sentence)
            if res != []:
                for y in res:
                    if len(y) > 1:
                        ##replace name with strz
                        sentence = re.sub(y, strz, sentence)
        ##Always recheck remaining sentence with NLP to make sure all personal info is removed
        sentence = pureNLP2(sentence)  ##pureNLP2 tries to include entity checks
        return sentence
This is my function that looks for university names; I do not want this function to replace them:
    schools = ['Hogenschool Amsterdam', 'Universiteit van Amsterdam']
    strX = 'xxx'

    def school(sentence):
        # Look for each known school name in the sentence
        for schoolname in schools:
            res = re.findall(schoolname, sentence)
            if res != []:
                for y in res:
                    if len(y) > 1:
                        sentence = replaceNice(sentence, strX, y)
        return sentence

    print(school('Mijn naam is Kim en ik volg een opleiding aan de Universiteit van Amsterdam'))
Output: Mijn naam xxx en ik volg een opleiding aan de xxx xxx
The output I want is:
Mijn naam is Kim en ik volg een opleiding aan de Universiteit van Amsterdam
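I feel I have made a start, but I am stuck on how to finish handling the sentence variable: if the string contains a matching word from the school list, I want that word not to be replaced but simply returned as it is.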
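One way to express "replace capitalized words unless they are in the list" is a single regular expression whose first alternative matches the protected phrases, combined with a replacement callback that returns those matches unchanged. The following is only a minimal sketch under assumptions: `SAFE_PHRASES`, `PLACEHOLDER` and `anonymize_capitalized` are hypothetical names, and the "any capitalized word is a name" rule is a deliberate simplification of what anonymizeNames does.

```python
import re

# Hypothetical names, not part of the original script.
SAFE_PHRASES = ['Hogenschool Amsterdam', 'Universiteit van Amsterdam']
PLACEHOLDER = 'xxx'

# Protected phrases come first in the alternation, longest first, so they are
# tried before the generic capitalized-word branch at the same position.
safe_alternation = '|'.join(
    re.escape(phrase) for phrase in sorted(SAFE_PHRASES, key=len, reverse=True)
)
PATTERN = re.compile(
    r'\b(?:(?P<safe>' + safe_alternation + r')|(?P<name>[A-Z][a-z]+))\b'
)

def anonymize_capitalized(sentence):
    """Replace capitalized words, but leave protected phrases untouched."""
    def replace(match):
        if match.group('safe') is not None:
            # A protected phrase matched: give it back unchanged.
            return match.group('safe')
        # Simplifying assumption: any other capitalized word is a name.
        return PLACEHOLDER
    return PATTERN.sub(replace, sentence)

print(anonymize_capitalized('ik ben Kim en ik studeer aan de Universiteit van Amsterdam'))
# ik ben xxx en ik studeer aan de Universiteit van Amsterdam
```

Because this simplified pattern treats every capitalized word as a name, a sentence-initial word such as Mijn would also be replaced; the real name rules would have to go in the second branch.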
Replace every safe word with its lowercase version, then anonymize, then restore the lowercased safe words to their original form.
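The three steps can be sketched roughly as follows. This is an illustration under assumptions, not the exact code from the script: `SAFE_WORDS`, `anonymize` and `anonymize_except_safe` are hypothetical names, and `anonymize` is a simplified stand-in for anonymizeNames that replaces any capitalized word not at the start of the sentence. Note that step 3 also capitalizes a safe word that was already lowercase in the input.

```python
import re

# Hypothetical names, not part of the original script.
SAFE_WORDS = ['Hogenschool Amsterdam', 'Universiteit van Amsterdam']
PLACEHOLDER = 'xxx'

def anonymize(sentence):
    """Simplified stand-in for anonymizeNames (an assumption for this sketch):
    replace every capitalized word that is preceded by whitespace."""
    return re.sub(r'(?<=\s)[A-Z][a-z]+', PLACEHOLDER, sentence)

def anonymize_except_safe(sentence, safe_words=SAFE_WORDS):
    # Step 1: lowercase each safe word so the anonymizer no longer sees capitals.
    for word in safe_words:
        sentence = sentence.replace(word, word.lower())
    # Step 2: anonymize as usual.
    sentence = anonymize(sentence)
    # Step 3: restore the original capitalization of the safe words.
    for word in safe_words:
        sentence = sentence.replace(word.lower(), word)
    return sentence

print(anonymize_except_safe(
    'Mijn naam is Kim en ik volg een opleiding aan de Universiteit van Amsterdam'
))
# Mijn naam is xxx en ik volg een opleiding aan de Universiteit van Amsterdam
```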
Output: