Function that uses a regular expression to keep certain words from being replaced

Asked by Leyla Elkhamlichi on 2024-08-06 02:33:13 +0800 CST

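I have a script that anonymizes personal data: when a string contains words that start with a capital letter, a function replaces them (that is, it anonymizes names).

I want to write a function in which a regular expression looks for the words given in a list. When the string contains one of the words from that list, that word should not be replaced. For example: Mijn naam is kim en ik heb een opleiding gevolgd aan de Universiteit van Amsterdam

Because Universiteit van Amsterdam is capitalized, the other function would anonymize it. I want to create an additional regex-based function that takes a list of words and leaves any match from that list alone.
I already have a function that does the replacing, but I want the matched words to be ignored.

This is the anonymizeNames function: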

import re

# Note: strz, tussenVList and pureNLP2 are not defined in this excerpt.
def anonymizeNames(sentence):
    '''
        :param sentence: the input sentence
        :return: the sentence without names
    '''

    ##define x
    x = ""

    ##Check naam: indication
    names0Reg = "[Aa]chternaam:|[Vv]oornaam:|[Nn]aam:|[Nn]amen:"
    res = re.search(names0Reg, sentence)
    if res != None:
        ##Achternaam:, voornaam: or naam: or namen: occurs; next Standardize
        sentence = re.sub('[Nn]amen:', 'naam:', sentence)
        sentence = re.sub('[Aa]chternaam:', 'naam:', sentence)
        sentence = re.sub('[Vv]oornaam:', 'naam:', sentence)
        sentence = re.sub('Naam:', 'naam:', sentence)

        ##Extract names
        names00Reg = "naam: [A-Za-z]+"
        x = re.findall(names00Reg, sentence)
        for y in x:
            ##remove naam:\s
            y = re.sub('naam: ', '', y)
            ##Check for tussenvoegsels
            if y in tussenVList:
                ##Add next word
                regTest = y + " " + "[A-Za-z]+"
                x2 = re.search(regTest, sentence)
                if x2 != None:
                    ##Name found
                    y = x2.group()
                    ##replace
                    sentence = re.sub(y, strz, sentence)

    ##Always check sentences for names 1
    names1Reg = "[Ii]k [Bb]en ([A-Z]{1}[a-z ]{2,})+[\\.\\,]*"
    res = re.search(names1Reg, sentence)
    if res != None:
        ##adjust result
        x = re.sub('[Ii]k [Bb]en ', '', res.group())
        x = re.sub('[\\,\\.]', '', x)
        ##use NLP to only keep names
        

    ##Always check sentences for names 2
    names2Reg = r"[Mm]ijn [Nn]aam is ([A-Z]{1}[a-z\s-]{2,})+[\.\,]*"
    res = re.search(names2Reg, sentence)
    if res != None:
        ##adjust result
        x = re.sub('[Mm]ijn [Nn]aam is ', '', res.group())
        x = re.sub('[\\,\\.]', '', x)
        ##use NLP to only keep names
        

    ##Check for single letter followed by dot and series of letters
    if x == "":
        regNameLet = r"^[A-Z]{1}\.[A-Za-z]{2,}|\s[A-Z]{1}\.[A-Za-z]{2,}"
        res = re.search(regNameLet, sentence)
        if res != None:
            ##replace word in sentence, first at start
            sentence = re.sub(r'^[A-Z]{1}\.[A-Za-z]{2,}', strz, sentence)
            ##next in sentence with additional space
            strY = " " + strz
            ##use strY so the whitespace consumed by \s is put back
            sentence = re.sub(r'\s[A-Z]{1}\.[A-Za-z]{2,}', strY, sentence)

    ##Check for occurence of two subsequent uppercase words (might be a name)
    if x == "":
        res = re.findall(r"[A-Z]{1}[a-z]{2,}\s[A-Z]{1}[a-z]{2,}", sentence)
        if res != []:
            for y in res:
                if len(y) > 1:
                    ##replace name with strX
                    sentence = re.sub(y, strz, sentence)

    ##Always recheck remaining sentence with NLP to make sure all personal info is removed
    sentence = pureNLP2(sentence)  ##pureNLP2 tries to include entity checks

    return (sentence)

This is my function for finding university names; I do not want them to be replaced:

schools = ['Hogenschool Amsterdam', 'Universiteit van Amsterdam']
strX = 'xxx'

def school(sentence):
    # replaceNice is a helper that is not shown in this excerpt
    for schoolname in schools:
        res = re.findall(schoolname, sentence)
        if res != []:
            for y in res:
                if len(y) > 1:
                    sentence = replaceNice(sentence, strX, y)
    return sentence

print(school('Mijn naam is Kim en ik volg een opleiding aan de Universiteit van Amsterdam'))

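Output: Mijn naam xxx en ik volg een opleiding aan de xxx xxx

The output I want is: Mijn naam is Kim en ik volg een opleiding aan de Universiteit van Amsterdam

I feel like I have made a start, but I am stuck on what to do with the sentence variable: if the string contains a matching word from the list of schools, I want to leave it unreplaced and simply return it as it was.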

1 Answer

Michael Cao · Answer 1 · 2024-08-06T03:05:39+08:00

Best Answer

Replace every safe word with its lowercase version, run the anonymization, then restore the lowercased safe words to their original form.

test_strings = ['Adam goes to Universiteit van Amsterdam', 'George goes to Washington College', 'Anthony Hopkins is a student at Johns Hopkins']
safe_words = ['Universiteit van Amsterdam', 'Johns Hopkins', 'Washington College']

def anonymize(sentence, safe_words):
    restore = {}

    for word in safe_words:
        sentence = sentence.replace(word, word.lower())

        restore[word.lower()] = word
    
    for word in sentence.split():
        if word[0].isupper():
            sentence = sentence.replace(word, word[0]+'.')
    
    for word, restored_word in restore.items():
        sentence = sentence.replace(word, restored_word)
    
    return sentence

for sentence in test_strings:
    print(anonymize(sentence, safe_words))

Output:

A. goes to Universiteit van Amsterdam
G. goes to Washington College
A. H. is a student at Johns Hopkins
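
Since the question asks for a regex-based function, the same idea can also be sketched with a single pattern instead of the lowercase round trip. This is only an illustration under assumptions: `anonymize_except` and its `replacement` parameter are hypothetical names, and the "any capitalized word" rule is a stand-in for the asker's `anonymizeNames` logic. The pattern tries the safe phrases first; when one of them matches, the callback returns it unchanged, otherwise the capitalized word is replaced.

```python
import re

def anonymize_except(sentence, safe_words, replacement="xxx"):
    """Replace capitalized words, leaving phrases from safe_words untouched."""
    # Longest phrases first, so a longer safe phrase wins over a shorter one.
    ordered = sorted(safe_words, key=len, reverse=True)
    # re.escape keeps characters such as '.' in a phrase from acting as regex syntax.
    # "(?!)" never matches; it is used only when the safe list is empty.
    safe = "|".join(re.escape(w) for w in ordered) if ordered else r"(?!)"
    # Group 1: a safe phrase. Second alternative: any word starting with A-Z.
    pattern = re.compile(rf"\b({safe})\b|\b[A-Z][A-Za-z]*\b")

    def keep_or_replace(match):
        # A safe phrase matched: hand it back unchanged.
        if match.group(1) is not None:
            return match.group(1)
        return replacement

    return pattern.sub(keep_or_replace, sentence)

safe_words = ['Universiteit van Amsterdam', 'Johns Hopkins', 'Washington College']
print(anonymize_except('Anthony Hopkins is a student at Johns Hopkins', safe_words))
```

Because the safe phrases are matched as whole phrases, the surname Hopkins in "Anthony Hopkins" is still replaced while "Johns Hopkins" is kept. Note that this naive rule also replaces a capitalized first word of a sentence, so a real anonymizer would need a more careful name pattern.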
