Question

R 9000

Asked: 2023-03-03 00:48:24 +0800 CST

The ultimate tool for splitting/extracting street names from street numbers

I have an addresses.csv containing addresses in various international formats:

Example Street 1
Teststraße 2
Teststr. 1-5
Baker Street 221b
221B Baker Street
19th Ave 3B
3B 2nd Ave
1-3 2nd Mount x Ave
105 Lock St # 219
Test Street, 1
BookAve, 54, Extra Text 123#

For example, in Germany we write Teststraße 2, while in the US it is 2 Test Street.

Is there a way to separate/extract all street names and street numbers? Output-names.csv

Example Street
Teststraße
Teststr.
Baker Street
Baker Street
19th Ave
2nd Ave
2nd Mount Good Ave
Lock St # 219
Test Street
BookAve

Output-numbers.csv

Output-extra_text.csv











Extra Text 123#

I am using macOS 13. The shell is zsh 5.8.1 or bash 3.2.

My idea: you could first sort the addresses into cases like this:

x='19th Ave 19B'    # the address line to classify

# Regexes live in variables so [[ =~ ]] is parsed the same way by bash 3.2 and zsh.
re_alpha='^[[:alpha:]]'
re_digit='^[0-9]'
re_num_sp='^[0-9]+[[:space:]]'
re_numtext_sp='^[0-9]+[[:alpha:]][[:space:]]+[[:alpha:]]'
re_name_num='^[0-9]+[[:alpha:]]+[[:space:]].*[[:space:]][0-9]+[[:alpha:]]*$'
re_num_name='^[0-9]+[[:alpha:]]*[[:space:]]+[0-9]+[[:alpha:]]+[[:space:]]'

if [[ $x =~ $re_alpha ]]; then
    echo 'It begins with the STREET-NAME'
elif [[ $x =~ $re_digit ]]; then
    # Maybe a STREET-NAME like "19th Ave 19B" or a STREET-NUMBER like "19B Street"
    if [[ $x =~ $re_num_sp ]]; then
        echo 'It begins with the STREET-NUMBER like "1 Street"'
        # NUMBER = '1' / NAME = 'Street'
    elif [[ $x =~ $re_numtext_sp ]]; then
        echo 'Something like "1A Street"'
        # NUMBER = '1A' / NAME = 'Street'
    elif [[ $x =~ $re_name_num ]]; then
        echo 'For example 19th Street 19B -> The last number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    elif [[ $x =~ $re_num_name ]]; then
        echo 'For example 19B 19th Street -> The first number+text is the number (19B)'
        # NUMBER = '19B' / NAME = '19th Street'
    else
        echo 'INVALID'
    fi
else
    echo 'INVALID'
fi

1 Answer

Ed Morton · Answer 1 · 2023-03-03T03:25:54+08:00

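IMHO all you can do is make a best effort with a series of regexps that match the addresses you know about, e.g. using GNU awk for the 3rd arg to match() and \s as shorthand for [[:space:]], with 3 possible regexps defined: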

$ cat tst.awk
BEGIN { OFS="\",\"" }
{
    name = number = type = ""
    gsub(/"/,"\"\"")
}
match($0,/^([^0-9]+)([0-9]+(-[0-9]+)?[[:alpha:]]?)$/,a) {
    # Example Street 1
    # Teststraße 2
    # Teststr. 1-5
    # Baker Street 221b
    # Test Street, 1
    type   = 1
    name   = a[1]
    number = a[2]
}
!type && match($0,/^([0-9]+[[:alpha:]])\s+([^0-9]+)$/,a) {
    # 221B Baker Street
    type   = 2
    name   = a[2]
    number = a[1]
}
!type && match($0,/^([0-9]+[[:alpha:]]{2}.*)\s+([0-9]+[[:alpha:]]?)$/,a) {
    # 19th Ave 3B
    type   = 3
    name   = a[1]
    number = a[2]
}
{
    gsub(/^\s+|\s+$/,"",name)
    gsub(/^\s+|\s+$/,"",number)
    if ( !doneHdr++ ) {
        print "\"" "type", "name", "number", "$0" "\""
    }
    print "\"" type, name, number, $0 "\""
}

$ awk -f tst.awk file
"type","name","number","$0"
"1","Example Street","1","Example Street 1"
"1","Teststraße","2","Teststraße 2"
"1","Teststr.","1-5","Teststr. 1-5"
"1","Baker Street","221b","Baker Street 221b"
"2","Baker Street","221B","221B Baker Street"
"3","19th Ave","3B","19th Ave 3B"
"","","","3B 2nd Ave"
"","","","1-3 2nd Mount x Ave"
"","","","105 Lock St # 219"
"1","Test Street,","1","Test Street, 1"
"","","","BookAve, 54, Extra Text 123#"

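You would add further regexps to match the address formats you know about, in an order such that if an address could match 2 or more regexps, the more restrictive regexp comes first. You might actually want to modify the above to print a warning if an address matches 2 or more regexps, since you may then want to tweak, reorder, or combine them.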
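A minimal sketch of that multi-match warning, written with grep -E in POSIX shell rather than GNU awk so it can be tried on its own. The first three patterns mirror the regexps in tst.awk (with \s spelled as [[:space:]]); the fourth pattern and the function names count_matches and warn_if_ambiguous are hypothetical, added only to show what an overlap looks like.

```shell
# Sketch only: patterns 1-3 mirror tst.awk; pattern 4 is a hypothetical,
# deliberately loose pattern that overlaps with pattern 3.
count_matches() (
    addr=$1
    n=0
    for re in \
        '^([^0-9]+)([0-9]+(-[0-9]+)?[[:alpha:]]?)$' \
        '^([0-9]+[[:alpha:]])[[:space:]]+([^0-9]+)$' \
        '^([0-9]+[[:alpha:]]{2}.*)[[:space:]]+([0-9]+[[:alpha:]]?)$' \
        '^[0-9].*[[:space:]][0-9]+[[:alpha:]]?$'
    do
        # Count every pattern that matches, instead of stopping at the first.
        if printf '%s\n' "$addr" | grep -Eq -- "$re"; then
            n=$((n + 1))
        fi
    done
    printf '%s\n' "$n"
)

# Print a warning on stderr when more than one pattern matches the address.
warn_if_ambiguous() {
    matches=$(count_matches "$1")
    if [ "$matches" -gt 1 ]; then
        printf 'warning: "%s" matches %s patterns\n' "$1" "$matches" >&2
    fi
}
```

Calling warn_if_ambiguous '19th Ave 3B' reports two matching patterns here, which is the cue to tighten or reorder them; in tst.awk itself the same idea would amount to counting matches instead of guarding each rule with !type.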

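If you reach the final print with type still empty, that is the "INVALID" case, and you can then write/add a new regexp to match those addresses as appropriate.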
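To see which addresses still need a new regexp, one option is to list only the lines that no known pattern matches. The sketch below assumes one address per line on stdin and uses grep -Ev with the same three patterns as tst.awk (again with [[:space:]] instead of \s); the function name unmatched_addresses is hypothetical.

```shell
# Print the addresses that none of the three known patterns match.
# grep -v with several -e options keeps a line only if every pattern fails.
unmatched_addresses() {
    grep -Ev \
        -e '^([^0-9]+)([0-9]+(-[0-9]+)?[[:alpha:]]?)$' \
        -e '^([0-9]+[[:alpha:]])[[:space:]]+([^0-9]+)$' \
        -e '^([0-9]+[[:alpha:]]{2}.*)[[:space:]]+([0-9]+[[:alpha:]]?)$'
}
```

It could be run as unmatched_addresses < addresses.csv; note that grep exits with status 1 when every address is already covered, which is the desired end state.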

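I do expect you will hit cases where you cannot write code to distinguish one address format from another, but hopefully this best-effort approach will be good enough for your needs. If you have the city/state/county, you could always use curl to look an address up on Google Maps to see if it is real, as a last-ditch effort for the addresses you cannot identify (but that would take forever if you tried to do it for all of your addresses).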

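Once the address identification algorithm is working, generate the output however you like; I am just dumping CSV above for ease of development/testing.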
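As one possible way to get from that CSV to per-column files like the ones the question asks for, here is a rough sketch. It assumes the 4-column quoted CSV printed by tst.awk arrives on stdin, that no field contains the three-character sequence "," itself, and that the output file names names.csv and numbers.csv are acceptable; the function name split_columns is hypothetical.

```shell
# Split the CSV from tst.awk (read on stdin) into one file per column.
# Unrecognised addresses produce empty lines, so rows stay aligned with the input.
split_columns() {
    outdir=$1
    awk -F'","' -v dir="$outdir" '
        NR == 1 { next }                      # skip the header row
        {
            print $2 > (dir "/names.csv")     # street name
            print $3 > (dir "/numbers.csv")   # street number
        }
    '
}
```

For example, awk -f tst.awk file | split_columns . would write both files into the current directory.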
