
Question

DonJ

Asked: 2018-03-25 11:04:36 +0800 CST

Searching multiple files for the matching line with the larger number in column 3


I have several files with content like this:

Main file 1:

test01:6733:4370:5342
test02:7776:2018:1001
test03:9865:5632:1429
test04:8477:4757:1890
test05:8019:8860:5298
test06:5602:3100:6995
test07:1445:2850:2755
test08:10924:2562:4867
test09:2575:1884:1611

Example file 2:

test01:8777:1060:9236
test02:1322:1211:10837
test04:3737:10175:5219
test05:8467:8988:9739
test06:7452:3100:2709
test08:4707:9047:10578
test09:8669:2867:8233
test10:8615:10002:7056

Example file 3:

test01:10957:8172:2472
test02:1401:6160:5894
test03:7245:8934:5725
test04:8477:10106:10069
test05:10769:10381:1102
test06:3605:3713:7695
test08:10924:2562:10568
test09:2913:5628:1305
test10:5501:10293:2319

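I want to update each line in main file 1 with the line from another file that has the same first column and the largest number in the third column across all files.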

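Only first-column values present in the main file should be considered (test## entries that exist in the other files but not in the main file should be ignored).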

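When several lines are found in the other files (with the same, larger number in the third column), any one of them may be used to update the main file.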

Here is my less-than-optimal solution:

$ awk -F: '{print $1,$3}' main|while read a b;do grep ^${a}: main file*|sort -t":" -rnk4|awk -F: -vb=$b '{if($4>b){print $0;next} else {print ($1=="main")? $0 : NULL}}'|head -1;done
file3:test01:10957:8172:2472
file3:test02:1401:6160:5894
file3:test03:7245:8934:5725
file2:test04:3737:10175:5219
file3:test05:10769:10381:1102
file3:test06:3605:3713:7695
main:test07:1445:2850:2755
file2:test08:4707:9047:10578
file3:test09:2913:5628:1305

How can I process all such files in a single awk invocation and get the job done without the while loop and the many pipes in the command?

Update: @RomanPerekhrest, thank you for the excellent code. How can I add an :updated suffix to all lines that come from the other files? I would like something like:

test01:10957:8172:2472:updated
test02:1401:6160:5894:updated
test03:7245:8934:5725:updated
test04:3737:10175:5219:updated
test05:10769:10381:1102:updated
test06:3605:3713:7695:updated
test07:1445:2850:2755
test08:4707:9047:10578:updated
test09:2913:5628:1305:updated

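Update: I have a new case that I did not anticipate: another file has a larger value in $3 but also a non-numeric value in column $2. In that case, such a line should be ignored (despite the larger $3) because of the invalid $2 value.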

To demonstrate this case using the sample files above, I replaced the second column of the "test09" line in file2 with "xxxxx", and now I have:

$ grep test09 *
file2:test09:xxxxx:2867:8233
file3:test09:2913:5628:1305
main:test09:2575:1884:1611
$ awk -F':' 'FILENAME != "main"{ if ($2~/^[0-9]+/&&(!($1 in a) || ($3 > a[$1]))) { a[$1]=$3; b[$1]=$0 } next }{ if (($1 in a) && (a[$1] > $3)){ print b[$1]":updated"; delete b[$1] } else print  }' file* main
test01:10957:8172:2472:updated
test02:1401:6160:5894:updated
test03:7245:8934:5725:updated
test04:3737:10175:5219:updated
test05:10769:10381:1102:updated
test06:3605:3713:7695:updated
test07:1445:2850:2755
test08:4707:9047:10578:updated
test09:2913:5628:1305:updated <- this is now update from file3

Next, I also changed the $2 value of the "test09" line in file3 to a non-numeric value:

$ grep test09 *
file2:test09:xxxxx:2867:8233
file3:test09:zzzzz:5628:1305
main:test09:2575:1884:1611
$ awk -F':' 'FILENAME != "main"{ if ($2~/^[0-9]+/&&(!($1 in a) || ($3 > a[$1]))) { a[$1]=$3; b[$1]=$0 } next }{ if (($1 in a) && (a[$1] > $3)){ print b[$1]":updated"; delete b[$1] } else print  }' file* main
test01:10957:8172:2472:updated
test02:1401:6160:5894:updated
test03:7245:8934:5725:updated
test04:3737:10175:5219:updated
test05:10769:10381:1102:updated
test06:3605:3713:7695:updated
test07:1445:2850:2755
test08:4707:9047:10578:updated
test09:2575:1884:1611 <-- this is now from the main file

Although it seems to work correctly, could someone please explain the second "if" in the code? Does it also need the condition $2~/^[0-9]+/?

{ if (($1 in a) && (a[$1] > $3))

1 Answer


RomanPerekhrest · Answer 1 · 2018-03-25T12:25:02+08:00

An optimized awk solution, roughly 27 times faster:

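awk -F':' 'FILENAME != "main"{
               # Other files: remember, per key ($1), the row with the largest $3 seen so far.
               if (!($1 in a) || $3 > a[$1]) { a[$1] = $3; b[$1] = $0 } next;
           }
           {
               # main: print the stored row if its $3 is larger, otherwise the original row.
               if (($1 in a) && (a[$1] > $3)){ print b[$1]; delete b[$1] }
               else print;
           }' file* main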

Output:

test01:10957:8172:2472
test02:1401:6160:5894
test03:7245:8934:5725
test04:3737:10175:5219
test05:10769:10381:1102
test06:3605:3713:7695
test07:1445:2850:2755
test08:4707:9047:10578
test09:2913:5628:1305
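
To reproduce the output above, the following sketch recreates the question's three sample files in a scratch directory and runs the same one-pass logic. The file names main, file2 and file3 are taken from the question; the scratch directory, the result variable and the `+0` (which forces a numeric comparison on any awk implementation) are assumptions added here for illustration.

```shell
#!/bin/sh
# Scratch directory for the sample files (illustrative, not part of the answer).
workdir=$(mktemp -d)

cat > "$workdir/main" <<'EOF'
test01:6733:4370:5342
test02:7776:2018:1001
test03:9865:5632:1429
test04:8477:4757:1890
test05:8019:8860:5298
test06:5602:3100:6995
test07:1445:2850:2755
test08:10924:2562:4867
test09:2575:1884:1611
EOF

cat > "$workdir/file2" <<'EOF'
test01:8777:1060:9236
test02:1322:1211:10837
test04:3737:10175:5219
test05:8467:8988:9739
test06:7452:3100:2709
test08:4707:9047:10578
test09:8669:2867:8233
test10:8615:10002:7056
EOF

cat > "$workdir/file3" <<'EOF'
test01:10957:8172:2472
test02:1401:6160:5894
test03:7245:8934:5725
test04:8477:10106:10069
test05:10769:10381:1102
test06:3605:3713:7695
test08:10924:2562:10568
test09:2913:5628:1305
test10:5501:10293:2319
EOF

# The other files are read first, main last, so every candidate row is known
# by the time main is processed.
result=$(cd "$workdir" && awk -F':' '
FILENAME != "main" {
    # Keep the row with the largest $3 per key; on ties the first row seen wins.
    if (!($1 in a) || $3+0 > a[$1]) { a[$1] = $3+0; b[$1] = $0 }
    next
}
{
    # Replace the main row only when a strictly larger $3 was found elsewhere.
    if (($1 in a) && a[$1] > $3+0) print b[$1]
    else print
}' file* main)

printf '%s\n' "$result"
rm -rf "$workdir"
```

Because main is listed last on the command line, keys that exist only in the other files (test10 here) are stored but never printed.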

Execution time comparison:

$ time(awk -F: '{print $1,$3}' main |while read a b; do grep ^${a}: main file* | sort -t":" -rnk4 | awk -F':' -vb=$b '{if($4>b){print $0;next} else {print ($1=="main")? $0 : NULL}}' | head -1; done > /dev/null)

real    0m0.111s
user    0m0.004s
sys     0m0.012s

$ time(awk -F':' 'FILENAME != "main"{ if (!($1 in a) || $3 > a[$1]) { a[$1]=$3; b[$1]=$0 } next }{ if (($1 in a) && (a[$1] > $3)){ print b[$1]; delete b[$1] } else print  }' file* main > /dev/null)

real    0m0.004s
user    0m0.000s
sys     0m0.000s
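
On the follow-up about the second "if": as far as I can tell it does not need the $2 test. It only looks at the array a, and a row from the other files gets into a only after passing the $2 check in the first block, so a rejected row can never replace a main row. The sketch below is one possible variant, not the original answer's code. It uses a trimmed copy of the question's data (the test07 to test09 rows, with $2 of test09 made non-numeric in both file2 and file3). The `$` anchor in the pattern is an assumption about the intent: the question's /^[0-9]+/ accepts any value that merely starts with a digit, while /^[0-9]+$/ requires the whole field to be digits. The `+0` and the scratch directory are also additions made here.

```shell
#!/bin/sh
# Scratch directory with a trimmed copy of the sample data (illustrative).
workdir=$(mktemp -d)

cat > "$workdir/main" <<'EOF'
test07:1445:2850:2755
test08:10924:2562:4867
test09:2575:1884:1611
EOF

cat > "$workdir/file2" <<'EOF'
test08:4707:9047:10578
test09:xxxxx:2867:8233
EOF

cat > "$workdir/file3" <<'EOF'
test08:10924:2562:10568
test09:zzzzz:5628:1305
EOF

result=$(cd "$workdir" && awk -F':' '
FILENAME != "main" {
    # A row is stored only if $2 consists entirely of digits,
    # so the arrays a and b never contain a rejected row.
    if ($2 ~ /^[0-9]+$/ && (!($1 in a) || $3+0 > a[$1])) {
        a[$1] = $3+0
        b[$1] = $0
    }
    next
}
{
    # No $2 test is needed here: a[$1] exists only for validated rows.
    if (($1 in a) && a[$1] > $3+0) print b[$1] ":updated"
    else print
}' file* main)

printf '%s\n' "$result"
rm -rf "$workdir"
```

Both test09 candidates are rejected, so the main row is printed unchanged, while test08 still picks up the file2 row with the :updated suffix. If the rows of main itself also had to be validated, that would be a separate check in the second block.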
