I'm using Arch Linux/Debian Linux and want a list of the unique "identifiers" in an ASCII text file. Here is a snippet of the data I want to reduce:
... (Received from VRW): wind ...
... (Received from 1a00): air_ ...
... (Received from 5710): air_ ...
... (Received from ####): air_ ...
... (Received from 15d8): air_ ...
... (Received from ####): air_ ...
... (Received from 6e9e): baro ...
... (Received from 6e9e): volt ...
... (Received from 6e9e): wind ...
... (Received from 6e9e): air_ ...
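Because the file is big and contains many repeated "identifiers", I would like to output only the unique ones, so that the output looks like this: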
... (Received from VRW): wind ...
... (Received from 1a00): air_ ...
... (Received from 5710): air_ ...
... (Received from ####): air_ ...
... (Received from 15d8): air_ ...
... (Received from 6e9e): baro ...
Even better would be to simply list the unique identifiers, for example VRW, 15d8, 6e9e and so on. But I think that would be a lot harder.
Following suggestions from similar questions, I tried:
grep "(Received from" datafile.txt
and got a huge list of identifiers, most of which were duplicates.
I also tried:
grep "(Received from" datafile.txt | sort -u
but I can't tell whether that made any difference.
I also tried:
parallel --tag --lb grep "Received from" {} | perl -ne '$seen{$_}++ or print;' ::: Data1.txt
which probably shows the extent of my ignorance in these matters.
With awk (adapt $4 to the correct column):
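One way this might look, as a minimal sketch. It assumes the goal is to keep the first full line for each identifier, and that the identifier sits in whitespace-separated field 4, which holds for the snippet in the question only if each "..." is a single token. The function name first_line_per_field is made up for illustration.

```shell
# Hypothetical helper: print a line only the first time its 4th field is seen.
# seen[$4]++ evaluates to 0 the first time a value appears, so !seen[$4]++
# is true exactly once per distinct value of that field.
# Adapt $4 to whichever column holds the identifier in the real file.
first_line_per_field() {
  awk '!seen[$4]++' "$@"
}

# Usage, with the file name from the question:
#   first_line_per_field datafile.txt
```

Note that the key here is the whole field, punctuation included (for example "VRW):"), which is fine for deduplication but means the output is still full lines rather than bare identifiers.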
Or, using the match() extension of the GNU implementation of awk and a regex:
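A possible sketch of the regex variant, assuming gawk is installed (the three-argument form of match() is a GNU extension and is not available in mawk or other awks). The regex and the function name gawk_first_line_per_id are illustrative assumptions.

```shell
# Hypothetical helper; requires gawk.
# match($0, regex, m) stores the text captured by the first ( ) group in m[1].
# The line is printed only when it matches AND that captured identifier
# has not been seen before, so the first line per identifier is kept.
gawk_first_line_per_id() {
  gawk 'match($0, /\(Received from ([^)]+)\)/, m) && !seen[m[1]]++' "$@"
}

# Usage, with the file name from the question:
#   gawk_first_line_per_id datafile.txt
```

Unlike the field-based version, this does not depend on how many whitespace-separated tokens precede the identifier.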
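With grep implementations that support -o and perl regexes (like GNU grep when built with PCRE(2) support) and sort:
Contrary to the others, this one extracts all the matches from each line.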
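A sketch of how such a pipeline could look, assuming a grep with -P support. The regex, the function name list_ids, and the LC_ALL=C setting (added only to make the sort order predictable) are assumptions rather than part of the answer.

```shell
# Hypothetical helper: list every distinct identifier once.
# -o prints only the matched text, one match per output line.
# -P enables perl-compatible regexes; \K discards what was matched so far,
# so only the text after "(Received from " up to the next ")" is printed.
list_ids() {
  grep -oP '\(Received from \K[^)]+' "$@" | LC_ALL=C sort -u
}

# Usage, with the file name from the question:
#   list_ids datafile.txt
```

Because -o reports every match, a line holding two "(Received from ...)" groups contributes both identifiers to the list.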
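In perl, it can be done like this:
We print a line if it matches the regex and what is matched by the first pair of () has not been seen already.

I think it is quite the opposite. If you extract only the identifiers, it is then easy to create such a list with sort -u. Processing whole lines and deciding, from a fragment of each, whether the current line should be omitted does not seem hard either. Here we use sed to extract the identifiers by replacing each whole line with its identifier:
Notes: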
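If a line contains more than one (Received from …):, only the first identifier is extracted.
Identifiers that contain the closing delimiter are not supported.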
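The perl filter and the sed extraction described above could be sketched side by side as follows. Both regexes and both function names are assumptions made for illustration. The sed pattern additionally assumes that no "(" appears on a line before the first "(Received from", and it treats ")" as the character that ends an identifier.

```shell
# Hypothetical helper (perl): keep the first full line for each identifier.
# $1 is what the first pair of ( ) matched; the line is printed only if the
# regex matches and that identifier has not been seen already.
perl_first_line_per_id() {
  perl -ne 'print if /\(Received from ([^)]*)\)/ && !$seen{$1}++' "$@"
}

# Hypothetical helper (sed): list the bare identifiers, each once.
# In basic regexes a plain ( is literal and \( \) form a group, so the
# substitution replaces the whole line with the first identifier.
# -n plus the p flag prints only lines where the substitution happened.
# LC_ALL=C is added only to make the sort order predictable.
sed_unique_ids() {
  sed -n 's/^[^(]*(Received from \([^)]*\)).*/\1/p' "$@" | LC_ALL=C sort -u
}

# Usage, with the file name from the question:
#   perl_first_line_per_id datafile.txt
#   sed_unique_ids datafile.txt
```

The first helper produces the reduced set of full lines, the second the plain identifier list; on a line with several "(Received from ...)" groups, both look only at the first one.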