Question

mindlessgreen

Asked: 2018-01-21 02:19:10 +0800 CST

Find the intersection of lines in two files [duplicate]

772

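If I have two single-column files, one like this (file1)

and a second file (file2)

how do I find the elements common to both files (the intersection)? The expected output for this example is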

67
102

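Note that the number of items (lines) differs between the files. Numbers and strings can be mixed. The items are not necessarily sorted. Each item appears only once.

Update:

Timing checks based on some of the answers below.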

# generate some data
>shuf -n2000000 -i1-2352452 > file1
>shuf -n2000000 -i1-2352452 > file2

#@ilkkachu
>time (join <(sort "file1") <(sort "file2") > out1)
real    0m15.391s
user    0m14.896s
sys     0m0.205s

>head out1
1
10
100
1000
1000001

#@Hauke
>time (grep -Fxf "file1" "file2" > out2)
real    0m7.652s
user    0m7.131s
sys     0m0.316s

>head out2
1047867
872652
1370463
189072
1807745

#@Roman
>time (comm -12 <(sort "file1") <(sort "file2") > out3)
real    0m13.533s
user    0m13.140s
sys     0m0.195s

>head out3
1
10
100
1000
1000001

#@ilkkachu
>time (awk 'NR==FNR { lines[$0]=1; next } $0 in lines' "file1" "file2" > out4)
real    0m4.587s
user    0m4.262s
sys     0m0.195s

>head out4
1047867
872652
1370463
189072
1807745

#@Cyrus   
>time (sort file1 file2 | uniq -d > out8)
real    0m16.106s
user    0m15.629s
sys     0m0.225s

>head out8
1
10
100
1000
1000001


#@Sundeep
>time (awk 'BEGIN{while( (getline k < "file1")>0 ){a[k]}} $0 in a' file2 > out5)
real    0m4.213s
user    0m3.936s
sys     0m0.179s

>head out5
1047867
872652
1370463
189072
1807745

#@Sundeep
>time (perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <file1 file2 > out6)
real    0m3.467s
user    0m3.180s
sys     0m0.175s

>head out6
1047867
872652
1370463
189072
1807745

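The perl version was the fastest, with awk close behind. All output files have the same number of lines.

For comparison, I sorted the outputs numerically so that they are identical.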

#@ilkkachu
>time (join <(sort "file1") <(sort "file2") | sort -k1n > out1)
real    0m17.953s
user    0m5.306s
sys     0m0.138s

#@Hauke
>time (grep -Fxf "file1" "file2" | sort -k1n > out2)
real    0m12.477s
user    0m11.725s
sys     0m0.419s

#@Roman
>time (comm -12 <(sort "file1") <(sort "file2") | sort -k1n > out3)
real    0m16.273s
user    0m3.572s
sys     0m0.102s

#@ilkkachu
>time (awk 'NR==FNR { lines[$0]=1; next } $0 in lines' "file1" "file2" | sort -k1n > out4)
real    0m8.732s
user    0m8.320s
sys     0m0.261s

#@Cyrus   
>time (sort file1 file2 | uniq -d > out8)
real    0m19.382s
user    0m18.726s
sys     0m0.295s

#@Sundeep
>time (awk 'BEGIN{while( (getline k < "file1")>0 ){a[k]}} $0 in a' file2 | sort -k1n > out5)
real    0m8.758s
user    0m8.315s
sys     0m0.255s

#@Sundeep
>time (perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <file1 file2 | sort -k1n > out6)
real    0m7.732s
user    0m7.300s
sys     0m0.310s

>head out1
1
2
3
4
5

Now all outputs are identical.
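As a rough sketch of how such an equality claim can be checked mechanically, the snippet below runs two of the methods on a small made-up data set (not the 2,000,000-line files above), sorts both results numerically, and compares them with cmp. The temporary directory, the file names inside it, and the `same` variable are assumptions chosen for illustration.

```shell
# Hypothetical sample data, numeric only, deliberately unsorted.
tmp=$(mktemp -d)
printf '%s\n' 23 67 1000 102 5 > "$tmp/file1"
printf '%s\n' 102 7 67 9 1000 > "$tmp/file2"

# Method 1: awk keeps file2's order, so sort numerically afterwards.
awk 'NR==FNR { lines[$0]=1; next } $0 in lines' "$tmp/file1" "$tmp/file2" |
  sort -k1n > "$tmp/out_awk"

# Method 2: sort | uniq -d yields lexical order, so sort numerically as well.
sort "$tmp/file1" "$tmp/file2" | uniq -d | sort -k1n > "$tmp/out_uniq"

# cmp -s is silent and reports equality through its exit status.
if cmp -s "$tmp/out_awk" "$tmp/out_uniq"; then same=yes; else same=no; fi
echo "identical: $same"
```

On this sample both files end up containing 67, 102 and 1000, in that order.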

5 Answers

RomanPerekhrest · Answer 1 · 2018-01-21T02:34:53+08:00

A simple comm + sort solution:

comm -12 <(sort file1) <(sort file2)

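-12 suppresses columns 1 and 2 (the lines unique to FILE1 and to FILE2, respectively), so only the common lines (those present in both files) are printed.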
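To make the column behaviour concrete, here is a minimal sketch on made-up sample files (the asker's actual file contents are not shown, so the values below are assumptions that merely share 67 and 102). It sorts into temporary files rather than using process substitution so that it also runs in a plain POSIX sh.

```shell
# Hypothetical sample data: mixed numbers and strings, unsorted.
tmp=$(mktemp -d)
printf '%s\n' 23 67 abc 102 5 > "$tmp/file1"
printf '%s\n' 102 xyz 67 9 > "$tmp/file2"

# comm expects both inputs to be sorted the same way.
sort "$tmp/file1" > "$tmp/s1"
sort "$tmp/file2" > "$tmp/s2"

# Without options comm prints three tab-indented columns:
# only in s1, only in s2, and present in both.
comm "$tmp/s1" "$tmp/s2"

# -12 drops columns 1 and 2, leaving only the lines present in both files.
common=$(comm -12 "$tmp/s1" "$tmp/s2")
printf '%s\n' "$common"
```

The result comes out in sort order (102 before 67 here), not in the order of either input file.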

32

ilkkachu · Answer 2 · 2018-01-21T02:30:33+08:00

Best Answer

In awk, this loads the first file entirely into memory:

$ awk 'NR==FNR { lines[$0]=1; next } $0 in lines' file1 file2 
67
102

Or, if you want to keep track of how many times a given line appears:

$ awk 'NR==FNR { lines[$0] += 1; next } lines[$0] {print; lines[$0] -= 1}' file1 file2

join can do this too, but it requires sorted input, so you have to sort the files first, which loses the original order:

$ join <(sort file1) <(sort file2)
102
67
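The two awk forms give the same result when every item is unique, as in the question. The difference only shows up with repeated lines, which the following sketch illustrates on made-up data (the sample values and the variable names `plain` and `counted` are assumptions for illustration).

```shell
# Hypothetical sample data with repeats: 67 occurs twice in file1, three times in file2.
tmp=$(mktemp -d)
printf '%s\n' 67 67 102 5 > "$tmp/file1"
printf '%s\n' 67 67 67 102 9 > "$tmp/file2"

# Membership test: every file2 line that exists in file1 is printed,
# no matter how often it occurs in file1.
plain=$(awk 'NR==FNR { lines[$0]=1; next } $0 in lines' "$tmp/file1" "$tmp/file2")

# Counting variant: each file1 occurrence can be matched only once,
# so a line is printed at most as many times as it appears in file1.
counted=$(awk 'NR==FNR { lines[$0] += 1; next } lines[$0] { print; lines[$0] -= 1 }' \
  "$tmp/file1" "$tmp/file2")

printf 'membership:\n%s\n' "$plain"
printf 'counted:\n%s\n' "$counted"
```

Here the membership form prints 67 three times, while the counting form prints it twice, because file1 only contains it twice.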

14

Hauke Laging · Answer 3 · 2018-01-21T02:28:59+08:00

awk

awk 'NR==FNR { p[NR]=$0; next; }
   { for(val in p) if($0==p[val]) { delete p[val]; print; } }' file1 file2

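This is a good solution because (for large files) it should be the fastest: it avoids printing the same entry more than once and avoids checking an entry again after it has matched.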
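The loop above compares each line of file2 against every entry still stored from file1. The same print-once behaviour can also be sketched with awk's array-key lookup instead of a scan. This is an alternative formulation, not the answer's code, and the sample data and the `once` variable are assumptions for illustration.

```shell
# Hypothetical sample data: 102 occurs twice in file2.
tmp=$(mktemp -d)
printf '%s\n' 23 67 abc 102 5 > "$tmp/file1"
printf '%s\n' 102 xyz 67 102 9 > "$tmp/file2"

# Store file1 lines as array keys, then for each file2 line do a direct
# key lookup. Deleting the key after a match means a repeated file2 line
# is printed only the first time it is seen.
once=$(awk 'NR==FNR { p[$0]; next } $0 in p { print; delete p[$0] }' \
  "$tmp/file1" "$tmp/file2")
printf '%s\n' "$once"
```

The second 102 in file2 is skipped because its key was already removed after the first match.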

grep

grep -Fxf file1 file2

This outputs the same entry multiple times if it occurs multiple times in file2.
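For readers unfamiliar with the flags: -f reads the patterns from a file, -F treats them as fixed strings rather than regular expressions, and -x requires a pattern to match the whole line. The sketch below shows this on made-up data, plus one possible way to drop the repeated output lines; the awk de-duplication step is an addition for illustration, not part of the answer.

```shell
# Hypothetical sample data: file1 holds the patterns, file2 is searched.
tmp=$(mktemp -d)
printf '%s\n' 67 102 a.c > "$tmp/file1"
printf '%s\n' 1067 67 abc 102 67 > "$tmp/file2"

# -x keeps 1067 from matching the pattern 67, and -F keeps the dot in
# "a.c" literal, so "abc" does not match either.
matches=$(grep -Fxf "$tmp/file1" "$tmp/file2")
printf '%s\n' "$matches"

# 67 occurs twice in file2 and is therefore printed twice. Keeping only
# the first occurrence of each line preserves file2's order.
unique=$(grep -Fxf "$tmp/file1" "$tmp/file2" | awk '!seen[$0]++')
printf '%s\n' "$unique"
```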

sort

For fun (should be much slower than grep):

sort -u file1 >t1
sort -u file2 >t2
sort t1 t2 | uniq -d

7

Cyrus · Answer 4 · 2018-01-21T03:18:30+08:00

With GNU uniq:

sort file1 file2 | uniq -d

Output:

102
67
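This pipeline relies on the question's guarantee that each item appears only once per file, because uniq -d reports any line that occurs more than once in the combined input. A small sketch on made-up data (sample values and variable names are assumptions) shows what happens when that guarantee does not hold, and how de-duplicating each file first avoids it.

```shell
# Hypothetical sample data: 67 is repeated inside file1 and absent from file2.
tmp=$(mktemp -d)
printf '%s\n' 67 67 5 > "$tmp/file1"
printf '%s\n' 102 9 > "$tmp/file2"

# The repeated 67 is reported although the two files share no lines.
naive=$(sort "$tmp/file1" "$tmp/file2" | uniq -d)
printf 'naive: %s\n' "$naive"

# Removing duplicates within each file first restores the intended meaning.
sort -u "$tmp/file1" > "$tmp/u1"
sort -u "$tmp/file2" > "$tmp/u2"
safe=$(sort "$tmp/u1" "$tmp/u2" | uniq -d)
printf 'safe: %s\n' "$safe"
```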

3

Sundeep · Answer 5 · 2018-01-21T05:10:44+08:00

A slightly different awk version and an equivalent perl version

Times reported for three consecutive runs

$ # just realized shuf -n2000000 -i1-2352452 can be used too ;)
$ shuf -i1-2352452 | head -n2000000 > f1
$ shuf -i1-2352452 | head -n2000000 > f2

$ time awk 'NR==FNR{a[$1]; next} $0 in a' f1 f2 > t1
real    0m3.322s
real    0m3.094s
real    0m3.029s

$ time awk 'BEGIN{while( (getline k < "f1")>0 ){a[k]}} $0 in a' f2 > t2
real    0m2.731s
real    0m2.777s
real    0m2.801s

$ time perl -ne 'BEGIN{ $h{$_}=1 while <STDIN> } print if $h{$_}' <f1 f2 > t3
real    0m2.643s
real    0m2.690s
real    0m2.630s

$ diff -s t1 t2
Files t1 and t2 are identical
$ diff -s t1 t3
Files t1 and t3 are identical

$ du -h f1 f2 t1
15M f1
15M f2
13M t1

3
