Question

vume

Asked: 2022-07-13 02:02:48 +0800 CST

Is there a tool or script that can quickly find duplicates by comparing only the file size and a small part of the file contents?

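Tools like fdupes are ridiculous overkill when dealing with jpg or h264 compressed files. Two such files having exactly the same file size is already a pretty good indication that they are identical.

If, say, in addition to that, 16 equidistant chunks of 16 bytes are extracted and compared and they are the same as well, that would be plenty of evidence for me to assume that the files are identical. Is there something like that?

(By the way, I am aware that file size alone can be a rather unreliable indicator, since there are options to compress to certain target sizes, like 1MB or 1 CD/DVD. If the same target size is used on many files, it is quite reasonable that some different files will have exactly the same size.)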

7 Answers

A.L · Answer 1 · 2022-07-13T15:21:40+08:00

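czkawka is an open-source tool that finds duplicate files (as well as similar images, videos or music) and presents them through a command-line or graphical interface, with an emphasis on speed. This part of the documentation may interest you:

Faster scanning for a large number of duplicates

By default, a partial hash (a hash of only 2KB of each file) is computed for all files grouped by the same size. Such a hash is usually computed very quickly, especially on an SSD and a fast multicore processor. But when scanning hundreds of thousands or millions of files with an HDD or a slow processor, this step can take a long time.

With the GUI version, hashes are stored in a cache so that later searches for duplicates are much faster.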
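The partial-hash idea quoted above can be sketched in a few lines of shell. This is only an illustration of the technique, not czkawka's implementation: the helper names `prefix_hash` and `prefix_candidates` are made up, MD5 is an arbitrary choice of hash, and the sketch assumes GNU `find`, `head` and `md5sum` and file names without newlines or leading/trailing blanks.

```shell
# Hypothetical helper: MD5 of only the first 2048 bytes of a file.
prefix_hash() {
    head -c 2048 -- "$1" | md5sum | cut -d ' ' -f 1
}

# Hypothetical helper: print "size prefix-hash path" for every file under
# directory $1 whose size is shared with at least one other file.
# Files with a unique size are dropped before anything is read from them.
prefix_candidates() {
    find "$1" -type f -printf '%s %p\n' |
    awk '{ n[$1]++; line[NR] = $0; sz[NR] = $1 }
         END { for (i = 1; i <= NR; i++) if (n[sz[i]] > 1) print line[i] }' |
    while read -r size path; do
        printf '%s %s %s\n' "$size" "$(prefix_hash "$path")" "$path"
    done | sort -k1,1n -k2,2
}
```

Adjacent output lines that agree in the first two columns are only candidates; they still need a full comparison before being treated as duplicates.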

Example:

Create some test files:

We generate random images, then copy `a.jpg` to `b.jpg` in order to have a duplicate.

$ convert -size 1000x1000 plasma:fractal a.jpg
$ cp -v a.jpg b.jpg
'a.jpg' -> 'b.jpg'
$ convert -size 1000x1000 plasma:fractal c.jpg
$ convert -size 1000x1000 plasma:fractal d.jpg
$ ls --size
total 1456
364 a.jpg  364 b.jpg  364 c.jpg  364 d.jpg

Check by size only:

$ linux_czkawka_cli dup --directories /run/shm/test/ --search-method size
Found 2 files in 1 groups with same size(may have different content) which took 361.76 KiB:
Size - 361.76 KiB (370442) - 2 files 
/run/shm/test/b.jpg
/run/shm/test/a.jpg

Check files by their hashes:

$ linux_czkawka_cli dup --directories /run/shm/test/ --search-method hash
Found 2 duplicated files in 1 groups with same content which took 361.76 KiB:
Size - 361.76 KiB (370442) - 2 files 
/run/shm/test/b.jpg
/run/shm/test/a.jpg

Check files by analysing them as images:

$ linux_czkawka_cli image --directories /run/shm/test/
Found 1 images which have similar friends
/run/shm/test/a.jpg - 1000x1000 - 361.76 KiB - Very High
/run/shm/test/b.jpg - 1000x1000 - 361.76 KiB - Very High

Peter Cordes · Answer 2 · 2022-07-13T19:48:38+08:00

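You'd probably want to make sure you do a full compare (or hash) on the first and last 1MiB or so, where metadata can live that might be edited without introducing offsets into the compressed data. Also, the read granularity of storage is typically at least 512 bytes, not 16, so you might as well read that much; the tiny bit of extra CPU time to compare more data is trivial. (Aligned at 512-byte boundaries.)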

(A write sector size of at least 4096B is typical, but a logical sector size of 512 might allow a SATA disk to only send the requested 512B over the wire, if the kernel doesn't widen the request to a full page itself. Which it probably would; the pagecache is managed in whole pages.)
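A minimal sketch of hashing only the two ends of a file, as suggested above. The helper name `ends_hash`, its optional second argument and the use of MD5 are assumptions for illustration; GNU `stat`, `head` and `tail` are assumed. The default of 1048576 bytes (1 MiB) is a multiple of 512, so the leading read starts and ends on a 512-byte boundary, but the trailing read is aligned only when the file size happens to be a multiple of 512.

```shell
# Hypothetical helper: hash the first and last N bytes of a file
# (N defaults to 1 MiB). Files no larger than 2*N are hashed completely.
ends_hash() {
    local file="$1" n="${2:-1048576}"
    local size
    size=$(stat --format='%s' -- "$file") || return 1
    if [ "$size" -le $((2 * n)) ]; then
        # Head and tail would overlap, so hash the whole file instead.
        md5sum < "$file" | cut -d ' ' -f 1
    else
        { head -c "$n" -- "$file"; tail -c "$n" -- "$file"; } |
            md5sum | cut -d ' ' -f 1
    fi
}
```

Two files of equal size with equal `ends_hash` values can still differ in the middle, so this only rules candidates out quickly; it does not replace the full check recommended in this answer.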

Keep in mind that bit-rot is possible, especially if files have been stored on DVD-R or other optical media. I wouldn't delete a "duplicate" without checking for bitwise identical (or at least identical hashes). Ruling out duplicates quickly based on a hash signature of an early part of a file is useful, but you'd still want to do a full check before declaring two files duplicates for most purposes.

If two files are almost the same but have a few bit-differences, use ffmpeg -i foo.mp4 -f null - to find glitches, decoding but doing nothing with the output.

If you do find a bitwise difference but neither file has errors a decoder notices, use

ffmpeg -i foo.mp4 -f framecrc  foo.fcrc

or -f framemd5 to see which frame has a difference that wasn't an invalid h.264 stream. Then seek to there and visually inspect which one is corrupt.

Your method could be good for detecting files that are corrupt (or metadata-edited) copies of each other, something that normal duplicate-finders won't do easily. Comments under the question point out that jdupes can use hashes of the first N megabytes of a file after a size compare, so that's a step in the right direction.

For other use-cases, maybe you'd be ok with less stringent checking, but given that duplicate file finders exist that only compare or hash when there are files of identical size, you can just let one of those run (overnight or while you're going out), and come back to a fully checked list.

Some like fslint have the option to hard-link duplicates to each other (or symlink), so next time you look for duplicates, they'll already be the same file. So in my experience, duplicate file finding is not something where I've felt a need to take a faster but risky approach.

(fslint never got updated for Python 3; apparently czkawka is a modern clone in Rust, according to an askubuntu answer.)

Philippos · Answer 3 · 2022-07-13T04:45:23+08:00

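Would GNU `cmp` help you?

- You can use the `-s` option to suppress output and only use the return value
- It checks the file size first to skip any comparison of files with different sizes
- With the options `-i` (skip initial bytes) and `-n` (number of bytes to compare) you can additionally define the range of bytes that you want to compare

If the number of files is too large to `cmp` each pair of them, you may want to first `sort` all files by their file size and only compare groups with the same size (`uniq -D` with `-w`).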
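The two steps described in this answer might look as follows. The function names `same_range` and `same_size_groups` are hypothetical, and the sketch assumes GNU `cmp`, `find`, `stat` and `uniq`, file sizes of at most 12 digits, and file names without newlines.

```shell
# Hypothetical helper: succeed if both files have the same size and the same
# bytes in the range [offset, offset+length).
same_range() {
    local a="$1" b="$2" offset="$3" length="$4"
    [ "$(stat --format='%s' -- "$a")" -eq "$(stat --format='%s' -- "$b")" ] || return 1
    # -s: no output, -i: skip initial bytes of both files, -n: byte limit
    cmp -s -i "$offset" -n "$length" -- "$a" "$b"
}

# Hypothetical helper: list only files whose size occurs more than once.
# The size is padded to 12 columns so that uniq -w 12 compares exactly it.
same_size_groups() {
    find "$1" -type f -printf '%12s %p\n' | sort -n | uniq -D -w 12
}
```

For example, `same_range a.jpg b.jpg 0 512` compares only the first 512 bytes of two equally sized files, and `cmp -s a.jpg b.jpg` without a range gives the full byte-for-byte verdict.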

sudodus · Answer 4 · 2022-07-13T04:38:07+08:00

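Shellscript implementation of the OP's, @vume's, idea

Background with the example `rsync`

Have a look at `rsync`. It has several levels of checking whether files are identical. The manual `man rsync` is very detailed; there you can identify what I describe, and probably also some other interesting alternatives.

The most rigorous check is to compare each byte, but as you write, that takes a lot of time when there is a lot of data, for example a whole backup.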
```
-c, --checksum              skip based on checksum, not mod-time & size
-a, --archive               archive mode; equals -rlptgoD (no -H,-A,-X)
```
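The standard check is size and other file attributes (for example time stamps). It is often considered good enough.

Your idea, @vume, means something between these two checking levels. I have not seen any tool like that, but I would be very interested in such a tool.

Edit 1: The shellscript `vumer`

The following shellscript `vumer` uses `dd` to do what I think you want, @vume.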

#!/bin/bash

# Number of sample chunks and the size of each chunk in bytes
chunks=16
chnksiz=16

function usage {
 echo "Usage: ${0##*/} <file>"
}

if ! test -f "$1"
then
 usage
 exit 1
fi

size=$(stat --format='%s' "$1")
# Distance between the start positions of consecutive chunks
step=$(( size/(chunks-1) ))
#echo "step=$step"
tmpfil=$(mktemp)

if [ "$size" -lt 512 ]
then
 # Small file: checksum the whole content
 chksum=$(md5sum "$1" | sed 's/ .*//')
else
 for (( i=0;i<chunks;i++ ))
 do
  if [ "$i" -eq $((chunks-1)) ]
  then
   # Last chunk: the final bytes of the file
   pos=$((size-chnksiz))
  else
   pos=$((i*step))
  fi
#  echo "$i: $pos"
  dd if="$1" bs=1 skip="$pos" count="$chnksiz" >> "$tmpfil" 2> /dev/null
 done
 # Checksum of the concatenated sample chunks
 chksum=$(md5sum "$tmpfil" | sed 's/ .*//')
fi

modif=$(stat --format='%y' "$1")
modif=${modif//\ /_}
echo "size=$size modif=$modif checksum=$chksum file=\"$1\""
#less "$tmpfil"
rm "$tmpfil"

On my Lenovo C30 workstation (old but powerful) I tested `vumer` with the Ubuntu Desktop 22.04 LTS iso file and compared the time used with that of `md5sum`,

$ time vumer ubuntu-22.04-desktop-amd64.iso
size=3654957056 modif=2022-04-19_10:25:02.000000000_+0200 checksum=ec3483153bfb965745753c4b1b92bf2e file="ubuntu-22.04-desktop-amd64.iso"

real    0m0,024s
user    0m0,018s
sys 0m0,008s

$ time md5sum ubuntu-22.04-desktop-amd64.iso
7621da10af45a031ea9a0d1d7fea9643  ubuntu-22.04-desktop-amd64.iso

real    0m19,919s
user    0m5,331s
sys 0m0,988s

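So for big files it is really much faster than `md5sum`, which today is considered a [too] simple checksum tool. `sha256sum` is even slower.

I also checked a Debian iso file that had been modified to replace a couple of boot options, `quiet splash`, with `persistence`, and compared it with the original file. `vumer` had bad luck and did not sample any of the few modified locations, so here we must fall back on the classic timestamp to tell the files apart. Of course `md5sum` can tell the difference.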

$ vumer debian-live-10.0.0-amd64-standard.iso
size=865075200 modif=2019-09-05_10:14:31.000000000_+0200 checksum=b8af03a946fb400ca66f2bdb2a6bb628 file="debian-live-10.0.0-amd64-standard.iso"

$ vumer persistent-debian-live-10.0.0-amd64-standard.iso
size=865075200 modif=2019-09-12_18:01:55.000000000_+0200 checksum=b8af03a946fb400ca66f2bdb2a6bb628 file="persistent-debian-live-10.0.0-amd64-standard.iso"

$ md5sum  *debian-live-10.0.0-amd64-standard.iso
a64ae643520ca0edcbcc769bae9498f3  debian-live-10.0.0-amd64-standard.iso
574ac1f29a6c86d16353a54f6aa8ea1c  persistent-debian-live-10.0.0-amd64-standard.iso

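So whether `vumer` and similar tools are useful depends on what kind of files you have and how they might be modified.

Edit 2: A "oneliner" to scan a directory tree

This is a "oneliner" to scan a directory tree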

md5sum $(for i in $(find -type f -ls|sed 's/^ *//'|tr -s ' ' '\t     '| \
cut -f 7,11|sort -n|sed 's/\t/\t     /'|uniq -Dw10|sed 's/\t */\t/'| \
cut -f 2 );do vumer "$i";done|sed -e 's/.*checksum=//'|sort|rev| \
uniq -f 1 -D|rev|tee /dev/stderr|sed -e 's/.*file=//' -e 's/"//g')|uniq -Dw32

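- There were 418 files in the directory tree (iso files and some other files)
- Checking the size identified 72 files (potentially 36 pairs)
- `vumer` identified 30 files (15 pairs) with the same vumer checksum
- `md5sum` identified 18 files (9 pairs) with the same md5sum checksum

This means that `vumer` saved a lot of time; `md5sum` needed to check only 30 of the 418 files.

Edit 3: The shellscript `scan4dblt`

I replaced the "oneliner" with a script, `scan4dblt`, that I also tested in a few directory trees, and made some edits to the "doer" script `vumer`.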

#!/bin/bash

function mkfil {

tmpf0=$(mktemp)
tmpf1=$(mktemp)
tmpf2=$(mktemp)
tmpf3=$(mktemp)
tmpf4=$(mktemp)
tmpf5=$(mktemp)
}

function rmfil {

rm "$tmpf0" "$tmpf1" "$tmpf2" "$tmpf3" "$tmpf4" "$tmpf5"
}

# main

echo -n "${0##*/}: "
mkfil

# same size

find -type f -printf "%s %p\n" \
|sed 's/ /        /'|sort -n|uniq -Dw11|tee "$tmpf0" |tr -s ' ' ' ' \
|cut -d ' ' -f 2- |sed -e 's/&/\&/g' -e 's/(/\(/g' -e 's/)/\)/g' > "$tmpf1"

res=$(wc -l "$tmpf0"|sed 's/ .*//')
if [ $res -gt 1 ]
then
 echo "same size ($res):"
 cat "$tmpf0"
else
 echo "no files with the same size"
 rmfil
 exit 1
fi

# vumer

while read fnam
do
 echo -n '.'
 vumer "$fnam" >> "$tmpf2"
# echo -e "$fnam: $(vumer "$fnam")" >> "$tmpf2"
done < "$tmpf1"
echo ''
sed "$tmpf2" -e 's/.*checksum=//'|sort|uniq -Dw33|tee "$tmpf1" |sed -e 's/.*file=//' -e 's/"//g' > "$tmpf3"

res=$(wc -l "$tmpf1"|sed 's/ .*//')
if [ $res -gt 1 ]
then
 echo "vumer: ($res)"
 cat "$tmpf1"
else
 echo "vumer: no files with the same checksum"
 rmfil
 exit 1
fi

# inode

while read fnam
do
 echo -n '.'
 size=$(stat --format='%s' "$fnam")
 inod=$(stat --format='%i' "$fnam")
 printf "%12d %10d %s\n" $size $inod "$fnam" >> "$tmpf4"
done < "$tmpf3"
echo ''
#cat "$tmpf4"
#cat "$tmpf4" |sort -k3|uniq -f2 -Dw32 |sort -n > "$tmpf5"
cat "$tmpf4" |sort -k2|uniq -f1 -Dw11 |sort -n > "$tmpf5"

> "$tmpf2"
while read fnam
do
 if ! grep -qF -- "$fnam" "$tmpf5"
 then
  echo "$fnam" >> "$tmpf2"
 fi
done < "$tmpf3"

res=$(wc -l "$tmpf5"|sed 's/ .*//')
if [ $res -gt 1 ]
then
 echo "inode: ($res)"
 echo "        size     inode              md5sum                 file-name"
 cat "$tmpf5"
else
 echo "inode: no files with the same inode"
fi

# md5sum

> "$tmpf4"
while read fnam
do
 echo -n '.'
 size=$(stat --format='%s' "$fnam")
 inod=$(stat --format='%i' "$fnam")
 printf "%12d %10d " $size $inod >> "$tmpf4"
 md5sum "$fnam" >> "$tmpf4"
done < "$tmpf2"
echo ''
#echo "4: ";cat "$tmpf4"
cat "$tmpf4" |sort -k3|uniq -f2 -Dw33 |sort -n > "$tmpf5"

res=$(wc -l "$tmpf5"|sed 's/ .*//')
if [ $res -gt 1 ]
then
 echo "md5sum: ($res)"
 echo "        size     inode              md5sum                 file-name"
 cat "$tmpf5"
else
 echo "md5sum: no files with the same checksum"
 rmfil
 exit 1
fi

rmfil

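Edit 4: Improved shellscript `scan4dblt` plus example (output file)

The shellscript `scan4dblt` has been developed further and tested with a few directory trees including big iso files, pictures, video clips and documents. Several bugs were fixed (and the current version replaces the original one here).

Example:

The following example shows the output file produced by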

scan4dblt | tee /tmp/scan4dblt.out

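Pastebin with the result file: scan4dblt.out

Even though only a small part of the files are fully checked by `md5sum`, the full check uses most of the execution time. The proportion of the time used by `md5sum` will depend on the file sizes.

Particularly when there are lots of relatively small files, this implementation via shellscripts will be inefficient; a compiled program would be much better. But with huge files, for example iso files and video clips, shellscripts might do a good job.

Edit 5: Additional comment about the shellscripts

If I were to do this exercise again, I would start by saving doublets that are due to hard links separately, and keep one remaining [hardlinked] file in the list in order to check whether it matches another file in later comparisons.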
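One possible way to set hard-linked doublets aside first is sketched here. It is not part of `scan4dblt`; the helper names `hardlink_groups` and `one_per_inode` are hypothetical, and the sketch assumes GNU `find` and file names without tabs or newlines.

```shell
# Hypothetical helper: list "device:inode<TAB>path" for files under $1 that
# share an inode with at least one other file in the same tree.
hardlink_groups() {
    find "$1" -type f -links +1 -printf '%D:%i\t%p\n' | sort |
    awk -F '\t' '{ n[$1]++; line[NR] = $0; key[NR] = $1 }
        END { for (i = 1; i <= NR; i++) if (n[key[i]] > 1) print line[i] }'
}

# Hypothetical helper: print one representative path per inode, so that
# hard-linked doublets are neither sampled nor checksummed more than once.
one_per_inode() {
    find "$1" -type f -printf '%D:%i\t%p\n' | sort |
    awk -F '\t' '!seen[$1]++ { print $2 }'
}
```

The output of `one_per_inode` could then feed the size, sample and checksum stages, while `hardlink_groups` reports the doublets that are already the same file on disk.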

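It would also be interesting to test how big the chunks of data should be in order for [the tool called `vumer` here] to do a good job. This probably has to be tailored to the kind of files to be checked for duplicates.

I would also test from which file size the intermediate check [by `vumer`] is useful.

Final(?) comments

I am glad to notice how much attention this question has received, both in answers and in comments. As Peter Cordes writes in his answer (as well as in comments), the quick testing tool (in my case `vumer`) can be improved in several ways depending on the kind of files it is made to test.

In my answer I have only implemented the original idea of @vume, and I can show that it is good enough in many cases when combined with other quick sorting methods in order to minimize the need for full checksum testing.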

jpa · Answer 5 · 2022-07-14T09:35:06+08:00

There is a tool called imosum that works similarly to e.g. sha256sum, but it only uses three 16 kB blocks. The samples are taken from the beginning, middle and end of the file, and the file size is also included in the hash.

Example usage for finding duplicates:

pip install imohash
find .../your_path -type f -exec imosum {} + > /tmp/hashes
sort /tmp/hashes | uniq -w 32 --all-repeated=separate

Output will have groups of duplicate files:

e8e7d502a5407e75dc1b856024f2c9aa  path/to/file1
e8e7d502a5407e75dc1b856024f2c9aa  other/path/to/duplicate1

e9c83ccdc726d0ec83c55c72ea151d48  path/to/file2
e9c83ccdc726d0ec83c55c72ea151d48  other/path/to/duplicate2

On my SSD, this took about 10 seconds to process 72 GB worth of digital photos (10k files).
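To make the sampling idea concrete, here is a rough shell imitation of it. It is not imohash's algorithm and does not produce imosum-compatible hashes: the function name `three_block_hash`, the use of MD5, and the rule of hashing small files completely are all assumptions for illustration. GNU `stat`, `head`, `tail` and `dd` (with `iflag=skip_bytes`) are assumed.

```shell
# Hypothetical helper: hash the file size plus three blocks of BS bytes
# (default 16384) taken from the beginning, middle and end of the file.
three_block_hash() {
    local file="$1" bs="${2:-16384}"
    local size mid
    size=$(stat --format='%s' -- "$file") || return 1
    if [ "$size" -lt $((3 * bs)) ]; then
        # Too small for three separate blocks: hash size plus whole content.
        { echo "$size"; cat -- "$file"; } | md5sum | cut -d ' ' -f 1
        return
    fi
    mid=$(( (size - bs) / 2 ))   # byte offset where the middle block starts
    {
        echo "$size"
        head -c "$bs" -- "$file"
        dd if="$file" bs="$bs" skip="$mid" count=1 \
           iflag=skip_bytes,fullblock 2>/dev/null
        tail -c "$bs" -- "$file"
    } | md5sum | cut -d ' ' -f 1
}
```

A change that falls between the sampled blocks goes unnoticed, which is the trade-off that makes this kind of hash fast.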

Romeo Ninov · Answer 6 · 2022-07-13T03:34:29+08:00

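When it comes to comparing files, my usual tool is a hash. For example:

sha1sum -- * |sort >output_file

This creates the hashes and sorts them, so that you can see the duplicates in the file.

It gives much higher confidence that the files are identical than comparing only the first few bytes.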
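As a small extension of the command above, `uniq` can hide everything except the repeated hashes. The function name `dup_by_sha1` is hypothetical; the sketch assumes GNU `find`, `sha1sum` and `uniq`, and relies on a SHA-1 digest being 40 hexadecimal characters wide.

```shell
# Hypothetical helper: print only files under $1 whose SHA-1 digest occurs
# more than once, with a blank line between groups of identical digests.
dup_by_sha1() {
    find "$1" -type f -exec sha1sum -- {} + | sort |
        uniq -w 40 --all-repeated=separate
}
```

Unlike the sampling approaches in the other answers, this reads every file completely, so it is slow on large trees but leaves only hash collisions as a source of false positives.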

martlin · Answer 7 · 2022-07-13T14:41:20+08:00

As the author of the disketo tool, I can recommend it: https://github.com/martlin2cz/disketo

Clone it, and then run:

$ cd disketo
$ bin/run.pl scripts/files-with-duplicities.ds YOUR_DIRECTORY

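It will output, on each line, the path of a file that has at least one duplicate (a file with the same name), followed by the paths of all those duplicates (separated by TAB).

You can customise the search. Instead of the pre-installed "files-with-duplicities.ds", provide a custom disketo script. To compare not only file names but also sizes, use this ds file: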

filter files having at-least-one of-the-same name-and-size
print files with files-of-the-same name-and-size

If you wish to compare based on something else (i.e. some 16-byte chunks of the file contents), use a custom sub:

group files by-custom-groupper sub {
        my ($file, $context) = @_;
        # TODO implement properly
        use File::Slurp;
        return read_file($file);
} as-meta "same-file-contents"
filter files having at-least-one of-the-same custom-group "same-file-contents"

print files with files-of-the-same custom-group "same-file-contents"

Or open an issue, and I could add it.
