Accepted
ExNASATerry
Asked: 2016-02-08 23:23:26 +0800 CST

How do I find only duplicate files that have different names?

  • 772

FSlint can find duplicate files. But suppose you have 10,000 songs or images and want to find only the files that are identical but have different names? Right now I get a list with hundreds of dupes (in different folders). I want the names to be consistent, so I only want to see identical files that have different names, not identical files with the same name.

Can FSlint do this with advanced parameters, or can some other program?

duplicate-files
  • 3 Answers
  • 5346 Views

3 Answers

  1. Byte Commander
    2016-02-08T23:56:42+08:00

    If it's fine for the script to print all duplicate files, with equal as well as different file names, you can use the following command line:

    find . -type f -exec sha256sum {} \; | sort | uniq -w64 --all-repeated=separate | cut -b 67-
    
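    To make the pipeline easier to follow, here is the same command broken into its stages with comments. The widths follow from sha256sum's output format: a 64-character hex digest, two spaces, then the file name, so the name starts at byte 67:

    find . -type f -exec sha256sum {} \; |  # hash every regular file
        sort |                               # sort so equal digests become adjacent
        uniq -w64 --all-repeated=separate |  # compare only the first 64 chars (the digest)
        cut -b 67-                           # strip the digest and two spaces, keep the path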

    For the example run, I use the following directory structure. Files with similar names (and a different number) have equal content:

    .
    ├── dir1
    │   ├── uname1
    │   └── uname3
    ├── grps
    ├── lsbrelease
    ├── lsbrelease2
    ├── uname1
    └── uname2
    

    Now let's watch our command do some magic:

    $ find . -type f -exec sha256sum {} \; | sort | uniq -w64 --all-repeated=separate | cut -b 67-
    ./lsbrelease
    ./lsbrelease2
    
    ./dir1/uname1
    ./dir1/uname3
    ./uname1
    ./uname2
    

    Each group, separated by a blank line, consists of files with identical content. Files that have no duplicate are not listed.
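    The column widths are tied to sha256sum. If you swap in another hash command, adjust them to its digest length; for example, with md5sum (32-character digest, so the name starts at byte 35) the same idea would look like this:

    find . -type f -exec md5sum {} \; | sort | uniq -w32 --all-repeated=separate | cut -b 35-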

    • 9
  2. Best Answer
    Byte Commander
    2016-02-09T03:39:41+08:00

    I have another solution for you, much more flexible and easier to use!

    Copy the script below and paste it to /usr/local/bin/dupe-check (or any other location and file name; you need root permissions for this one).
    Make it executable by running:

    sudo chmod +x /usr/local/bin/dupe-check
    

    As /usr/local/bin is in every user's PATH, everyone can now run it directly without having to specify its location.

    First, you should have a look at my script's help page:

    $ dupe-check --help
    usage: dupe-check [-h] [-s COMMAND] [-r MAXDEPTH] [-e | -d] [-0]
                      [-v | -q | -Q] [-g] [-p] [-V]
                      [directory]
    
    Check for duplicate files
    
    positional arguments:
      directory             the directory to examine recursively (default '.')
    
    optional arguments:
      -h, --help            show this help message and exit
      -s COMMAND, --hashsum COMMAND
                            external system command to generate hashes (default
                            'sha256sum')
      -r MAXDEPTH, --recursion-depth MAXDEPTH
                            the number of subdirectory levels to process: 0=only
                            current directory, 1=max. 1st subdirectory level, ...
                            (default: infinite)
      -e, --equal-names     only list duplicates with equal file names
      -d, --different-names
                            only list duplicates with different file names
      -0, --no-zero         do not list 0-byte files
      -v, --verbose         print hash and name of each examined file
      -q, --quiet           suppress status output on stderr
      -Q, --list-only       only list the duplicate files, no summary etc.
      -g, --no-groups       do not group equal duplicates
      -p, --path-only       only print the full path in the results list,
                            otherwise format output like this: `'FILENAME'
                            (FULL_PATH)´
      -V, --version         show program's version number and exit
    

    You see: to get a list of all files with different file names in the current directory (and all subdirectories), you need the -d flag plus any valid combination of formatting options.
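    For instance, to feed the result into another tool, the formatting flags can be combined into a machine-friendly listing (a sketch; the directory argument defaults to '.'):

    # different-named duplicates as bare paths: no summary (-Q),
    # no blank group separators (-g), full path only (-p)
    dupe-check -d -p -Q -g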

    We still assume the same testing environment. Files with similar names (and a different number) have equal content:

    .
    ├── dir1
    │   ├── uname1
    │   └── uname3
    ├── grps
    ├── lsbrelease
    ├── lsbrelease2
    ├── uname1
    └── uname2
    

    So we simply run:

    $ dupe-check
    Checked 7 files in total, 6 of them are duplicates by content.
    Here's a list of all duplicate files:
    
    'lsbrelease' (./lsbrelease)
    'lsbrelease2' (./lsbrelease2)
    
    'uname1' (./dir1/uname1)
    'uname1' (./uname1)
    'uname2' (./uname2)
    'uname3' (./dir1/uname3)
    
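    If we instead pass the -d flag on the same tree, the filter should drop the name that occurs twice within its group ('uname1') and keep only the uniquely named copies. Going by the script's logic, the run would look roughly like this:

    $ dupe-check -d
    Checked 7 files in total, 6 of them are duplicates by content.
    Here's a list of all duplicate files with different names:

    'lsbrelease' (./lsbrelease)
    'lsbrelease2' (./lsbrelease2)

    'uname2' (./uname2)
    'uname3' (./dir1/uname3)

    Note that both uname1 copies disappear entirely because they share a name, even though they also duplicate uname2 and uname3; the follow-up answer below tweaks exactly this behavior.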

    And here is the script:

    #! /usr/bin/env python3
    
    VERSION_MAJOR, VERSION_MINOR, VERSION_MICRO = 0, 4, 1
    RELEASE_DATE, AUTHOR = "2016-02-11", "ByteCommander"
    
    import sys
    import os
    import shutil
    import subprocess
    import argparse
    
    
    class Printer:
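        # Prints regular messages to stdout while maintaining a single
        # self-overwriting status line on stderr (pass stat=True to update it).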
        def __init__(self, normal=sys.stdout, stat=sys.stderr):
            self.__normal = normal
            self.__stat = stat
            self.__prev_msg = ""
            self.__first = True
            self.__max_width = shutil.get_terminal_size().columns
        def __call__(self, msg, stat=False):
            if not stat:
                if not self.__first:
                    print("\r" + " " * len(self.__prev_msg) + "\r", 
                          end="", file=self.__stat)
                print(msg, file=self.__normal)
                print(self.__prev_msg, end="", flush=True, file=self.__stat)
            else:
                if len(msg) > self.__max_width:
                    msg = msg[:self.__max_width-3] + "..."
                if not msg:
                    print("\r" + " " * len(self.__prev_msg) + "\r", 
                          end="", flush=True, file=self.__stat)
                elif self.__first:
                    print(msg, end="", flush=True, file=self.__stat)
                    self.__first = False
                else:
                    print("\r" + " " * len(self.__prev_msg) + "\r", 
                          end="", file=self.__stat)
                    print("\r" + msg, end="", flush=True, file=self.__stat)
                self.__prev_msg = msg
    
    
    def file_walker(top, maxdepth=-1):
        # Recursively yield (directory, filenames) pairs, descending at most
        # maxdepth subdirectory levels; any negative value means no limit.
        dirs, files = [], []
        for name in os.listdir(top):
            (dirs if os.path.isdir(os.path.join(top, name)) else files).append(name)
        yield top, files
        if maxdepth != 0:
            for name in dirs:
                for x in file_walker(os.path.join(top, name), maxdepth-1):
                    yield x
    
    
    printx = Printer()
    argparser = argparse.ArgumentParser(description="Check for duplicate files")
    argparser.add_argument("directory", action="store", default=".", nargs="?",
                           help="the directory to examine recursively "
                                "(default '%(default)s')")
    argparser.add_argument("-s", "--hashsum", action="store", default="sha256sum",
                           metavar="COMMAND", help="external system command to "
                           "generate hashes (default '%(default)s')")
    argparser.add_argument("-r", "--recursion-depth", action="store", type=int,
                           default=-1, metavar="MAXDEPTH", 
                           help="the number of subdirectory levels to process: "
                           "0=only current directory, 1=max. 1st subdirectory "
                           "level, ... (default: infinite)")
    arggroupn = argparser.add_mutually_exclusive_group()
    arggroupn.add_argument("-e", "--equal-names", action="store_const", 
                           const="e", dest="name_filter",
                           help="only list duplicates with equal file names")
    arggroupn.add_argument("-d", "--different-names", action="store_const",
                           const="d", dest="name_filter",
                           help="only list duplicates with different file names")
    argparser.add_argument("-0", "--no-zero", action="store_true", default=False,
                           help="do not list 0-byte files")
    arggroupo = argparser.add_mutually_exclusive_group()
    arggroupo.add_argument("-v", "--verbose", action="store_const", const=0, 
                           dest="output_level",
                           help="print hash and name of each examined file")
    arggroupo.add_argument("-q", "--quiet", action="store_const", const=2, 
                           dest="output_level",
                           help="suppress status output on stderr")
    arggroupo.add_argument("-Q", "--list-only", action="store_const", const=3, 
                           dest="output_level",
                           help="only list the duplicate files, no summary etc.")
    argparser.add_argument("-g", "--no-groups", action="store_true", default=False,
                           help="do not group equal duplicates")
    argparser.add_argument("-p", "--path-only", action="store_true", default=False,
                           help="only print the full path in the results list, "
                                "otherwise format output like this: "
                                "`'FILENAME' (FULL_PATH)´")
    argparser.add_argument("-V", "--version", action="version", 
                           version="%(prog)s {}.{}.{} ({} by {})".format(
                           VERSION_MAJOR, VERSION_MINOR, VERSION_MICRO, 
                           RELEASE_DATE, AUTHOR))
    argparser.set_defaults(name_filter="a", output_level=1)
    args = argparser.parse_args()
    
    hashes = {}
    dupe_counter = 0
    file_counter = 0
    try:
        for root, filenames in file_walker(args.directory, args.recursion_depth):
            if args.output_level <= 1:
                printx("--> {} files ({} duplicates) processed - '{}'".format(
                        file_counter, dupe_counter, root), stat=True)
            for filename in filenames:
                path = os.path.join(root, filename)
                file_counter += 1
                filehash = subprocess.check_output(
                           [args.hashsum, path], universal_newlines=True).split()[0]
                if args.output_level == 0:
                    printx(" ".join((filehash, path)))
                if filehash in hashes:
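                    # a new duplicate pair counts both files (+2); each further copy adds 1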
                    dupe_counter += 1 if len(hashes[filehash]) > 1 else 2
                    hashes[filehash].append((filename, path))
                    if args.output_level <= 1:
                        printx("--> {} files ({} duplicates) processed - '{}'"
                               .format(file_counter, dupe_counter, root), stat=True)
                else:
                    hashes[filehash] = [(filename, path)]
    except FileNotFoundError:
        printx("ERROR: Directory not found!")
        exit(1)
    except KeyboardInterrupt:
        printx("USER ABORTED SEARCH!")
        printx("Results so far:")
    
    if args.output_level <= 1:
        printx("", stat=True)
        if args.output_level == 0:
            printx("")
    if args.output_level <= 2:
        printx("Checked {} files in total, {} of them are duplicates by content."
                .format(file_counter, dupe_counter))
    
    if dupe_counter == 0:
        exit(0)
    elif args.output_level <= 2:
        printx("Here's a list of all duplicate{} files{}:".format(
                " non-zero-byte" if args.no_zero else "",
                " with different names" if args.name_filter == "d" else
                " with equal names" if args.name_filter == "e" else ""))
    
    first_group = True
    for filehash in hashes:
        if len(hashes[filehash]) > 1:
            # hashes[filehash] holds (filename, path) tuples, so the
            # path for getsize is element [1], not [0]
            if args.no_zero and os.path.getsize(hashes[filehash][0][1]) == 0:
                continue
            if args.name_filter == "a":
                filtered = hashes[filehash]
            else:
                filenames = {}
                for filename, path in hashes[filehash]:
                    if filename in filenames:
                        filenames[filename].append(path)
                    else:
                        filenames[filename] = [path]
                filtered = [(filename, path) 
                        for filename in filenames if (
                        args.name_filter == "e" and len(filenames[filename]) > 1 or
                        args.name_filter == "d" and len(filenames[filename]) == 1)
                        for path in filenames[filename]]
            if len(filtered) == 0:
                continue
            if (not args.no_groups) and (args.output_level <= 2 or not first_group):
                printx("")
            # only flip the flag once a group has actually been printed
            first_group = False
            for filename, path in sorted(filtered):
                if args.path_only:
                    printx(path)
                else:
                    printx("'{}' ({})".format(filename, path))
    
    • 5
  3. ExNASATerry
    2016-02-29T16:55:50+08:00

    Byte Commander's excellent script works, but it didn't give me exactly the behavior I needed (listing all duplicate files of which at least one has a different name). I made the following change, and now it fits my purposes perfectly (and saved me a ton of time)! I changed line 160 to:

    args.name_filter == "d" and len(filenames[filename]) >= 1 and len(filenames[filename]) != len(hashes[filehash]))
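    With this change, a group passes the -d filter as soon as it contains at least two distinct names: a name's occurrence count no longer has to be exactly 1, only smaller than the whole group. On the test tree from Byte Commander's answer, the uname group would then be listed in full (a sketch of the expected output under the modified condition):

    $ dupe-check -d
    Checked 7 files in total, 6 of them are duplicates by content.
    Here's a list of all duplicate files with different names:

    'lsbrelease' (./lsbrelease)
    'lsbrelease2' (./lsbrelease2)

    'uname1' (./dir1/uname1)
    'uname1' (./uname1)
    'uname2' (./uname2)
    'uname3' (./dir1/uname3)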

    • 1
