
Question

tangy

Asked: 2019-01-30 09:59:38 +0800 CST

What method does unzip use to find a single file in an archive?


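Suppose I create 100 files, each containing 30 MB of random text data. I then create a zip archive with compression level 0, i.e. zip dataset.zip -r -0 *.txt. Now I want to extract just one file from this archive.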

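As described here, there are two ways to unzip/extract a file from an archive:

1. Seek to the end of the file and look up the central directory, then use it for fast random access to the file to be extracted. (Amortized O(1) complexity)
2. Look through each local header and extract the one that matches. (O(n) complexity)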

Which method does unzip use? From my experiments, it seems to use method 2?
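For illustration, method 2 could look roughly like the Python sketch below. It is not how unzip is implemented; the names find_by_local_headers and make_archive are hypothetical, and the scan assumes that every local header carries the real sizes (no data descriptors, no Zip64) and that member names are plain ASCII.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"
LOCAL_FMT = "<4s5H3L2H"              # fixed 30-byte part of a local file header
LOCAL_LEN = struct.calcsize(LOCAL_FMT)


def find_by_local_headers(fp, wanted):
    """Walk local headers from offset 0; return (data_offset, size) or None.

    Assumption: sizes are present in each local header (no data descriptor,
    no Zip64), which holds for archives written to a seekable file.
    """
    fp.seek(0)
    while True:
        raw = fp.read(LOCAL_LEN)
        if len(raw) < LOCAL_LEN:
            return None
        (sig, _version, flags, _method, _time, _date,
         _crc, csize, _usize, name_len, extra_len) = struct.unpack(LOCAL_FMT, raw)
        if sig != LOCAL_SIG:
            return None                  # past the last member (central directory)
        if flags & 0x08:
            raise ValueError("data descriptor in use; sizes unknown at this point")
        name = fp.read(name_len).decode("utf-8")   # assumes ASCII/UTF-8 names
        fp.seek(extra_len, io.SEEK_CUR)
        if name == wanted:
            return fp.tell(), csize
        fp.seek(csize, io.SEEK_CUR)      # skip this member's data: the O(n) part


def make_archive(count=5):
    """Build a small stored (compression level 0) archive in memory for testing."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(count):
            zf.writestr(f"rand-{i}.txt", f"payload {i}\n" * 3)
    buf.seek(0)
    return buf
```

Each iteration costs one header read plus one seek, so the work grows with the number of members that precede the requested one.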

2 Answers


Stephen Kitt · Answer 1 · 2019-01-30T10:05:24+08:00

When looking for a single file in a large archive, it uses method 1, which you can see using strace:

open("dataset.zip", O_RDONLY)           = 3
ioctl(1, TIOCGWINSZ, 0x7fff9a895920)    = -1 ENOTTY (Inappropriate ioctl for device)
write(1, "Archive:  dataset.zip\n", 22Archive:  dataset.zip
) = 22
lseek(3, 943718400, SEEK_SET)           = 943718400
read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 4522) = 4522
lseek(3, 943722880, SEEK_SET)           = 943722880
read(3, "\3\f\225P\\ux\v\0\1\4\350\3\0\0\4\350\3\0\0", 20) = 20
lseek(3, 943718400, SEEK_SET)           = 943718400
read(3, "\340P\356(s\342\306\205\201\27\360U[\250/2\207\346<\252+u\234\225\1[<\2310E\342\274"..., 8192) = 4522
lseek(3, 849346560, SEEK_SET)           = 849346560
read(3, "D\262nv\210\343\240C\24\227\344\367q\300\223\231\306\330\275\266\213\276M\7I'&35\2\234J"..., 8192) = 8192
stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)
stat("rand-28.txt", 0x559f43e0a550)     = -1 ENOENT (No such file or directory)
lstat("rand-28.txt", 0x559f43e0a550)    = -1 ENOENT (No such file or directory)
open("rand-28.txt", O_RDWR|O_CREAT|O_TRUNC, 0666) = 4
ioctl(1, TIOCGWINSZ, 0x7fff9a895790)    = -1 ENOTTY (Inappropriate ioctl for device)
write(1, " extracting: rand-28.txt        "..., 37 extracting: rand-28.txt             ) = 37
read(3, "\275\3279Y\206\223\217}\355W%:\220YNT\0\257\260z^\361T\242\2\370\21\336\372+\306\310"..., 8192) = 8192

unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there.
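The same access pattern can be reproduced as a minimal Python sketch. It relies on the standard zipfile module to read the central directory, then performs one long seek to the member's recorded offset. The helper name read_stored_member is hypothetical, and the sketch only handles stored (uncompressed) members such as those created with -0.

```python
import io
import struct
import zipfile

LOCAL_FMT = "<4s5H3L2H"              # fixed 30-byte part of a local file header
LOCAL_LEN = struct.calcsize(LOCAL_FMT)


def read_stored_member(fp, name):
    """Return the bytes of a stored member using a direct seek to its header."""
    with zipfile.ZipFile(fp) as zf:      # reads the central directory, not the members
        info = zf.getinfo(name)
    if info.compress_type != zipfile.ZIP_STORED:
        raise ValueError("this sketch handles stored members only")

    fp.seek(info.header_offset)          # the long seek, e.g. to 849346560 in the trace
    fields = struct.unpack(LOCAL_FMT, fp.read(LOCAL_LEN))
    if fields[0] != b"PK\x03\x04":
        raise ValueError("no local file header at the recorded offset")
    name_len, extra_len = fields[-2:]
    fp.seek(name_len + extra_len, io.SEEK_CUR)   # skip file name and extra field
    return fp.read(info.compress_size)
```

The cost of the lookup depends on the size of the central directory rather than on where the member sits in the archive.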

The central directory is found by scanning the last 65557 bytes of the archive; see the code starting here:

/*---------------------------------------------------------------------------
    Find and process the end-of-central-directory header.  UnZip need only
    check last 65557 bytes of zipfile:  comment may be up to 65535, end-of-
    central-directory record is 18 bytes, and signature itself is 4 bytes;
    add some to allow for appended garbage.  Since ZipInfo is often used as
    a debugging tool, search the whole zipfile if zipinfo_mode is true.
  ---------------------------------------------------------------------------*/
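The scan described in that comment can be sketched in Python as follows. This is an illustration under simplifying assumptions, not a port of the unzip code: find_eocd and demo_archive are hypothetical names, Zip64 archives are not handled, and the last occurrence of the signature in the tail is taken to be the real record (a comment containing the same four bytes would fool it).

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
EOCD_FMT = "<4s4H2LH"                # 4-byte signature + 18 bytes of fields
EOCD_LEN = struct.calcsize(EOCD_FMT) # 22
MAX_TAIL = 65557                     # figure quoted in the unzip comment


def find_eocd(fp):
    """Return (entry_count, cd_size, cd_offset) from the end-of-central-directory record."""
    fp.seek(0, io.SEEK_END)
    file_size = fp.tell()
    fp.seek(max(0, file_size - MAX_TAIL))
    tail = fp.read()                 # at most MAX_TAIL bytes, whatever the archive size
    pos = tail.rfind(EOCD_SIG)       # last occurrence is closest to the end of the file
    if pos < 0 or pos + EOCD_LEN > len(tail):
        raise ValueError("end-of-central-directory record not found")
    (_sig, _disk, _cd_disk, _entries_here, entries_total,
     cd_size, cd_offset, _comment_len) = struct.unpack_from(EOCD_FMT, tail, pos)
    return entries_total, cd_size, cd_offset


def demo_archive(count=7, comment=b"archive comment"):
    """Build a small stored archive with a trailing comment, in memory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.comment = comment
        for i in range(count):
            zf.writestr(f"f{i}.txt", b"x" * 10)
    return buf
```

Only the tail of the file is read, which is why the strace output shows a seek close to the end of a roughly 900 MiB archive before any member data is touched.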

Thomas Dickey · Answer 2 · 2019-01-30T13:43:20+08:00

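Actually, it is a mixture. unzip reads some data from known locations, then reads blocks of data that are related to (but not identical with) the target entry in the zip file.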

The design of zip/unzip is explained in comments in the source files. Here is the relevant one from extract.c:

/*--------------------------------------------------------------------------- 
    The basic idea of this function is as follows.  Since the central di- 
    rectory lies at the end of the zipfile and the member files lie at the 
    beginning or middle or wherever, it is not very desirable to simply 
    read a central directory entry, jump to the member and extract it, and 
    then jump back to the central directory.  In the case of a large zipfile 
    this would lead to a whole lot of disk-grinding, especially if each mem- 
    ber file is small.  Instead, we read from the central directory the per- 
    tinent information for a block of files, then go extract/test the whole 
    block.  Thus this routine contains two small(er) loops within a very 
    large outer loop:  the first of the small ones reads a block of files 
    from the central directory; the second extracts or tests each file; and 
    the outer one loops over blocks.  There's some file-pointer positioning 
    stuff in between, but that's about it.  Btw, it's because of this jump- 
    ing around that we can afford to be lenient if an error occurs in one of 
    the member files:  we should still be able to go find the other members, 
    since we know the offset of each from the beginning of the zipfile. 
  ---------------------------------------------------------------------------*/
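The loop structure that comment describes (an outer loop over blocks, one inner loop reading central directory entries, another visiting the members) can be sketched in Python. This is an illustration only: the block size of 3, the function name extract_in_blocks, and the helper demo_members are hypothetical, and the sketch assumes stored members, ASCII names, and no Zip64.

```python
import io
import struct
import zipfile

CD_FMT = "<4s6H3L5H2L"               # fixed 46-byte part of a central directory entry
CD_LEN = struct.calcsize(CD_FMT)


def extract_in_blocks(fp, block=3):
    """Yield (name, data), alternating between the central directory and
    member data one block of entries at a time."""
    fp.seek(0, io.SEEK_END)
    size = fp.tell()
    fp.seek(max(0, size - 65557))
    tail = fp.read()
    eocd = tail.rfind(b"PK\x05\x06")
    if eocd < 0:
        raise ValueError("end-of-central-directory record not found")
    # Total entry count and central directory offset from the EOCD record.
    remaining, cd_pos = struct.unpack_from("<H4xL", tail, eocd + 10)

    while remaining:                             # outer loop: one pass per block
        fp.seek(cd_pos)                          # jump back to the central directory
        batch = []
        for _ in range(min(block, remaining)):   # inner loop 1: read entries
            f = struct.unpack(CD_FMT, fp.read(CD_LEN))
            if f[0] != b"PK\x01\x02":
                raise ValueError("bad central directory entry")
            name_len, extra_len, comment_len = f[10:13]
            name = fp.read(name_len).decode("utf-8")
            fp.seek(extra_len + comment_len, io.SEEK_CUR)
            batch.append((name, f[8], f[16]))    # name, compressed size, header offset
        cd_pos = fp.tell()                       # remember where to resume
        remaining -= len(batch)

        for name, csize, offset in batch:        # inner loop 2: visit members
            fp.seek(offset + 26)                 # name/extra lengths in local header
            name_len, extra_len = struct.unpack("<2H", fp.read(4))
            fp.seek(name_len + extra_len, io.SEEK_CUR)
            yield name, fp.read(csize)


def demo_members(count=7):
    """Build a small stored archive in memory for trying the loop."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for i in range(count):
            zf.writestr(f"m{i}.txt", f"data-{i}".encode())
    return buf
```

With seven members and a block size of 3, the file pointer returns to the central directory three times instead of seven, which is the saving the comment is after.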

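The format itself is mostly derived from PK-Ware's implementation, and is summarized in the programming-information text files. According to those, there is more than one type of record in the central directory as well, so unzip cannot simply go to the end of the file and build an array of entries in which to look up the target file.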

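Now... if you take the time to read the source code, you will see that unzip reads in buffers of 8192 bytes (look for INBUFSIZ). I would only use single-file extraction for a fairly large zip file (Java sources come to mind), but even with a smaller zip file you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java is larger than it was 20 years ago, and has not shrunk. Extracting that log from the zip file (chosen because it would not be at the end of an alphabetically sorted index, and likely not in the first block read from the central directory), strace gave the following for the lseek calls: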

lseek(3, -2252, SEEK_CUR)               = 1267
lseek(3, 120463360, SEEK_SET)           = 120463360
lseek(3, 120468731, SEEK_SET)           = 120468731
lseek(3, 120135680, SEEK_SET)           = 120135680
lseek(3, 270336, SEEK_SET)              = 270336
lseek(3, 120463360, SEEK_SET)           = 120463360
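A quick arithmetic check of the absolute offsets in that trace, written as a small Python sketch (split_offset is a hypothetical helper; the 8192 figure is the INBUFSIZ value mentioned above). Three of the four offsets are exact multiples of 8192, and the remaining one, 120468731, lies 5371 bytes into the block that starts at 120463360. That is consistent with buffer-aligned reads, although the arithmetic alone does not prove how unzip chooses its offsets.

```python
INBUFSIZ = 8192  # buffer size named in the answer


def split_offset(offset, bufsize=INBUFSIZ):
    """Return (start of the bufsize-aligned block, position within that block)."""
    within = offset % bufsize
    return offset - within, within


# Absolute SEEK_SET offsets copied from the lseek calls in the trace.
trace_offsets = [120463360, 120468731, 120135680, 270336]

for off in trace_offsets:
    start, within = split_offset(off)
    print(f"{off:>10} = block at {start:>10} + {within}")
```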

As usual with benchmarks, YMMV.
