/*---------------------------------------------------------------------------
Find and process the end-of-central-directory header. UnZip need only
check last 65557 bytes of zipfile: comment may be up to 65535, end-of-
central-directory record is 18 bytes, and signature itself is 4 bytes;
add some to allow for appended garbage. Since ZipInfo is often used as
a debugging tool, search the whole zipfile if zipinfo_mode is true.
---------------------------------------------------------------------------*/
/*---------------------------------------------------------------------------
The basic idea of this function is as follows. Since the central di-
rectory lies at the end of the zipfile and the member files lie at the
beginning or middle or wherever, it is not very desirable to simply
read a central directory entry, jump to the member and extract it, and
then jump back to the central directory. In the case of a large zipfile
this would lead to a whole lot of disk-grinding, especially if each mem-
ber file is small. Instead, we read from the central directory the per-
tinent information for a block of files, then go extract/test the whole
block. Thus this routine contains two small(er) loops within a very
large outer loop: the first of the small ones reads a block of files
from the central directory; the second extracts or tests each file; and
the outer one loops over blocks. There's some file-pointer positioning
stuff in between, but that's about it. Btw, it's because of this jump-
ing around that we can afford to be lenient if an error occurs in one of
the member files: we should still be able to go find the other members,
since we know the offset of each from the beginning of the zipfile.
---------------------------------------------------------------------------*/
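When searching for a single file in a large archive, unzip uses method 1, which you can see with strace: unzip opens dataset.zip, seeks to the end, then seeks to the start of the requested file in the archive (rand-28.txt, at offset 849346560) and reads from there. The central directory is found by scanning the last 65557 bytes of the archive; see the first source comment quoted above.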
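The tail scan described above can be sketched in a few lines of Python. This is only an illustration, not unzip's own code: the names find_eocd, EOCD_SIG and TAIL_BYTES are made up for the example, the sketch assumes a single-disk archive without Zip64 records, and it simply takes the last occurrence of the signature in the tail.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"   # 4-byte end-of-central-directory signature
TAIL_BYTES = 65557         # 65535 (comment) + 18 (record) + 4 (signature)


def find_eocd(f):
    """Scan the last TAIL_BYTES of f for the end-of-central-directory record.

    Returns (eocd_offset, total_entries, cd_size, cd_offset).
    Hypothetical helper for illustration only.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    start = max(0, size - TAIL_BYTES)
    f.seek(start)
    tail = f.read()

    pos = tail.rfind(EOCD_SIG)
    if pos < 0 or pos + 22 > len(tail):
        raise ValueError("end-of-central-directory record not found")

    # The 18 bytes after the signature: four 16-bit fields (disk numbers and
    # entry counts), two 32-bit fields (directory size and offset), and the
    # 16-bit comment length.
    fields = struct.unpack("<4H2IH", tail[pos + 4:pos + 22])
    total_entries, cd_size, cd_offset = fields[3], fields[4], fields[5]
    return start + pos, total_entries, cd_size, cd_offset


if __name__ == "__main__":
    # Build a small archive in memory so the sketch runs on its own.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for i in range(3):
            zf.writestr(f"file-{i}.txt", "hello\n")
    print(find_eocd(buf))
```

A real implementation would also have to check that the comment length matches the bytes remaining after the record, since the signature bytes can occur by chance inside the comment or the compressed data.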
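Actually, it is a mixture. unzip reads some data from a known position, and then reads blocks of data that are related to (but not identical with) the target entry in the zip file.
The design of zip/unzip is explained in comments in the source files. The pertinent one, from extract.c, is the second comment block quoted above.
The format itself is mostly derived from PK-Ware's implementation and is summarized in the programming-information text files. According to those, the central directory also contains more than one type of record, so unzip cannot readily go to the end of the file and build an array of entries in which to look up the target file.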
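The loop structure that the extract.c comment describes (an outer loop over blocks, one small loop reading central-directory entries, a second small loop extracting them) might look roughly like this in Python. It is a sketch under several assumptions, not a port of unzip: BLOCK_SIZE and all function names are invented for the example, only plain central file headers are handled (no Zip64, no other record types, no encryption), and only the stored and deflate methods are supported.

```python
import io
import struct
import zipfile
import zlib

BLOCK_SIZE = 3  # hypothetical number of entries read per block


def central_directory(f):
    """Return (offset of the first central entry, total number of entries)."""
    f.seek(0, io.SEEK_END)
    start = max(0, f.tell() - 65557)
    f.seek(start)
    tail = f.read()
    pos = tail.rfind(b"PK\x05\x06")
    if pos < 0:
        raise ValueError("no end-of-central-directory record")
    total, _cd_size, cd_offset = struct.unpack("<HII", tail[pos + 10:pos + 20])
    return cd_offset, total


def read_block(f, cd_pos, count):
    """First small loop: read `count` central entries starting at cd_pos."""
    entries = []
    f.seek(cd_pos)
    for _ in range(count):
        (sig, _vm, _vn, _flags, method, _time, _date, _crc, csize, _usize,
         nlen, elen, clen, _disk, _iattr, _eattr, offset) = struct.unpack(
            "<4s6H3I5H2I", f.read(46))
        if sig != b"PK\x01\x02":
            raise ValueError("unexpected record in central directory")
        name = f.read(nlen).decode("utf-8")
        f.seek(elen + clen, io.SEEK_CUR)   # skip extra field and comment
        entries.append((name, method, csize, offset))
    return entries, f.tell()               # remember where the next block starts


def extract_member(f, method, csize, offset):
    """Jump to a member's local header and return its uncompressed bytes."""
    f.seek(offset)
    header = f.read(30)
    nlen, elen = struct.unpack("<HH", header[26:30])
    f.seek(nlen + elen, io.SEEK_CUR)
    raw = f.read(csize)
    return raw if method == 0 else zlib.decompress(raw, -15)


def process(f):
    cd_pos, remaining = central_directory(f)
    out = {}
    while remaining > 0:                              # outer loop over blocks
        n = min(BLOCK_SIZE, remaining)
        block, cd_pos = read_block(f, cd_pos, n)      # read a block of entries
        for name, method, csize, offset in block:     # extract each member
            out[name] = extract_member(f, method, csize, offset)
        remaining -= n
    return out


if __name__ == "__main__":
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(7):
            zf.writestr(f"member-{i}.txt", f"contents {i}\n" * 20)
    print(sorted(process(buf)))
```

Because each member's offset is taken from the central directory, a failure in one call to extract_member would not prevent the loop from reaching the remaining members, which is the leniency the comment mentions.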
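Now... if you take the time to read the source code, you will see that unzip reads into a buffer of 8192 bytes (look for INBUFSIZ). I would only use single-file extraction for a fairly large zip file (Java sources come to mind), but even with a smaller zip file you can see the effect of the buffer size. To see this, I zipped up the Git files for PuTTY, which gave 2727 files (counting a copy of the git log). Java is larger than it was 20 years ago, and has not shrunk. I then extracted that log from the zip file (chosen because it would not be at the end of the alphabetically sorted index, and probably not in the first block read from the central directory) under strace to record the lseek calls. As usual with benchmarks, YMMV.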
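One way the buffer size can show up in the lseek calls is if each seek is rounded down to a multiple of the buffer size, with the remainder consumed inside the buffer. That rounding is an assumption made for this sketch, not something stated above, and buffered_seek is a hypothetical helper name. The arithmetic is short:

```python
INBUFSIZ = 8192  # buffer size named in the text


def buffered_seek(offset, bufsize=INBUFSIZ):
    """Split an absolute offset into (buffer start, position within the buffer).

    Assumes seeks are aligned down to a multiple of bufsize (an assumption
    of this sketch).
    """
    within = offset % bufsize
    return offset - within, within


if __name__ == "__main__":
    # Offset of rand-28.txt from the earlier example.
    print(buffered_seek(849346560))
    # An arbitrary unaligned offset, for comparison.
    print(buffered_seek(849346560 + 100))
```

Under that assumption, offsets passed to lseek would always be multiples of 8192, and 849346560 happens to be one already (8192 × 103680).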