Question

iBug

Asked: 2018-01-14 00:37:29 +0800 CST

Remove null bytes from the end of a large file

I just backed up the microSD card from my Raspberry Pi on my PC, which runs a Linux distribution, using this command:

dd if=/dev/sdx of=file.bin bs=16M

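The microSD card is only 3/4 full, so I suppose the end of this huge file is a run of null bytes. I'm quite sure I don't need them. How can I efficiently strip those null bytes from the end, so that I can later restore the image with this command?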

cat file.bin /dev/zero | dd of=/dev/sdx bs=16M
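
The restore pipeline above writes the image and then keeps feeding zeros until the device is full. A minimal Python sketch of the same idea follows, for illustration only. The function name `restore_with_padding` and its parameters are hypothetical, and unlike the pipeline it assumes the target size is known and passed in explicitly.

```python
def restore_with_padding(image_path, target_path, target_size,
                         chunk_size=16 * 1024 * 1024):
    """Copy image_path over the start of target_path, then zero-fill
    the rest up to target_size bytes. Returns the bytes written."""
    written = 0
    zeros = bytes(chunk_size)  # reusable buffer of NUL bytes
    # "r+b" opens an existing file or device without truncating it
    with open(image_path, "rb") as src, open(target_path, "r+b") as dst:
        # Phase 1: copy the (stripped) image
        while written < target_size:
            chunk = src.read(min(chunk_size, target_size - written))
            if not chunk:
                break
            dst.write(chunk)
            written += len(chunk)
        # Phase 2: pad with zeros, as /dev/zero does in the pipeline
        while written < target_size:
            n = min(chunk_size, target_size - written)
            dst.write(zeros[:n])
            written += n
    return written
```

Writing to a real device this way needs the same privileges as dd and is just as destructive, so the target path deserves a second look before running it.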

6 Answers

John1024 · Answer 1 · 2018-01-14T00:55:51+08:00

Best Answer

To make a backup copy of a disk while saving space, use gzip:

gzip </dev/sda >/path/to/sda.gz

To restore the disk from the backup, use:

gunzip -c /path/to/sda.gz >/dev/sda

This will likely save far more space than merely stripping the trailing NUL bytes.
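
For readers who prefer to script this, here is a rough Python equivalent of the two gzip commands, using only the standard library. The function names `compress_image` and `restore_image` are hypothetical; the sketch streams in chunks so the image never has to fit in memory.

```python
import gzip
import shutil

def compress_image(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream src_path (a device or image file) into a gzip file."""
    with open(src_path, "rb") as src, gzip.open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)

def restore_image(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream a gzip file back out to dst_path (a device or image file)."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
```

Long runs of zeros compress to almost nothing, which is why this handles the trailing NUL bytes and any zero-filled gaps in the middle of the image as well.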

Removing trailing NUL bytes

If you really want to remove the trailing NUL bytes and you have GNU sed, you could try:

sed '$ s/\x00*$//' /dev/sda >/path/to/sda.stripped

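This might run into problems if the data on a large disk exceeds some internal limit of sed. While GNU sed has no built-in limit on data size, the GNU sed manual explains that system memory limits may prevent processing of large files:

GNU sed has no built-in limit on line length; as long as it can malloc() more (virtual) memory, you can feed or construct lines as long as you like.

However, recursion is used to handle subpatterns and indefinite repetition. This means that the available stack space may limit the size of the buffer that can be processed by certain patterns.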

zqb-all · Answer 2 · 2020-04-07T17:10:19+08:00

You can write a simple tool to solve this problem.

Read the file, find the last valid (non-null) byte, then truncate the file there.

A Rust example from https://github.com/zqb-all/cut-trailing-bytes:

use std::io;
use std::io::prelude::*;
use std::fs::File;
use std::fs::OpenOptions;
use std::path::PathBuf;
use structopt::StructOpt;
use std::num::ParseIntError;

fn parse_hex(s: &str) -> Result<u8, ParseIntError> {
    u8::from_str_radix(s, 16)
}

#[derive(Debug, StructOpt)]
#[structopt(name = "cut-trailing-bytes", about = "A tool for cut trailing bytes, default cut trailing NULL bytes(0x00 in hex)")]
struct Opt {
    /// File to cut
    #[structopt(parse(from_os_str))]
    file: PathBuf,

    /// For example, pass 'ff' if want to cut 0xff
    #[structopt(short = "c", long = "cut-byte", default_value="0", parse(try_from_str = parse_hex))]
    byte_in_hex: u8,

    /// Check the file but don't real cut it
    #[structopt(short, long = "dry-run")]
    dry_run: bool,
}


fn main() -> io::Result<()> {

    let opt = Opt::from_args();
    let filename = &opt.file;
    let mut f = File::open(filename)?;
    let mut valid_len = 0;
    let mut tmp_len = 0;
    let mut buffer = [0; 4096];

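    loop {
        // Read the next chunk; read() returns 0 at end of file.
        let n = f.read(&mut buffer[..])?;
        if n == 0 { break; }
        // Only the first n bytes of the buffer hold fresh data.
        for &byte in &buffer[..n] {
            if byte == opt.byte_in_hex {
                // Possibly part of the trailing run; count it provisionally.
                tmp_len += 1;
            } else {
                // A real byte: the provisional run was not trailing after all.
                valid_len += tmp_len + 1;
                tmp_len = 0;
            }
        }
    }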
    if !opt.dry_run {
        let f = OpenOptions::new().write(true).open(filename);
        f.unwrap().set_len(valid_len)?;
    }
    println!("cut {} from {} to {}", filename.display(), valid_len + tmp_len, valid_len);

    Ok(())
}
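
The same forward-scan idea can be sketched in Python for comparison. This is an illustrative port, not part of the linked project; the name `cut_trailing_bytes` and its parameters are hypothetical. It remembers where the last non-matching byte ended and truncates the file in place with `os.truncate`.

```python
import os

def cut_trailing_bytes(path, byte=0, chunk_size=4096, dry_run=False):
    """Scan path front to back and cut the trailing run of `byte`.
    Returns (old_size, new_size). The file is untouched if dry_run."""
    pad = bytes([byte])
    valid_len = 0  # length up to and including the last non-matching byte
    offset = 0     # number of bytes read so far
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            stripped = chunk.rstrip(pad)
            if stripped:
                # This chunk contains real data; remember where it ends
                valid_len = offset + len(stripped)
            offset += len(chunk)
    if not dry_run:
        os.truncate(path, valid_len)
    return offset, valid_len
```

Like the Rust version, this reads the whole file once, so its running time grows with the file size rather than with the length of the trailing run.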

letmaik · Answer 3 · 2020-04-22T08:40:17+08:00

I tried John1024's sed command, and it works most of the time, but for some large files it didn't trim correctly. The following will always work:

python -c "open('file-stripped.bin', 'wb').write(open('file.bin', 'rb').read().rstrip(b'\0'))"

Note that this loads the whole file into memory first. You can avoid that by writing a proper Python script that processes the file in chunks.
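
One possible shape for such a chunked script is sketched below. It is an assumption-laden illustration rather than the answerer's code: the name `strip_trailing_nuls` is hypothetical, and unlike the one-liner it truncates the file in place instead of writing a stripped copy. It reads backwards from the end, so it stops as soon as it meets real data.

```python
import os

def strip_trailing_nuls(path, chunk_size=1024 * 1024):
    """Truncate trailing NUL bytes from path in place, reading
    backwards one chunk at a time. Returns the new file size."""
    with open(path, "r+b") as f:
        end = f.seek(0, os.SEEK_END)  # current candidate for the new size
        while end > 0:
            start = max(0, end - chunk_size)
            f.seek(start)
            chunk = f.read(end - start).rstrip(b"\0")
            if chunk:
                # Real data found; it ends len(chunk) bytes into this chunk
                end = start + len(chunk)
                break
            end = start  # chunk was all NULs, keep moving backwards
        f.truncate(end)
    return end
```

Memory use is bounded by `chunk_size`, and because nothing is copied, no extra disk space is needed.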

Ella Jameson · Answer 4 · 2022-05-05T21:12:41+08:00

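Preface

I just solved this problem myself in Python. It is simple in theory, but in practice it takes a fair amount of code to do correctly. I wanted to share my work here so that others don't have to figure it out themselves.

The simple (bad) way

The simplest way (posted earlier by letmaik) is to load the file into memory as a byte string, use Python's .rstrip() to remove the trailing null bytes from it, then save that byte string over the original file.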

def strip_file_blank_space(filename):
    # Strips null bytes at the end of a file, and returns the new file size
    # This will process the file all at once
    # Open the file for reading bytes (then close)
    with open(filename, "rb") as f:
        # Read all of the data into memory
        data = f.read()
    # Strip trailing null bytes from the data in-memory
    data = data.rstrip(b'\x00')
    # Open the file for writing bytes (then close)
    with open(filename, "wb") as f:
        # Write the data from memory to the disk
        f.write(data)
    # Return the new file size
    return(len(data))

new_size = strip_file_blank_space("file.bin")

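Assuming the file is smaller than the available system memory, this will probably work in most cases. But for larger files (32+ GB), or on systems with little RAM (a Raspberry Pi), the process will either crash the computer or be killed by the system's memory manager.

The hard (correct) way

The only way around the limited-memory problem is to load a small chunk of data at a time, process it, drop it from memory, then repeat with the next chunk until the whole file has been processed. Normally you can do this in Python with very compact code, but because we need to process chunks starting at the end of the file and working backwards, it takes a bit more work.

I've already done the work for you. Here it is: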

import os
import shutil
from math import floor
import warnings
import tempfile

def strip_file_blank_space(filename, block_size=1*(1024*1024)):
    # Strips null bytes at the end of a file, and returns the new file size
    # This will process the file in chunks, to conserve memory (default = 1 MiB)
    file_end_loc = None # This will be used if the file is larger than the block size
    simple_data = None # This is used if the file is smaller than the block size
    # Open the source file for reading
    with open(filename, "rb") as f:
        # Get original file size
        filesize = os.fstat(f.fileno()).st_size
        # Test if file size is less than (or equal to) the block size
        if filesize <= block_size:
            # Load data to do a normal rstrip all in-memory
            simple_data = f.read()
        # If the file is larger than the specified block size
        else:
            # Compute number of whole blocks (remainder at beginning processed separately)
            # Integer division avoids float rounding on very large files
            num_whole_blocks = filesize // block_size
            # Compute number of remaining bytes
            num_bytes_partial_block = filesize - (num_whole_blocks * block_size)
            # Go through each block, looking for the location where the zeros end
            for block in range(num_whole_blocks):
                # Set file position, relative to the end of the file
                current_position = filesize - ((block+1) * block_size)
                f.seek(current_position)
                # Read current block
                block_data = f.read(block_size)
                # Strip current block from right side
                block_data = block_data.rstrip(b"\x00")
                # Test if the block data was all zeros
                if len(block_data) == 0:
                    # Move on to next block
                    continue
                # If it was not all zeros
                else:
                    # Find the location in the file where the real data ends
                    blocks_not_processed = num_whole_blocks - (block+1)
                    file_end_loc = num_bytes_partial_block + (blocks_not_processed * block_size) + len(block_data)
                    break
            # Test if the end location was not found in the full blocks loop
            if file_end_loc is None:
                # Read partial block at the beginning of the file
                f.seek(0)
                partial_block_data = f.read(num_bytes_partial_block)
                # Strip from the right side
                partial_block_data = partial_block_data.rstrip(b"\x00")
                # Test if this block (and therefore the entire file) is zeros
                if len(partial_block_data) == 0:
                    # Warn about the empty file
                    warnings.warn("File was all zeros and will be replaced with an empty file")
                # Set the location where the real data ends
                file_end_loc = len(partial_block_data)
    
    # If we are doing a normal strip:
    if simple_data is not None:
        # Strip right trailing null bytes
        simple_data = simple_data.rstrip(b'\x00')
        # Directly replace file
        with open(filename, "wb") as f:
            f.write(simple_data)
            new_filesize = os.fstat(f.fileno()).st_size
        # Return the new file size
        return len(simple_data)
    # If we are doing a block-by-block copy and replace
    else:
        # Create temporary file (do not delete, will move myself)
        temp_file = tempfile.NamedTemporaryFile(mode="wb", delete=False)
        # Open the source file for reading
        with open(filename, "rb") as f:
            # Test if data is smaller than (or equal to) the block size
            if file_end_loc <= block_size:
                # Do a direct copy
                f.seek(0)
                data = f.read(file_end_loc)
                temp_file.write(data)
                temp_file.close()
            # If the data is larger than the block size
            else:
                # Find number of whole blocks to copy
                num_whole_blocks_copy = file_end_loc // block_size
                # Find partial block data size (at the end of the file this time)
                num_bytes_partial_block_copy = file_end_loc - (num_whole_blocks_copy * block_size)
                # Copy whole blocks
                f.seek(0)
                for block in range(num_whole_blocks_copy):
                    # Read block data (automatically moves position)
                    block_data = f.read(block_size)
                    # Write block to temp file
                    temp_file.write(block_data)
                # Test for any partial block data
                if num_bytes_partial_block_copy > 0:
                    # Read remaining data
                    partial_block_data = f.read(num_bytes_partial_block_copy)
                    # Write remaining data to temp file
                    temp_file.write(partial_block_data)
                # Close temp file
                temp_file.close()
        # Delete original file
        os.remove(filename)
        # Replace original with temporary file
        shutil.move(temp_file.name, filename)
        # Return the new file size
        return(file_end_loc)

new_size = strip_file_blank_space("file.bin") # Defaults to 1 MiB blocks

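As you can see, it takes many more lines of code, but if you're reading this, you no longer have to write them! You're welcome. :)

I have tested this function on a Raspberry Pi with 1 GB of RAM using 4+ GB files, and the total memory used by the process never exceeded 50 MB. It takes a while to run, but it works flawlessly.

Conclusion

When programming, be mindful of how much data is loaded into memory at any given time. Keep in mind the largest file size you might have to handle, and the lower bound of the memory you might have available.

I hope this helps someone!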

Stéphane Chazelas · Answer 5 · 2022-05-05T21:32:22+08:00

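At least on Linux (and on file systems that support it, such as modern ext4), you can use fallocate -d to replace those sequences of zeros with holes that take up no disk space: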

$ echo test > a
$ head -c1G /dev/zero >> a
$ echo test2 >> a
$ head -c1G /dev/zero >> a
$ du -h a
2.1G    a
$ ls -l a
-rw-r--r-- 1 stephane stephane 2147483659 May  5 06:23 a

That 2 GiB file takes up 2 GiB of disk space.

$ fallocate -d a
$ ls -l a
-rw-r--r-- 1 stephane stephane 2147483659 May  5 06:23 a
$ du -h a
12K     a

It is the same 2 GiB file, but it now takes up only 12 KiB of disk space.

$ filefrag -v a
Filesystem type is: ef53
File size of a is 2147483659 (524289 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:    7504727..   7504727:      1:
   1:   262144..  262144:   48424960..  48424960:      1:    7766871: last
a: 2 extents found

You can remove the trailing hole with:

truncate -os 262145 a

The last block should now contain data:

$ tail -c4096 a | hd
00000000  00 00 00 00 00 74 65 73  74 32 0a 00 00 00 00 00  |.....test2......|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00001000

You could also remove the trailing zeros within that last block, but note that doing so would not save any space on disk.
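
The ls/du comparison and the search for the end of the last data extent can also be done from Python. The sketch below is a Linux-oriented illustration with hypothetical function names; it assumes `os.SEEK_DATA` and `os.SEEK_HOLE` are available on the platform, and on a file system without hole support the whole file is simply reported as data.

```python
import errno
import os

def disk_usage(path):
    """Return (apparent_size, allocated_bytes), like `ls -l` versus `du`."""
    st = os.stat(path)
    # On Linux st_blocks is counted in 512-byte units
    return st.st_size, st.st_blocks * 512

def end_of_last_data(path):
    """Return the offset just past the last data extent of path,
    i.e. the size the file would have with its trailing hole removed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        end = 0
        offset = 0
        while offset < size:
            try:
                data_start = os.lseek(fd, offset, os.SEEK_DATA)
            except OSError as e:
                if e.errno == errno.ENXIO:
                    break  # only a hole remains after offset
                raise
            # The next hole (or end of file) closes this data extent
            end = os.lseek(fd, data_start, os.SEEK_HOLE)
            offset = end
        return end
    finally:
        os.close(fd)
```

The value returned by `end_of_last_data` is rounded to the file system's extent boundaries, which matches the block-granular truncate shown above rather than a byte-exact strip.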

Stéphane Chazelas · Answer 6 · 2022-05-05T22:05:35+08:00

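Note that just because 3/4 of the file system is unallocated does not mean the corresponding blocks of the underlying storage device contain zeros. If files were written there earlier and later deleted, the old data is still present; the corresponding blocks are merely marked as unallocated in the file system's structures.

One exception may be when TRIM/DISCARD support is available on the block device and was in use while the file system was mounted.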
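
To see how much of an image really is zero-filled, the blocks can simply be counted. The following is a small illustrative sketch; the function name `count_zero_blocks` and the 4096-byte default block size are assumptions, not something the answer prescribes.

```python
def count_zero_blocks(path, block_size=4096):
    """Return (zero_blocks, total_blocks) for the file at path.
    A block counts as zero only if every byte in it is NUL."""
    zero = total = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            total += 1
            # bytes.count(0) counts NUL bytes in the block
            if block.count(0) == len(block):
                zero += 1
    return zero, total
```

If the zero count is much lower than the amount of free space reported by the file system, the unallocated blocks still hold old data and neither stripping nor hole punching will shrink the image much.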

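On Linux, loop devices do support trim, and will punch the corresponding holes in the backing file if the underlying file system supports it, so you can mount the file system from your image: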

sudo mount -o loop file.bin /somewhere

and run:

sudo fstrim /somewhere

to discard the unallocated blocks of the file system.

If the image is partitioned:

sudo losetup -fP --show file.bin

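and then mount the corresponding /dev/loopXpY partitions.

You may also want to look at tools such as partimage for dumping your SD card. They take care of dumping only the allocated parts.

Running zerofree on the SD card before dumping it can also ensure that the unallocated parts are filled with zeros.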
