Question

user13185

Asked: 2009-08-07 06:02:42 +0800 CST

How to run a command on parts of an input file

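I have a ~40 GB file and a filter command that, for some reason, breaks when I run it on the file (even when the input is piped in).

But when I split the input file into many small files, pass each one through the filter, and concatenate the outputs, it does not fail.

So I'm looking for a way to:

split the file into small chunks (10 MB?)
run some command on each chunk
concatenate the outputs in the correct order

but without fully splitting the file first (I don't want to use that much disk space).

I could write such a program myself, but perhaps something already exists that does what I need?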

6 Answers

Kyle Brandt · Answer 1 · 2009-08-07T06:08:33+08:00

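If you decide to write it yourself and you are dealing with text files, you could use Perl with the Tie::File module. It lets you process large files in place, one line at a time. It is meant for exactly this kind of thing.

If the files are not text, you could try Tie::File::AnyData.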

1

Robert Swisher · Answer 2 · 2009-08-07T10:20:32+08:00

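Edit: I just noticed that you don't want to split the file ahead of time because of disk space, so this probably won't work for you.

Use split: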

$ man split

NAME
   split - split a file into pieces

SYNOPSIS
   split [OPTION] [INPUT [PREFIX]]

DESCRIPTION
   Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT is -, read standard input.

   Mandatory arguments to long options are mandatory for short options too.

   -a, --suffix-length=N
          use suffixes of length N (default 2)

   -b, --bytes=SIZE
          put SIZE bytes per output file

   -C, --line-bytes=SIZE
          put at most SIZE bytes of lines per output file

   -d, --numeric-suffixes
          use numeric suffixes instead of alphabetic

   -l, --lines=NUMBER
          put NUMBER lines per output file

   --verbose
          print a diagnostic to standard error just before each output file is opened

   --help display this help and exit

   --version
          output version information and exit

   SIZE may have a multiplier suffix: b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024, GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
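
For readers on a newer system: later releases of GNU split (coreutils 8.13 and up, as far as I know) add a --filter option that is not in the man page excerpt above. It pipes each chunk to a command instead of writing it to disk, which seems to fit what the question asks for. A minimal sketch, where run_in_chunks is just a name made up for illustration and the filter is assumed to read stdin and write stdout:

```shell
#!/bin/bash
# Hypothetical helper: run FILTER over INPUT in chunks of LINES lines,
# without storing the chunks on disk.
# Assumes GNU split with --filter support; older or non-GNU split lacks it.
run_in_chunks() {
    local lines=$1 filter=$2 input=${3:--}   # INPUT defaults to stdin
    # split starts the filter once per chunk, one chunk at a time,
    # so the filter's output arrives in the original order.
    split -l "$lines" --filter="$filter" "$input"
}
```

For example, `seq 1 5 | run_in_chunks 3 tac` should print 3 2 1 5 4, one number per line. To chunk by size rather than line count, something like `split -C 10M --filter='yourcommand' yourfile` keeps whole lines together in pieces of at most 10 MB.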

bmb · Answer 3 · 2009-08-07T10:41:25+08:00

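I suggest using sed to extract just the part you want and piping the output into your command:

sed -n '1,1000p' yourfile | yourcommand

pipes the first 1000 lines to your command.

sed -n '1001,2000p' yourfile | yourcommand

pipes the next 1000 lines.

And so on.

If you want, you can put this in a loop in a script.

For example: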

#!/bin/bash
# Usage: script FILE (pipes FILE to yourcommand, $size lines at a time)
size=1000
lines=$(wc -l < "$1")
first=1
last=$size

# Full chunks
while [ "$last" -lt "$lines" ] ; do
    sed -n "${first},${last}p" "$1" | yourcommand
    first=$((last + 1))
    last=$((last + size))
done

# Final (possibly shorter) chunk
last=$lines
sed -n "${first},${last}p" "$1" | yourcommand
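
One caveat with the loop above is that every sed call starts reading from the top of the file again. If that matters for a 40 GB input, a single-pass variant can be sketched with awk, which reopens the pipe to the command every N lines. The function name run_in_parts and its arguments are placeholders, and this assumes the command reads stdin and writes stdout:

```shell
#!/bin/bash
# Hypothetical helper: read stdin once and feed it to CMD in chunks of
# LINES lines, starting a fresh CMD for each chunk.
run_in_parts() {
    local lines=$1 cmd=$2
    # Note: awk -v interprets backslash escapes in the value of cmd.
    awk -v n="$lines" -v cmd="$cmd" '
        { print | cmd }               # opens the pipe on first use
        NR % n == 0 { close(cmd) }    # end of chunk: close and wait for cmd
    '
}
```

Used as `seq 1 5 | run_in_parts 3 tac`, this should print 3 2 1 5 4, one number per line. Because close() waits for the command to finish before the next chunk starts, the outputs stay in order.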

0

Justin Ellison · Answer 4 · 2009-08-07T11:16:01+08:00

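Try this:

#!/bin/bash

FILE=/var/log/messages
CHUNKSIZE=100

LINE=1
TOTAL=$(wc -l < "$FILE")
while [ "$LINE" -le "$TOTAL" ]; do
  # Each pass covers lines LINE..ENDLINE, i.e. CHUNKSIZE lines
  ENDLINE=$((LINE + CHUNKSIZE - 1))
  # -n suppresses sed's default output so only the selected range is printed
  sed -n "${LINE},${ENDLINE}p" "$FILE" | grep -i "mark"
  LINE=$((ENDLINE + 1))
done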

0

user13185 · Answer 5 · 2009-08-07T11:34:31+08:00

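OK, to everyone suggesting that I write my own solution: I can. And I can even do it without "scanning" the input file multiple times. But the question is: is there a ready-made tool?

The simplest Perl-based approach could look like this: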

#!/usr/bin/perl -w
use strict;

my ( $lines, $command ) = @ARGV;

open my $out, '|-', $command;

my $i = 0;
while (<STDIN>) {
    $i++;
    if ($i > $lines) {
        close $out;
        open $out, '|-', $command;
        $i = 1;
    }
    print $out $_;
}

close $out;

exit;

Now I can:

=> seq 1 5
1
2
3
4
5

=> seq 1 5 | ./run_in_parts.pl 3 tac
3
2
1
5
4

0

200_success · Answer 6 · 2009-08-07T14:00:24+08:00

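You're not the first person to run into this problem with iconv. Someone has written a Perl script to solve it.

iconv doesn't handle large files well. From the glibc source, in iconv/iconv_prog.c: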

/* Since we have to deal with
   arbitrary encodings we must read the whole text in a buffer and
   process it in one step.  */

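However, for your particular case, it may be better to write your own UTF-8 validator. You could easily distill iconv -c -f utf8 -t utf8 into a small C program with a loop that calls iconv(3). Since UTF-8 is modeless and self-synchronizing, you can process it in chunks.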

#include <errno.h>
#include <iconv.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BUFSIZE 4096

/* Copy STDIN to STDOUT, omitting invalid UTF-8 sequences */
int main() {
    char ib[BUFSIZE], ob[BUFSIZE], *ibp, *obp;
    ssize_t bytes_read;
    size_t iblen = 0, oblen;
    unsigned long long total;
    iconv_t cd;

    if ((iconv_t)-1 == (cd = iconv_open("utf8", "utf8"))) {
        perror("iconv_open");
        return 2;
    }

    for (total = 0;
         bytes_read = read(STDIN_FILENO, ib + iblen, sizeof(ib) - iblen);
         total += bytes_read - iblen) {

        if (-1 == bytes_read) {     /* Handle read error */
            perror("read");
            return 1;
        }
        ibp = ib; iblen += bytes_read;
        obp = ob; oblen = sizeof(ob);
        if (-1 == iconv(cd, &ibp, &iblen, &obp, &oblen)) {
            switch (errno) {
              case EILSEQ:          /* Invalid input multibyte sequence */
                fprintf(stderr, "Invalid multibyte sequence at byte %llu\n",
                        1 + total + sizeof(ib) - iblen);
                ibp++; iblen--;     /* Skip the bad byte next time */
                break;
              case EINVAL:          /* Incomplete input multibyte sequence */               
                break;
              default:
                perror("iconv");
                return 2;
            }
        }
        write(STDOUT_FILENO, ob, sizeof(ob) - oblen);

        /* There are iblen bytes at the end of ib that follow an invalid UTF-8
           sequence or are part of an incomplete UTF-8 sequence.  Move them to  
           the beginning of ib. */
        memmove(ib, ibp, iblen);
    }
    return iconv_close(cd);
}
