
Question

Michael Käfer

Asked: 2014-11-22 03:35:06 +0800 CST

Split a 10GB text file 1) with a minimum output file size of 40MB and 2) after a specific string (</record>)


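I have a large text file (10GB, .xml, containing over 1 million tags like this: <record>text</record>) which I want to split into parts in order to work with it. To be able to automate my workflow, every part has to end with a specific tag: </record>. It is also necessary that every part has a size of at least about 40MB.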

1 Answer


Jacob Vlijm · Answer 1 · 2014-11-25T01:33:53+08:00

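The script below cuts a (large) file into slices. I did not use the split command, since the content of your file has to be "rounded" to whole records. The size of the slices can be set in the head section of the script.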

The procedure

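The difficulty
Because the script has to be able to deal with huge files, python's read() or readlines() cannot be used; either call would attempt to load the whole file into memory at once, which would certainly choke your system. At the same time, the divisions have to be made by "rounding" the sections to whole records. The script should therefore somehow be able to identify or "read" the file's content.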

It seems the only option is to use:

with open(file) as src:
    for line in src:

which reads the file line by line.
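As a minimal sketch of why this works for large files: iterating over the file object yields one line at a time, so memory use is bounded by the longest line rather than by the file size. The function name `count_records` and the tiny demo file are hypothetical and only serve as an illustration; they are not part of the script below.

```python
import os
import tempfile

def count_records(path, marker="</record>"):
    """Count lines and marker-containing lines without loading the whole file."""
    lines = 0
    records = 0
    with open(path) as src:
        for line in src:              # only one line is held in memory at a time
            lines += 1
            if marker in line:
                records += 1
    return lines, records

# tiny demo file standing in for the big xml (hypothetical content)
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write("<record>\na\n</record>\n<record>\nb\n</record>\n")
    demo_path = tmp.name

demo_counts = count_records(demo_path)
os.remove(demo_path)
print(demo_counts)   # (number of lines, number of records)
```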

Approach
In the script I chose a two-step approach:

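1. Analyze the file (size, number of slices, number of lines, number of records, records per section), then create a list of sections or "markers" (by line index).
2. Read the file again, but now assign the lines to separate files.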
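The marker idea from the first step could be sketched roughly as follows. This is a simplified illustration, not the script's exact bookkeeping: the helper name `slice_end_lines` is hypothetical, and it assumes you already have the line indexes of all lines that close a record.

```python
def slice_end_lines(record_end_lines, records_per_slice):
    """Pick the line index at which each slice should end."""
    # every records_per_slice-th record closes a slice
    ends = record_end_lines[records_per_slice - 1::records_per_slice]
    if not ends:
        # fewer records than one slice holds: everything goes into one slice
        return record_end_lines[-1:]
    # fold leftover records into the final slice so it is never undersized
    ends[-1] = record_end_lines[-1]
    return ends

# demo: 7 records of 3 lines each, closing tags on lines 2, 5, 8, ...
closing = [2, 5, 8, 11, 14, 17, 20]
markers = slice_end_lines(closing, 3)
print(markers)
```

In this demo the first slice ends at line 8 (three records) and the second takes the remaining four records up to line 20. In the second pass, any lines that follow the last closing tag (for example a closing root element) would also have to go into the final slice.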

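Appending the lines one by one to the separate slices (files) seems inefficient, but of everything I tried it turned out to be the most efficient, fastest and least resource-consuming option.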

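How I tested
I created an xml file of a little over 10GB, filled with records like your example. I set the size of the slices to 45mb. On my not-so-recent system (Pentium Dual-Core CPU E6700 @ 3.20GHz × 2), the script's analysis produced the following: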

analyzing file...

checking file size...
file size: 10767 mb
calculating number of slices...
239 slices of 45 mb
checking number of lines...
number of lines: 246236399
checking number of records...
number of records: 22386000
calculating number records per section ...
records per section: 93665
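The figures in this output follow from the script's integer arithmetic, which can be reproduced from the reported values. The average record size at the end is derived here for illustration only; the script itself does not print it.

```python
size_mb = 10767        # reported file size in mb
slice_mb = 45          # configured slice size in mb
records = 22386000     # reported number of records

slices = int(size_mb / slice_mb)                 # 239
records_per_section = int(records / slices)      # 93665
avg_record_bytes = size_mb * 1000000 / records   # derived, roughly 481 bytes

print(slices, records_per_section, round(avg_record_bytes))
```

Because both divisions are rounded down, each section holds slightly fewer records than an exact 45 mb share would, and the remainder ends up in the last slice.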

Then it started creating slices of 45 mb, taking approx. 25-27 seconds per slice.

creating slice 1
creating slice 2
creating slice 3
creating slice 4
creating slice 5

etc...

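During the process, the processor was occupied for 45-50%, and about 850-880mb of memory (of 4GB) was in use. The computer remained usable as normal during the process.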

The whole procedure took one and a half hours. On a more recent system, it should take less time.

The script

#!/usr/bin/env python3

import os
import time

#---
file = "/path/to/big/file.xml" 
out_dir = "/path/to/save/slices"
size_ofslices = 45 # in mb
identifying_string = "</record>"
#---

line_number = -1
records = [0]

# analyzing file -------------------------------------------

print("analyzing file...\n")
# size in mb
print("checking file size...")
size = int(os.stat(file).st_size/1000000)
print("file size:", size, "mb")
# number of sections
print("calculating number of slices...")
sections = int(size/size_ofslices)
print(sections, "slices of", size_ofslices, "mb")
# misc. data
print("checking number of lines...")
with open(file) as src:
    for line in src:
        line_number = line_number+1
        if identifying_string in line:
            records.append(line_number)
# last index (number of lines -1)
ns_oflines = line_number
print("number of lines:", ns_oflines)
# number of records
print("checking number of records...")
ns_records = len(records)-1
print("number of records:", ns_records)
# records per section
print("calculating number records per section ...")
ns_recpersection = int(ns_records/sections)
print("records per section:", ns_recpersection)

# preparing data -------------------------------------------

rec_markers = [i for i in range(ns_records) if i% ns_recpersection == 0]+[ns_records]   # dividing records (indexes of) in slices
line_markers = [records[i] for i in rec_markers]                                        # dividing lines (indexes of) in slices
line_markers[-1] = ns_oflines; line_markers.pop(-2)                                     # extending the last section until the last line

# creating sections ----------------------------------------

sl = 1
line_number = 0

curr_marker = line_markers[sl]
outfile = out_dir+"/"+"slice_"+str(sl)+".txt"

def writeline(outfile, line):
    with open(outfile, "a") as out:
        out.write(line)

with open(file) as src:
    print("creating slice", sl)
    for line in src:
        if line_number <= curr_marker:
            writeline(outfile, line)
        else:
            sl = sl+1
            curr_marker = line_markers[sl]
            outfile = out_dir+"/"+"slice_"+str(sl)+".txt"
            print("creating slice", sl)
            writeline(outfile, line)       
        line_number = line_number+1

How to use

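Copy the script into an empty file, set the path to your "big file", the path to the directory in which to save the slices, and the slice size. Save it as slice.py and run it by the command: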

python3 /path/to/slice.py

Notes

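The size of the big file should exceed the slice size by at least a few times. The bigger the difference, the more reliable the size of the (output) slices will be.
The assumption was made that the average size of the records (seen in the bigger picture) is about the same. Looking at the huge amount of data here, one would expect that to be an acceptable assumption, but you will have to check (by looking whether there is a big difference in the size of the slices).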
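One way to carry out that check is sketched below. The helper `check_slices` is hypothetical and not part of the script above; it assumes the slices are named slice_N.txt as the script writes them, and it reports each slice's size together with whether its last non-empty line contains the closing tag.

```python
import os
import tempfile

def check_slices(out_dir, marker="</record>"):
    """Return (name, size in mb, ends_with_marker) for every slice file."""
    report = []
    for name in sorted(os.listdir(out_dir)):
        if not name.startswith("slice_"):
            continue
        path = os.path.join(out_dir, name)
        last = ""
        with open(path) as src:
            for line in src:          # streamed, so large slices are fine
                if line.strip():
                    last = line       # remember the last non-empty line
        size_mb = os.path.getsize(path) / 1000000
        report.append((name, size_mb, marker in last))
    return report

# demo with two tiny stand-in slices (hypothetical content)
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "slice_1.txt"), "w") as f:
        f.write("<record>\na\n</record>\n")
    with open(os.path.join(demo_dir, "slice_2.txt"), "w") as f:
        f.write("<record>\nb\n")
    demo_report = check_slices(demo_dir)

for name, size_mb, ok in demo_report:
    print(name, size_mb, ok)
```

Note that the very last slice of a real file may legitimately end with something other than the record tag, for example a closing root element, so a negative result there is not necessarily an error.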
