AskOverflow.Dev

αғsнιη
Asked: 2016-12-03 16:13:42 +0800 CST

Process file records by timestamp in x-minute intervals


I have a file, part of which is shown below as a sample, that contains a timestamp field:

20161203001211,00
20161203001200,00
20161203001500,102
20161203003224,00
20161203001500,00
20161203004211,00
20161203005659,102
20161203000143,103
20161202001643,100
....

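I want to process this file by timestamp and count the occurrences in 15-minute intervals. I know how to do this per minute, and I have also done it for 10-minute intervals with an awk script, but I don't know how to get the following output for 15-minute intervals: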

startTime-endTime             total SUCCESS FAILED    
20161203000000-20161203001500 5     3       2
20161203001500-20161203003000 2     1       1
20161203003000-20161203004500 2     2       0
20161203004500-20161203010000 1     0       1
20161202000000-20161202001500 0     0       0
20161202001500-20161202003000 1     0       1
....

00 means success; any other value means a failed record.

And yes, it is 24 hours, so each hour of the day should print 4 interval records.

command-line
1 Answer

  1. Best Answer
    Jacob Vlijm
    2016-12-04T12:25:39+08:00

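    Writing reports on timestamped data files; complicated requirements


    Although the initial question was somewhat complicated, the background of the question made it quite a challenge. Additional circumstances (as discussed in chat):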

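    • The script needs to combine multiple timestamped files into one report, possibly spread across multiple date-stamped folders (depending on the time range set).
    • The script needs to be able to select a time range, not only from the timestamps in the file names, but also a sub-range read from the lines inside the files.
    • Time sections (quarters of an hour) without data should be reported with "zeroed" output.
    • The time format in the output, both in the report name and in the reported lines (per 15 minutes), needs to differ from the input format.
    • Lines to be processed need to meet a condition that the script must check.
    • The relevant data can be at different positions in the line.
    • The script needs to be python2.
    • The script has to take the (variable) difference between local time and UTC into account.
    • Extra options were added: an option to export to basic csv, and optional columns on/off.
    • Last but not least: the amount of data to process is huge; thousands of files, hundreds of thousands of lines per file, many GBs, millions of lines. In other words: the procedure has to be smart and efficient to process the data in a reasonable time.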
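Two of these requirements, the zeroed rows for empty sections and the shifted output time format, can be illustrated in isolation. The following is only a minimal sketch, not the script's own code: the names `to_epoch`, `fill_empty` and `to_local` are hypothetical, the report is assumed to be a dict keyed by section start, and stamps are assumed to be UTC (hence `calendar.timegm`).

```python
import calendar
import math
import time

READABLE = "%Y%m%d%H%M%S"        # input stamp format
OUTPUTFORMAT = "%d-%m-%Y %H:%M"  # output time format
INTERVAL = 900                   # 15 minutes, in seconds


def to_epoch(stamp):
    # Assumption: stamps are in UTC, so calendar.timegm is used.
    return calendar.timegm(time.strptime(stamp, READABLE))


def fill_empty(report, start, end):
    """Return a copy of report with a zeroed row for every section
    between start and end that has no data.

    report maps section start (epoch) -> [total, success, failed].
    """
    first = to_epoch(start) // INTERVAL * INTERVAL
    # float division, so ceil really rounds up (also under Python 2)
    last = int(math.ceil(to_epoch(end) / float(INTERVAL))) * INTERVAL
    filled = dict(report)
    for section in range(first, last, INTERVAL):
        filled.setdefault(section, [0, 0, 0])
    return filled


def to_local(section, shift_hours):
    # Format an epoch section start, shifted from UTC by shift_hours.
    return time.strftime(OUTPUTFORMAT,
                         time.gmtime(section + shift_hours * 3600))
```

With a shift of 3.5 hours, the section starting at 20161203000000 would be reported as 03-12-2016 03:30, in the same layout as the report lines.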

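    Explanation

    The end result is too comprehensive to explain in detail, but for those interested, the headlines:

    • All time calculations are done in epoch time (no surprise).
    • While reading the lines of a file, the first step is to check the condition per line, to immediately reduce the number of lines to process.
    • From those lines, the timestamp is converted to epoch, divided by 900 (seconds, 15 minutes), rounded down (taking int(n)), and multiplied by 900 again, which gives the start of the 15-minute section the line belongs to.
    • The lines are subsequently sorted and grouped with the help of itertools' groupby, and the results per group are produced with the help of ifilter (python2).
    • Reports are first created per file, since a report is per 15 minutes. The output of a per-file report cannot be more than a few dozen lines, so storing it temporarily in memory is not a problem.
    • Once all relevant files and lines have been processed this way, all per-file reports are merged into one final report.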
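The steps above can be sketched in a few lines. This is a simplified illustration rather than the script itself: the function names `to_section`, `report` and `merge` are made up for the example, stamps are assumed to be UTC (`calendar.timegm`, where the script uses `time.mktime`), and a plain generator expression stands in for `ifilter` so the sketch runs on both Python 2 and Python 3.

```python
import calendar
import time
from itertools import groupby
from operator import itemgetter

READABLE = "%Y%m%d%H%M%S"   # input stamp format, as in the question
INTERVAL = 900              # 15 minutes, in seconds


def to_epoch(stamp):
    # Assumption: stamps are in UTC, so calendar.timegm is used.
    return calendar.timegm(time.strptime(stamp, READABLE))


def to_section(stamp):
    # Floor the epoch time to the start of its 15-minute section.
    return to_epoch(stamp) // INTERVAL * INTERVAL


def report(records):
    """Build a per-file report from (stamp, result) string pairs.

    Returns rows of [section_start_epoch, total, success, failed].
    """
    # groupby only groups adjacent items, so sort by section first
    rows = sorted((to_section(s), r.strip()) for s, r in records)
    out = []
    for section, group in groupby(rows, key=itemgetter(0)):
        results = [r for _, r in group]
        success = sum(1 for r in results if r == "00")
        out.append([section, len(results), success, len(results) - success])
    return out


def merge(reports):
    """Sum several per-file reports into one, section by section."""
    rows = sorted(row for rep in reports for row in rep)
    merged = []
    for section, group in groupby(rows, key=itemgetter(0)):
        counts = [row[1:] for row in group]
        merged.append([section] + [sum(col) for col in zip(*counts)])
    return merged
```

Applied to the nine sample lines from the question, `report` puts three records (two of them successful) into the section starting at 20161203000000. Because each per-file report is only a short list, `merge` can sum them in memory without trouble.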

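    Despite the size of the data, the script does the job surprisingly well. While processing, the processor shows about 70% occupation on my 10+ year old system, running stable. The computer remains perfectly usable for other tasks.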

    The script

    #!/usr/bin/env python2
    import time
    import datetime
    from itertools import groupby, ifilter
    from operator import itemgetter
    import sys
    import os
    import math
    
    """
    folders by day stamp: 20161211 (yyyymmdd)
    files by full readable (start) time 20161211093512 (yyyymmddhhmmss) + header / tail
    records inside files by full start time 20161211093512 (yyyymmddhhmmss)
    commands are in UTC, report name and time section inside files: + timeshift
    """
    
    ################## settings  ##################
    
    # --- format settings (don't change) ---
    readable = "%Y%m%d%H%M%S"
    outputformat = "%d-%m-%Y %H:%M"
    dateformat = "%Y%m%d"
    
    #---------- time settings ----------
    # interval (seconds)
    interval = 900
    # time shift UTC <> local (hrs)
    timeshift = 3.5
    # start from (minutes from now in the past)
    backintime = 700
    
    # ---- dynamically set values -------
    # condition (string/position)
    iftrue = ["mies", 2]
    # relevant data (timestamp, result)
    data = [0, 1]
    # datafolder
    datafolder = "/home/jacob/Bureaublad/KasIII"
    
    # ----- output columns------
    # 0 = timestamp, 1 = total, 2 = SUCCESS, 3 = FAILS
    # don't change the order though, distances will mess up
    items = [0, 1, 2, 3]
    # include simple csv file
    csv = True
    
    ###############################################
    
    start = sys.argv[1]
    end = sys.argv[2]
    output_path = sys.argv[3]
    
    timeshift = timeshift*3600
    
    def extraday():
        """
        function to determine what folders possibly contain relevant files
        options: today or *also* yesterday
        """
        current_time = [
            getattr(datetime.datetime.now(), attr) \
            for attr in ['hour', 'minute']]
        minutes = (current_time[0]*60)+current_time[1]                   
        return backintime >= minutes
    
    
    def set_layout(line):
        # take care of a nice output format
        line = [str(s) for s in line]
        dist1 = (24-len(line[0]))*" "
        dist2 = (15-len(line[1]))*" "
        dist3 = (15-len(line[2]))*" "
        distances = [dist1, dist2, dist3, ""]
        displayed = "".join([line[i]+distances[i] for i in items])
        return displayed
    
    
    
    def convert_toepoch(pattern, stamp):
        """
        function to convert readable format (any) into epoch
        """
        return int(time.mktime(time.strptime(stamp, pattern)))
    
    def convert_toreadable(pattern, stamp, shift=0):
        """
        function to convert epoch into readable (any)
        possibly with a time shift
        """
        return time.strftime(pattern, time.gmtime(stamp+shift))
    
    def getrelevantfiles(backtime):
        """
        get relevant files from todays subfolder, from starttime in the past
        input format of backtime is minutes
        """
        allrelevant = []
        # current time, in epoch, to select files
        currt = int(time.time())
        dirs = [convert_toreadable(dateformat, currt)]
        # if backintime > today's "age", add yesterday
        if extraday():
            dirs.append(convert_toreadable(dateformat, currt-86400))
        print("Reading from: "+str(dirs))
        # get relevant files from folders
        for dr in dirs:
            try:
                relevant = [
                    [f, convert_toepoch(readable, f[7:21])]
                    for f in os.listdir(os.path.join(datafolder, dr))
                    ]
                allrelevant = allrelevant + [
                    os.path.join(datafolder, dr, f[0])\
                    for f in relevant if f[1] >= currt-(backtime*60)
                    ]
            except (IOError, OSError):
                print("Folder not found: "+dr)
        return allrelevant
    
    def readfile(file):
        """
        create the line list to work with, meeting the iftrue conditions
        select the relevant lines from the file, meeting the iftrue condition
        """
        lines = []
        with open(file) as read:
            for l in read:
                l = l.split(",")
                if l[iftrue[1]].strip() == iftrue[0]:
                    lines.append([l[data[0]], l[data[1]]])
        return lines
    
    def timeselect(lines):
        """
        select lines from a list that meet the start/end time
        input is the filtered list of lines, by readfile()
        """
        return [l for l in lines if int(start) <= int(l[0]) < int(end)]
    
    def convert_tosection(stamp):
        """
        convert the timestamp in a line to the section (start) it belongs to
        input = timestamp, output = epoch
        """
        rsection = int(convert_toepoch(readable, stamp)/interval)*interval
        return rsection
    
    reportlist = []
    
    foundfiles = getrelevantfiles(backintime)
    
    if foundfiles:
        # the actual work, first reports per file, add them to reportlist
        for f in foundfiles:
            # create report per file
            # get lines that match condition, match the end/start
            lines = timeselect(readfile(f))
            # get the (time) relevant lines inside the file
            for item in lines:
                # convert stamp to section
                item[0] = convert_tosection(item[0])
            lines.sort(key=lambda x: x[0])
            for item, occurrence in groupby(lines, itemgetter(0)):
                occ = list(occurrence)
                total = len(occ)
                # ifilter is python2 specific (the built-in filter in python3)
                success = len(list(ifilter(lambda x: x[1].strip() == "00", occ)))
                fails = total-success
                reportlist.append([item, total, success, fails])
    
        finalreport = []
    
        # then group the reports per file into one
        reportlist.sort(key=lambda x: x[0])
        for item, occurrence in groupby(reportlist, itemgetter(0)):
            occ = [it[1:] for it in list(occurrence)]
            output = [str(sum(i)) for i in zip(*occ)]
            output.insert(0, item)
            finalreport.append(output)
    
        # create timeframe to fill up empty sections
        framestart = int(convert_toepoch(readable, start)/interval)*interval
        # float division: python2's integer division would floor before ceil
        frameend = int(math.ceil(convert_toepoch(readable, end)/float(interval)))*interval
        timerange = list(range(framestart, frameend, interval))
        # a set keeps the membership test below cheap
        currlisted = set(r[0] for r in finalreport)
        extra = [item for item in timerange if item not in currlisted]
    
        # add missing time sections
        for item in extra:
            finalreport.append([item, 0, 0, 0])
        finalreport.sort(key=lambda x: x[0])
        print(str(len(finalreport))+" timesections reported")
    
        # define output file
        fname1 = convert_toreadable(
            readable,
            convert_toepoch(readable, start),
            timeshift) 
        fname2 = convert_toreadable(
            readable,
            convert_toepoch(readable, end),
            timeshift)
        filename = "report_"+fname1+"_"+fname2
        outputfile = os.path.join(output_path, filename)
        # edit the time stamp into the desired output format, add time shift
        with open(outputfile, "wt") as report:
            report.write(set_layout(["starttime", "total", "SUCCESS", "FAILED"])+"\n")
            for item in finalreport:
                item[0] = convert_toreadable(outputformat, item[0], timeshift)
                report.write(set_layout(item)+"\n")
        if csv:
            with open(outputfile+".csv", "wt") as csv_file:
                csv_file.write(",".join(["starttime", "total", "SUCCESS", "FAILED"])+"\n")
                for item in finalreport:
                    # zeroed sections hold ints, so convert every field to str
                    csv_file.write(",".join([str(i) for i in item])+"\n")
    
    else:
        print("no files to read")
    

    A small sample of the output

    starttime               total          SUCCESS        FAILED
    12-12-2016 03:30        2029           682            1347
    12-12-2016 03:45        2120           732            1388
    12-12-2016 04:00        2082           745            1337
    12-12-2016 04:15        2072           710            1362
    12-12-2016 04:30        2004           700            1304
    12-12-2016 04:45        2110           696            1414
    12-12-2016 05:00        2148           706            1442
    12-12-2016 05:15        2105           704            1401
    12-12-2016 05:30        2040           620            1420
    12-12-2016 05:45        2030           654            1376
    12-12-2016 06:00        2067           692            1375
    12-12-2016 06:15        2079           648            1431
    12-12-2016 06:30        2030           706            1324
    12-12-2016 06:45        2085           713            1372
    12-12-2016 07:00        2064           726            1338
    12-12-2016 07:15        2113           728            1385
    