grep a group of lines from $START to $END that contains a match in $MIDDLE

Question

WashichawbachaW

Asked: 2018-01-05 01:12:48 +0800 CST

text processing - How to get multiple patterns from a file, in order


I have this file.txt.Z, which contains:

AK2*856*1036~AK3*TD1*4**~AK4*2**1*~AK4*7**1*~AK3*TD5*5**~AK4*3**6*2~AK3*REF*6**~AK4*2**1*~AK3*REF*7**~AK4*2**1*~AK3*REF*8**~AK4*2**1*~AK3*DTM*9**~AK4*2**4*20~AK4*2**4*20~AK3*CTT*12**7~AK5*R
AK2*856*1037~AK3*HL*92**~AK4*3**7*O~AK5*R~AK9*R*2*2*0~SE*25*0001~GE*1*211582~IEA*1*000211582

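Each record contains several fields that start with a header (AK, usually followed by a number) and are separated by ~. If you replace each ~ with a newline followed by indentation, it looks like this: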

AK2*856*1036
  AK3*TD1*4**
  AK4*2**1*
  AK4*7**1*
  AK3*TD5*5**
  AK4*3**6*2
  AK3*REF*6**
  AK4*2**1*
  AK3*REF*7**
  AK4*2**1*
  AK3*REF*8**
  AK4*2**1*
  AK3*DTM*9**
  AK4*2**4*20
  AK4*2**4*20
  AK3*CTT*12**7
  AK5*R
AK2*856*1037
  AK3*HL*92**
  AK4*3**7*O
  AK5*R
  AK9*R*2*2*0
  SE*25*0001
  GE*1*211582
  IEA*1*000211582

Each field has subfields separated by *. For example, subfield AK201 is the first subfield after the AK2 header, so it is 856 for the example lines.
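To make the naming scheme concrete, here is a minimal Python sketch (not part of the original post) that looks up a subfield by its name. The helper name subfield is hypothetical, and it assumes every name is a three-character header followed by a two-digit position, as in AK201.

```python
def subfield(record, name):
    """Return the first subfield called `name` (e.g. "AK201"), or "" if absent."""
    header, index = name[:3], int(name[3:])   # "AK201" -> ("AK2", 1)
    for field in record.strip().split("~"):   # fields are separated by ~
        parts = field.split("*")              # subfields are separated by *
        if parts[0] == header and index < len(parts):
            return parts[index]
    return ""


record = "AK2*856*1036~AK3*TD1*4**~AK4*2**1*~AK5*R"
print(subfield(record, "AK201"))        # 856
print(subfield(record, "AK202"))        # 1036
print(repr(subfield(record, "AK502")))  # '' because AK5*R has no second subfield
```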

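As you can see, there are 2 lines that start with the string AK2. It is like a line header, or what we call a segment header. There are two segment headers in file.txt.Z. What I want is to get the following data from each segment, in order:

Required data: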

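AK202 (the second subfield after the AK2 header) - AK2*856*this_numeric_value, up to the next * or ~.
AK301 (the first subfield after the AK3 header) - ~AK3*this_string_value, up to the next * or ~.
AK502 (the second subfield after the AK5 header) - ~AK5*some_string_value*this_numeric_value, up to the next * or ~.
AK401 (the first subfield after the AK4 header) - ~AK4*this_numeric_value, up to the next * or ~.
Every numeric value taken from an AK4 or AK5 field should always have at least 2 digits. For example, AK502 = 2 becomes AK502 = 02, and AK401 = 9 becomes AK401 = 09.
If there is no AK3 field, output nothing. (I already have a script for that.)
If a line contains multiple AK3-AK5-AK4 sequences, they should be joined with spaces.
If the AK5 field is missing after the AK3 field, look for the AK4 field instead.
If there is neither an AK4 nor an AK5 field after the AK3 field, output only AK301 (the first subfield after the AK3 header).
If there are multiple AK4 fields after an AK3 field, join the AK502-AK401 sequences with commas.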
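The rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not code from the thread: the names pad and summarize are hypothetical, each record is assumed to start with an AK2 field that has at least two subfields, codes are appended in the order the AK4/AK5 fields appear, and identical adjacent groups (such as the three REF groups in the first sample line) are collapsed into one, which is inferred from the expected output rather than stated in the rules.

```python
def pad(value):
    """Zero-pad a purely numeric value to at least 2 digits; otherwise return ""."""
    return value.zfill(2) if value.isdigit() else ""


def summarize(record):
    """Return the "GS: ..." line for one record, or "" if it has no AK3 field."""
    fields = [f.split("*") for f in record.strip().split("~")]
    ak202 = fields[0][2]              # assumes the record starts with AK2*x*y
    groups = []                       # one [AK301, [codes]] entry per AK3 field
    for parts in fields[1:]:
        if parts[0] == "AK3":
            groups.append([parts[1], []])
        elif groups and parts[0] in ("AK4", "AK5"):
            index = 1 if parts[0] == "AK4" else 2   # AK401 or AK502
            code = pad(parts[index]) if index < len(parts) else ""
            if code:
                groups[-1][1].append(code)
    if not groups:
        return ""
    words = []
    for name, codes in groups:
        word = name + ",".join(codes)
        if not words or words[-1] != word:   # assumption: drop adjacent repeats
            words.append(word)
    return "GS: " + ak202 + " - " + " ".join(words)


sample = "AK2*856*1037~AK3*HL*92**~AK4*3**7*O~AK5*R~AK9*R*2*2*0~SE*25*0001"
print(summarize(sample))  # GS: 1037 - HL03
```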

Output:

GS: 1036 - TD102,07 TD503 REF02 DTM02,02 CTT
GS: 1037 - HL03

How can I do this? Just ask me if anything in my question is confusing.

EDIT: Here is my code. It runs inside a while loop:

while read FILE
do
    AK2=`zgrep -oP 'AK2.[\w\s\d]*.\K[\w\s\d]*' < $FILE`
    AK3=`zgrep -oP 'AK3.\K[\w\s\d]*' < $FILE`
    AK5=`zgrep -oP 'AK5.[\w\s\d]*.\K[\w\s\d]' < $FILE`
    AK5_ERROR=`if [[ $AK5 =~ ^[0-9]+$ ]]; then  printf "%02d" $AK5 2> /dev/null; else 2> /dev/null; fi`
    AK4=`zgrep -oP 'AK4.\K[\w\s\d]*' < $FILE`
    AK4_ERROR=`if [[ $AK4 =~ ^[0-9]+$ ]]; then  printf "%02d" $AK4 2> /dev/null; else 2> /dev/null; fi`

    if [[ $AK3 ]]
    then
        if $AK5 2> /dev/null
        then
            echo "GS: $AK2 - $AK3$AK4_ERROR"
        else
            echo "GS: $AK2 - $AK3$AK5_ERROR"
        fi
    else
        echo "Errors are not specified in the file."
    fi
done < file.txt.Z

The problem with my original code is that it does not concatenate $AK3 with $AK5 or $AK4.

2 Answers


cas · Answer 1 · 2018-01-05T22:13:44+08:00

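The following perl script produces exactly the sample output when given the sample input.

It may not do exactly what you need on your real data files, and it is not presented as a complete, working solution. It is offered as a base to start working from - play with the script, mess around with it, break it, fix it, change it to do what you want.

It is undoubtedly far from optimal, but without more detailed knowledge of, or a better explanation of, your input data and required output, it would be hard to improve on it.

It processes each input line (aka a "record", or a "segment" to use your terminology) and builds a string that is printed once the record has been processed. Each output line is constructed according to the specification in the "Required data" section of your question.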

#!/usr/bin/perl

use strict;

while(<>) {
  next unless /AK3/;  # skip lines that don't contain AK3

  # process each "segment" aka "record".
  my @fields = split /~/;

  # get segment "header" and 2nd sub-field of that header.
  my @segment = split(/\*/,$fields[0]);
  my $segment_header = $segment[2];
  shift @fields;

  my $output = "GS: $segment_header -";

  my $groupoutput = ''; # output for a given AK3 "group"
  my $last_go = ''; # used to avoid duplicates like "REF02 REF02 REF02"

  foreach my $f (@fields) {
    my @subfields = split /\*/,$f;

    if ($f =~ m/^AK3/) {

        if (($groupoutput) && ($groupoutput ne $last_go)) {
          $output .= " $groupoutput";
          $last_go = $groupoutput;  # remember the most recent $groupoutput
        };

        $groupoutput = $subfields[1];

    } elsif ($f =~ m/^AK4/) {
        my $ak401 = $subfields[1];
        $groupoutput .= sprintf("%02i,",$ak401) if ($ak401 > 0);
    } elsif ($f =~ m/^AK5/) {
        my $ak502 = $subfields[2];
        $groupoutput .= sprintf("%02i",$ak502) if ($ak502 > 0);
    };
  };

  # append the group output generated since the last seen AK3 (if any)
  # i.e. don't forget to print the final group on the line.
  $output .= " $groupoutput" if (($groupoutput) && ($groupoutput ne $last_go));

  # clean up output string before printing.
  $output =~ s/, / /g;
  $output =~ s/\s*$|,$//;

  print $output, "\n";
}

I saved this script as mysteryprocess.pl because I couldn't think of a more appropriate name. I then ran it with your sample data (in a file called input):

Sample output:

$ ./mysteryprocess.pl input 
GS: 1036 - TD102,07 TD503 REF02 DTM02,02 CTT
GS: 1037 - HL03

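That "REF02 REF03 REF02" was bothering me, so here is another version. This one uses an array and a hash (@groups and %groups) to build the output line, and another hash (%gseen) to prevent duplicates within a record by remembering the values we have already seen and included in the output.

The group data is stored in %groups, but hashes are unordered in perl, so the @groups array is used to remember the order in which we first saw each group.

BTW, %groups should probably be a hash of arrays, aka an HoA (i.e. a hash containing an array in each element). That would avoid the need to clean up $output before printing, by using perl's join() function rather than simply appending a comma and the new value to a string. But I think this script is already complicated enough for a perl novice to understand.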

#!/usr/bin/perl

use strict;

while(<>) {
  next unless /AK3/;  # skip lines that don't contain AK3

  # process each "segment" aka "record".
  my @fields = split /~/;

  # get segment "header" from 1st field,  and then 2nd sub-field of that header.
  # NOTE: "shift" returns the first field of an array AND removes it from
  # the array.
  my @segment = split(/\*/, shift @fields);
  my $segment_header = $segment[2];

  my $output = "GS: $segment_header -";

  my @groups=(); # array to hold each group name (ak301) in the order that
                 # we see them
  my %groups=(); # hash to hold the ak401/ak502 values for each group
  my %gseen =(); # used to avoid dupes by holding specific values of ak301+ak401
                 # and ak301+ak502 that we've seen before.

  my $ak301='';

  foreach my $f (@fields) {
    my @subfields = split /\*/, $f;

    if ($f =~ m/^AK3/) {

        $ak301 = $subfields[1];
        if (!defined($groups{$ak301})) {
          push @groups, $ak301;
        };

    } elsif ($f =~ m/^AK4/) {

        my $ak401 = sprintf("%02i",$subfields[1]);
        $ak401 = '' if ($ak401 == 0);
        next if ($gseen{$ak301.'ak4'.$ak401});

        if (!defined($groups{$ak301})) {
          $groups{$ak301} = $ak401;
        } else {
          $groups{$ak301} .= ',' . $ak401;
        };
        $gseen{$ak301.'ak4'.$ak401}++;

    } elsif ($f =~ m/^AK5/) {

        my $ak502 = sprintf("%02i",$subfields[2]);  # AK502 is the 2nd sub-field after the header
        $ak502 = '' if ($ak502 == 0);
        next if ($gseen{$ak301.'ak5'.$ak502});

        if (!defined($groups{$ak301})) {
          $groups{$ak301} = $ak502;
        } else {
          $groups{$ak301} .= ',' . $ak502;
        };
        $gseen{$ak301.'ak5'.$ak502}++;

    };
  };

  # construct the output string in the order we first saw each group
  foreach my $group (@groups) {
    $output .= " $group" . $groups{$group};
  };

  # clean up output string before printing.
  $output =~ s/, |  +/ /g;
  $output =~ s/\s*$|,$//;

  print $output, "\n";
}

With the following input:

AK2*856*1036~AK3*TD1*4**~AK4*2**1*~AK4*7**1*~AK3*TD5*5**~AK4*3**6*2~AK3*REF*6**~AK4*2**1*~AK3*REF*7**~AK4*2**1*~AK3*REF*8**~AK4*2**1*~AK3*DTM*9**~AK4*2**4*20~AK4*2**4*20~AK3*CTT*12**7~AK5*R
AK2*856*1037~AK3*HL*92**~AK4*3**7*O~AK5*R~AK9*R*2*2*0~SE*25*0001~GE*1*211582~IEA*1*000211582
AK2*856*1099~AK3*TD1*4**~AK4*2**1*~AK4*7**1*~AK3*TD5*5**~AK4*3**6*2~AK3*REF*6**~AK4*2**1*~AK3*REF*7**~AK4*2**1*~AK3*REF*8**~AK4*3**1*~AK3*REF*8**~AK4*2**1*~AK3*DTM*9**~AK4*2**4*20~AK4*2**4*20~AK3*CTT*12**7~AK5*R

The output is now:

$ ./mysteryprocess.pl input 
GS: 1036 - TD102,07 TD503 REF02 DTM02 CTT
GS: 1037 - HL03
GS: 1099 - TD102,07 TD503 REF02,03 DTM02 CTT

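Notes:

DTM02,02 also collapses to just DTM02. Duplicate elimination now happens for everything.
The merging of groups (i.e. elements with the same AK301 "name") also happens no matter where in the record the elements appear. The previous version only merged adjacent fields/subfields, and only if they were identical.

I'm not sure whether these changes are what you want.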
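For comparison, the hash-of-arrays idea mentioned earlier can be sketched in Python. This is an illustration only, not a translation endorsed by the answer: the name summarize_merged is hypothetical, a dict of lists stands in for the HoA, Python 3.7+ dicts keep insertion order so no separate order array is needed, and a set of (group, header, code) tuples plays the role of %gseen. Because the codes are joined at the end, no cleanup of stray commas is required.

```python
def summarize_merged(record):
    """Merge all AK3 groups with the same name and drop duplicate codes."""
    fields = [f.split("*") for f in record.strip().split("~")]
    groups = {}        # AK301 -> list of padded codes, in first-seen order
    seen = set()       # (AK301, header, code) combinations already emitted
    current = None
    for parts in fields[1:]:
        if parts[0] == "AK3":
            current = parts[1]
            groups.setdefault(current, [])
        elif current is not None and parts[0] in ("AK4", "AK5"):
            index = 1 if parts[0] == "AK4" else 2   # AK401 or AK502
            value = parts[index] if index < len(parts) else ""
            if value.isdigit() and int(value) > 0:
                key = (current, parts[0], value.zfill(2))
                if key not in seen:
                    seen.add(key)
                    groups[current].append(key[2])
    if not groups:
        return ""      # no AK3 field: nothing to report
    words = [name + ",".join(codes) for name, codes in groups.items()]
    return "GS: " + fields[0][2] + " - " + " ".join(words)


line = "AK2*856*1099~AK3*REF*6**~AK4*2**1*~AK3*REF*8**~AK4*3**1*~AK3*REF*8**~AK4*2**1*"
print(summarize_merged(line))  # GS: 1099 - REF02,03
```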

PS: if you don't have perl installed, this code would be easy to translate to awk. It is a very simple (even simplistic), straightforward algorithm.

Guy · Answer 2 · 2018-01-06T10:46:28+08:00

Another go, showing an awk version as cas suggested. It could probably be done more neatly, but it was a learning experience anyway.

#!/usr/bin/awk -f

function get_slice(elem, fc,       tmpArr) {
        split(elem, tmpArr, "*")
        return tmpArr[fc]
    }

BEGIN { FS="~" }

/AK2/ { 
    res = get_slice($1, 3) " - "
    tmpStr = ""
    # only continue with this line if there are any AK3 fields.
    # otherwise may as well skip whole thing.
    if (match($0, /AK3/)) {
        loc=2
        for (loc=2; loc<=NF; loc++)
            if ($loc ~ /AK3/) break

        for ( ; loc<=NF; loc++) {
            if ($loc ~ /AK3/) {
                # check to see whether the previous loop generated a duplicate
                # tmpStr will be "" the first time
                if (index(res, tmpStr) == 0)
                    res = res " " tmpStr
                tmpStr = get_slice($loc, 2)
                # c is a count of how many fields have been added after AK3.
                # once positive, "," will be added.
                c = 0
                }
            # add the other fields
            else if ($loc ~ /AK4/) { 
                if ((s = get_slice($loc, 2)) != "")
                    tmpStr = tmpStr sprintf("%s%02d", c++ ? "," : "", s) 
            } else if ($loc ~ /AK5/) { 
                if ((s = get_slice($loc, 3)) != "")
                    tmpStr = tmpStr sprintf("%s%02d", c++ ? "," : "", s) 
            }
        }
        # this is repeated at the end, to make sure the final set is printed.
        if (index(res, tmpStr) == 0)
            res = res " " tmpStr
        print res
        }
    }

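It simply splits each line into fields on "~" up front, then loops over all the available fields of the line. A field is split into subfields on "*" only when it is actually needed, to get the requested element. 'get_slice' returns "" if nothing is found, so that has to be checked.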
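The same lazy-splitting idea, sketched in Python for readers who don't know awk. This is an assumed equivalent, not part of the answer: get_slice here mirrors the awk function's 1-based indexing, and str.startswith is used where the awk script uses a looser regex match.

```python
def get_slice(elem, fc):
    """Return the fc-th "*"-separated subfield of elem (1-based, as in awk), or ""."""
    parts = elem.split("*")
    return parts[fc - 1] if 1 <= fc <= len(parts) else ""


line = "AK2*856*1037~AK3*HL*92**~AK4*3**7*O~AK5*R"
for field in line.split("~"):              # split on ~ once, up front
    if field.startswith("AK4"):
        print(get_slice(field, 2))         # 3
    elif field.startswith("AK5"):
        print(repr(get_slice(field, 3)))   # '' because AK5*R has no third subfield
```

The empty-string return is what lets the caller skip missing subfields instead of formatting them as 00.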

I think I have understood the question..
