
Question

Evgeniy Berezovsky

Asked: 2017-12-18 16:07:40 +0800 CST

Filter xml documents matching a specific ID


Suppose you have a file containing many xml documents, e.g.

<a>
  <b>
  ...
</a>
in between xml documents there may be plain text log messages
<x>
  ...
</x>

...

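How would I go about filtering this file to show only those xml documents where a given regex matches any one of the lines of that xml document? I'm talking about a simple textual match here, so the regex matching part may as well be totally ignorant of the underlying format (xml).

You can assume that the opening and closing tags of the root element are always on their own line (though they may be padded with white-space), and that they are only used as root elements, i.e. tags with the same name do not occur below the root element. This should make it possible to get the job done without having to resort to xml-aware tools.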

1 Answer


igal · Answer 1 · 2017-12-18T16:53:10+08:00

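Summary

I wrote a Python solution, a Bash solution, and an Awk solution. The idea is the same in all of the scripts: go through the input line by line and use flag variables to keep track of state (i.e. whether we are currently inside an XML subdocument and whether we have found a matching line).

In the Python script I read all of the lines into a list and keep track of the list index where the current XML subdocument begins, so that I can print out the current subdocument when we reach the closing tag. I check each line against the regex pattern and use a flag to record whether to output the current subdocument once we're done processing it.

In the Bash script I use a temporary file as a buffer to store the current XML subdocument, and wait until it has been completely written before using grep to check whether it contains a line matching the given regex.

The Awk script is similar to the Bash script, but I use an Awk array for the buffer instead of a file.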
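
The flag-based idea described above can also be sketched as a streaming Python 3 generator that buffers only the current subdocument instead of reading the whole file. This is a rough illustration rather than one of the three scripts: the function name filter_xml_docs and its parameters are assumptions made up for this sketch.

```python
import re


def filter_xml_docs(lines, pattern, invert=False):
    """Yield each XML subdocument (a list of lines) that has a matching line.

    Hypothetical helper for illustration; not part of the original scripts.
    """
    regex = re.compile(pattern)
    opening = re.compile(r'^<(\w+)>$')
    buffer = []          # lines of the current subdocument
    closing_tag = None   # None means we are between subdocuments
    matched = False      # True once a line of the subdocument matched

    for line in lines:
        stripped = line.strip()
        if closing_tag is None:
            # Between subdocuments: look for a root opening tag on its own line
            m = opening.match(stripped)
            if m:
                closing_tag = '</' + m.group(1) + '>'
                buffer = [line]
                matched = False
        else:
            buffer.append(line)
            if stripped == closing_tag:
                # End of the subdocument: emit it if the flags say so
                if matched != invert:
                    yield buffer
                closing_tag = None
            elif regex.search(stripped):
                matched = True


sample = """<a>
  <b>
    string to search for: stuff
  </b>
</a>
in between xml documents there may be plain text log messages
<x>
    unicode string: øæå
</x>
""".splitlines(keepends=True)

# Print every subdocument that contains a line matching "stuff"
for doc in filter_xml_docs(sample, 'stuff'):
    print(''.join(doc), end='')
```

Because it is a generator, the same function could be fed an open file object or sys.stdin directly, which would avoid holding more than one subdocument in memory at a time.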

Test Data File

I checked the scripts against the following data file (data.xml), based on the example data given in your question:

<a>
  <b>
    string to search for: stuff
  </b>
</a>
in between xml documents there may be plain text log messages
<x>
    unicode string: øæå
</x>

Python Solution

Here is a simple Python script that does what you want:

#!/usr/bin/env python2
# -*- encoding: ascii -*-
"""xmlgrep.py"""

import sys
import re

invert_match = False

if sys.argv[1] == '-v' or sys.argv[1] == '--invert-match':
    invert_match = True
    sys.argv.pop(0)

regex = sys.argv[1]

# Open the XML-ish file
with open(sys.argv[2], 'r') if len(sys.argv) > 2 else sys.stdin as xmlfile:

    # Read all of the data into a list
    lines = xmlfile.readlines()

    # Use flags to keep track of which XML subdocument we're in
    # and whether or not we've found a match in that document
    start_index = closing_tag = regex_match = False

    # Iterate through all the lines
    for index, line in enumerate(lines):

        # Remove trailing and leading white-space
        line = line.strip()

        # If we have a start_index then we're inside an XML document
        if start_index is not False:

            # If this line is a closing tag then reset the flags
            # and print the document if we found a match
            if line == closing_tag:
                if regex_match != invert_match:
                    print(''.join(lines[start_index:index+1]))
                start_index = closing_tag = regex_match = False

            # If this line is NOT a closing tag then we
            # search the current line for a match
            elif re.search(regex, line):
                regex_match = True

        # If we do NOT have a start_index then we're either at the
        # beginning of a new XML subdocument or we're in between
        # XML subdocuments
        else:

            # Check for an opening tag for a new XML subdocument
            match = re.match(r'^<(\w+)>$', line)
            if match:

                # Store the current line number
                start_index = index

                # Construct the matching closing tag
                closing_tag = '</' + match.groups()[0] + '>'

Here is how you would run the script to search for the string "stuff":

python xmlgrep.py stuff data.xml

And here is the output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

Here is how you would run the script to search for the string "øæå":

python xmlgrep.py øæå data.xml

And here is the output:

<x>
    unicode string: øæå
</x>

You can also specify -v or --invert-match to search for non-matching documents, and read from standard input:

cat data.xml | python xmlgrep.py -v stuff
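
The script decides whether to print a subdocument with the test regex_match != invert_match, which amounts to a logical XOR of the two flags. The short Python 3 sketch below spells out that truth table; the helper name should_print is made up for this illustration and does not appear in the script.

```python
def should_print(regex_match, invert_match):
    """Return True when exactly one of the two flags is set (logical XOR)."""
    return regex_match != invert_match


# Walk through all four combinations of the two flags
for regex_match in (False, True):
    for invert_match in (False, True):
        print(regex_match, invert_match, should_print(regex_match, invert_match))
```

Without -v a document is printed only if some line matched; with -v it is printed only if no line matched.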

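Bash Solution

Here is a bash implementation of the same basic algorithm. It uses flags to keep track of whether the current line belongs to an XML document, and uses a temporary file as a buffer to store each XML document as it is being processed.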

#!/usr/bin/env bash
# xmlgrep.sh

# Get the filename and search pattern from the command-line
FILENAME="$1"
REGEX="$2"

# Use flags to keep track of which XML subdocument we're in
XML_DOC=false
CLOSING_TAG=""

# Use a temporary file to store the current XML subdocument,
# and remove it when the script exits
TEMPFILE="$(mktemp)"
trap 'rm -f "${TEMPFILE}"' EXIT

# Reset the internal field separator to preserve white-space
export IFS=''

# Iterate through all the lines of the file; -r keeps backslashes literal
while read -r LINE; do

    # If we're already in an XML subdocument then update
    # the temporary file and check to see if we've reached
    # the end of the document
    if "${XML_DOC}"; then

        # Append the line to the temp-file
        echo "${LINE}" >> "${TEMPFILE}"

        # If this line is a closing tag then reset the flags
        if echo "${LINE}" | grep -Pq '^\s*'"${CLOSING_TAG}"'\s*$'; then
            XML_DOC=false
            CLOSING_TAG=""

            # Print the document if it contains the match pattern 
            if grep -Pq "${REGEX}" "${TEMPFILE}"; then
                cat "${TEMPFILE}"
            fi
        fi

    # Otherwise we check to see if we've reached
    # the beginning of a new XML subdocument
    elif echo "${LINE}" | grep -Pq '^\s*<\w+>\s*$'; then

        # Extract the tag-name
        TAG_NAME="$(echo "${LINE}" | sed 's/^\s*<\(\w\+\)>\s*$/\1/;tx;d;:x')"

        # Construct the corresponding closing tag
        CLOSING_TAG="</${TAG_NAME}>"

        # Set the XML_DOC flag so we know we're inside an XML subdocument
        XML_DOC=true

        # Start storing the subdocument in the temporary file 
        echo "${LINE}" > "${TEMPFILE}"
    fi
done < "${FILENAME}"

Here is how you could run the script to search for the string "stuff":

bash xmlgrep.sh data.xml 'stuff'

And here is the corresponding output:

<a>
  <b>
    string to search for: stuff
  </b>
</a>

Here is how you could run the script to search for the string "øæå":

bash xmlgrep.sh data.xml 'øæå'

And here is the corresponding output:

<x>
    unicode string: øæå
</x>
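
For comparison, the buffer-then-search approach of the Bash script can be sketched in Python 3 with an in-memory list standing in for the temporary file. This is only an illustration: the function name grep_buffered_docs is an assumption made up here. As far as I can tell from reading the scripts, one consequence of searching the finished buffer is that the opening and closing tag lines are searched too, whereas the Python script above never tests those two lines against the regex.

```python
import re


def grep_buffered_docs(lines, pattern):
    """Return the subdocuments whose buffered lines contain a match.

    Hypothetical helper mirroring the Bash script's buffer-then-search idea,
    with a list used in place of the temporary file.
    """
    regex = re.compile(pattern)
    opening = re.compile(r'^\s*<(\w+)>\s*$')
    results = []
    buffer = None    # None means we are between subdocuments
    closing = None   # compiled regex for the current closing tag

    for line in lines:
        if buffer is None:
            m = opening.match(line)
            if m:
                buffer = [line]
                closing = re.compile(
                    r'^\s*</' + re.escape(m.group(1)) + r'>\s*$')
        else:
            buffer.append(line)
            if closing.match(line):
                # Search the finished buffer line by line, as grep would
                if any(regex.search(buffered) for buffered in buffer):
                    results.append(buffer)
                buffer = None
    return results


lines = ['<a>\n', '  text\n', '</a>\n', 'log line\n',
         '<x>\n', '  other\n', '</x>\n']
print(len(grep_buffered_docs(lines, 'text')))   # 1: only the <a> document
print(len(grep_buffered_docs(lines, '</x>')))   # 1: the closing tag line is searched too
```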

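Awk Solution

Here is an awk solution. My awk isn't great though, so it's pretty rough. It uses the same basic idea as the Bash and Python scripts. It stores each XML document in a buffer (an awk array) and uses flags to keep track of state. When it finishes processing a document, it prints it if it contains any lines matching the given regular expression. Here is the script: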

#!/usr/bin/env gawk
# xmlgrep.awk

# Variables:
#
#   XML_DOC
#       XML_DOC=1 if the current line is inside an XML document.
#
#   CLOSING_TAG
#       Stores the closing tag for the current XML document.
#
#   BUFFER_LENGTH
#       Stores the number of lines in the current XML document.
#
#   MATCH
#       MATCH=1 if we found a matching line in the current XML document.
#
#   PATTERN
#       The regular expression pattern to match against (given as a command-line argument).
#

# Initialize Variables
BEGIN{
    XML_DOC=0;
    CLOSING_TAG="";
    BUFFER_LENGTH=0;
    MATCH=0;
}
{
    if (XML_DOC==1) {

        # If we're inside an XML block, add the current line to the buffer
        BUFFER[BUFFER_LENGTH]=$0;
        BUFFER_LENGTH++;

        # If we've reached a closing tag, reset the XML_DOC and CLOSING_TAG flags
        if ($0 ~ CLOSING_TAG) {
            XML_DOC=0;
            CLOSING_TAG="";

            # If there was a match then output the XML document.
            # Iterate by index: "for (i in BUFFER)" does not guarantee order.
            if (MATCH==1) {
                for (i = 0; i < BUFFER_LENGTH; i++) {
                    print BUFFER[i];
                }
            }
        }
        # If we found a matching line then update the MATCH flag
        else {
            if ($0 ~ PATTERN) {
                MATCH=1;
            }
        }
    }
    else {

        # If we reach a new opening tag then start storing the data in the buffer
        if ($0 ~ /<[a-z]+>/) {

            # Set the XML_DOC flag
            XML_DOC=1;

            # Reset the buffer
            delete BUFFER;
            BUFFER[0]=$0;
            BUFFER_LENGTH=1;

            # Reset the match flag
            MATCH=0;

            # Compute the corresponding closing tag
            match($0, /<([a-z]+)>/, match_groups);
            CLOSING_TAG="</" match_groups[1] ">";
        }
    }
}

Here is how you would call it:

gawk -v PATTERN="øæå" -f xmlgrep.awk data.xml

And here is the corresponding output:

<x>
    unicode string: øæå
</x>
