
Asked: 2024-05-03 11:12:39 +0800 CST

How to remove double quotes inside a double-quoted field value in a .dat file


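I have a text file with about 15 columns. The fields are separated by commas. One column, a description, is enclosed in double quotes, and some words inside it are also enclosed in double quotes. I need to keep the opening and closing double quotes and remove only the inner ones.

Something like this: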

"Hi there, we are from XYZ team, we have an "Opportunity" at our organization"

I need the output to be:

"Hi there, we are from XYZ team, we have an Opportunity at our organization"

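I would rather not resort to Python programming. I am looking for an awk command or any other good option.

The file may have 100 rows of data, but the description column uses double quotes in only a few of those rows, not all 100.

Here is some sample data: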

invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number

1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,279.6,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included"  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

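I have to remove the double quotes around "Material Included" in the line description.

Please note: I need the whole file with all columns kept, and only the inner double quotes in the line description value removed. Only the line description field has such inner double-quoted values. So far, each line description in the file contains only one inner double-quoted word or phrase; we have not noticed more than one.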

4 Answers


Kusalananda · Answer 1 · 2024-05-03T23:27:57+08:00

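Note: I have not used the data provided in the question, because the number of header fields does not seem to match the number of data fields. Instead, I use printf to create a simple data set with the same quoting issue as described in the question.

Using Miller (mlr) as shown below, you can convert the problematic embedded double quotes into properly CSV-encoded embedded double quotes. This involves doubling each embedded double quote character: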

$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"'
a,b,c
aaa,"bb "bb" bb","c"cc"
$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"' | mlr --csv --lazy-quotes cat
a,b,c
aaa,"bb ""bb"" bb","c""cc"

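This creates a CSV document that any CSV-aware parser can read correctly, with the embedded quotes preserved.

To remove the embedded double quotes altogether, you can use Miller like this: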

$ printf '%s\n' a,b,c 'aaa,"bb "bb" bb","c"cc"' | mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'
a,b,c
aaa,bb bb bb,ccc

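This uses mlr to iterate over all fields in all records and remove any double quote characters found.

If a field needs quoting because it contains a comma, Miller will quote it: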

$ printf '%s\n' a,b,c 'aaa,"b,b "bb" bb","c"cc"' | mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'
a,b,c
aaa,"b,b bb bb",ccc

The Miller command again, on its own:

mlr --csv --lazy-quotes put 'for (k,v in $*) { $[k] = gssub(v, "\"", "") }'

If you know the name of the field that contains the quotes to remove, for example line description, you can simplify the command and drop the loop:

mlr --csv --lazy-quotes put '$["line description"] = gssub($["line description"], "\"", "")'

kos · Answer 2 · 2024-05-04T09:08:39+08:00

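Instead of removing the unescaped double quotes from the input (I guess it would be better to preserve them), you could convert the malformed file into proper, standard "double-quote-escaped double-quote characters" (my apologies) CSV, in which a doubled double quote ("") escapes a double quote inside a quoted text field.

This can be done automatically with Perl's Text::CSV module, addressing the whole file without having to address specific lines or fields. The module is not installed by default on Ubuntu (sudo apt install libtext-csv-perl); IIRC it is installed by default on openSUSE Tumbleweed (otherwise zypper se, I guess). In any case it is a pretty standard module and should be available in most or all Linux distributions, and of course it can be installed via CPAN on any system lacking it.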

perl -Mstrict -M'Text::CSV qw(csv)' -we '
    csv(
        in => csv(
            in => "in",
            allow_loose_quotes => 1,
            escape_char => undef(),
        )
    );
'

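-Mstrict and -w are there mostly because including them is standard practice (at least when writing more complex Perl scripts); they are not really needed in this case.

What it does:

It opens a file named "in" and reads it as CSV without interpreting any character as the escape character, which by default is the quote character ("). This is the trick that lets the parser read a " as a regular character while inside a quote-delimited text field. Combined with allow_loose_quotes, which tells the parser not to complain when it reads an unescaped quote character inside a text field, this effectively forces the parser to read the content of text fields verbatim. The output CSV is then generated with the standard options (including quoting text fields and doubling double quotes inside text fields where needed) and printed to STDOUT.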

% cat in
invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number

1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included"  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10
% perl -Mstrict -M'Text::CSV qw(csv)' -we '
        csv(
                in => csv(
                        in => "in",
                        escape_char => undef(),
                        allow_loose_quotes => 1,
                )
        );
'
"invoice number","invoice date","vendor number","vendor site ID","supplier site CODE","invoice description","invoice currency code","invoice total amount","line number","line amount","line description","account code","business unit","business center",department,"issue code",project,"task number"

1686,2024-03-28,258,9845,NEWYORK,"CA Project: Content",USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research ""Material Included""  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10
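Where Text::CSV is not available, the same doubling can be approximated with plain awk. This is only a rough sketch and not a CSV parser: it assumes that each line has at most one quoted field, and it treats everything between the first and the last double quote on the line as that field. The function name double_inner_quotes is a hypothetical helper, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: double every quote found between the first and the
# last double quote of a line. Assumes at most one quoted field per line.
double_inner_quotes() {
    awk '
        match($0, /".*"/) {
            # Text between the first and the last double quote on the line.
            fld = substr($0, RSTART + 1, RLENGTH - 2)
            # CSV escapes an embedded quote by doubling it.
            gsub(/"/, "\"\"", fld)
            $0 = substr($0, 1, RSTART) fld substr($0, RSTART + RLENGTH - 1)
        }
        { print }
    ' "$@"
}

printf '%s\n' '1,"Research "Material Included" and artwork",230' | double_inner_quotes
```

With a file name as argument (double_inner_quotes in), it reads that file instead of standard input.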

dbran · Answer 3 · 2024-05-04T01:36:22+08:00

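As long as each line has at most one quoted field, you could try sed and its branching feature, which gives you finer control over when substitutions are made: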

#!/bin/sh

re='"\([^"]*\)"\([^"]*\)"'
sub='"\1\2"'

sed ":b;s/$re/$sub/g;tb" file.csv

Or directly from the command line:

$ sed ':b;s/"\([^"]*\)"\([^"]*\)"/"\1\2"/g;tb' file.csv

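If it gives the expected result, you can use the -i flag to apply the changes to the file.

For more information, see the GNU manual: 6.4 Branching and Flow Control.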
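The loop works by deleting the middle quote of any three consecutive quotes and then branching back to the label until no substitution succeeds, so only the first and last quotes survive. A minimal sketch of the same loop follows, split into separate -e expressions on the assumption that some non-GNU sed implementations do not accept ";" right after a label name; strip_inner_quotes is a hypothetical wrapper name.

```shell
#!/bin/sh
# Hypothetical wrapper around the sed loop. Each pass removes the middle
# quote of a three-quote group; "tb" jumps back to ":b" while that succeeds.
# Assumes at most one quoted field per line.
strip_inner_quotes() {
    sed -e ':b' \
        -e 's/"\([^"]*\)"\([^"]*\)"/"\1\2"/g' \
        -e 'tb' "$@"
}

printf '%s\n' '"we have an "Opportunity" at our organization"' | strip_inner_quotes
```

A line with two correctly quoted fields would be merged into one by this loop, which is why the one-quoted-field condition matters.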

Ed Morton · Answer 4 · 2024-05-04T04:56:17+08:00

If each line can have at most 1 quoted field, then you can do the following using any awk:

$ awk '
    match($0,/".*"/) {
        fld = substr($0,RSTART+1,RLENGTH-2)
        gsub(/"/,"",fld)
        $0 = substr($0,1,RSTART) fld substr($0,RSTART+RLENGTH-1)
    }
    { print }
' file
"Hi there, we are from XYZ team, we have an Opportunity at our organization"
invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number
1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research Material Included  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

Or with any sed that interprets \n as a newline (otherwise use \<literal newline>):

$ sed 's/"\(.*\)"/\n\1\n/; s/"//g; s/\n/"/g' file
"Hi there, we are from XYZ team, we have an Opportunity at our organization"
invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number
1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research Material Included  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10

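If each line can have more than 1 quoted field, then this job cannot be done robustly with any tool without additional information on how to tell quotes inside fields apart from quotes around fields.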
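Counting quotes cannot tell those two cases apart, but it can at least list the lines worth inspecting by hand before relying on the one-quoted-field assumption. A rough sketch, where list_suspect_lines is a hypothetical helper name and the threshold of 2 is an assumption (a line with a single, clean quoted field has exactly 2 quotes):

```shell
#!/bin/sh
# Hypothetical helper: print the line number and quote count of every line
# that has more than two double quotes.
list_suspect_lines() {
    awk '
        {
            line = $0
            # gsub returns the number of replacements, here the quote count.
            n = gsub(/"/, "", line)
            if (n > 2) printf "%d: %d quotes\n", NR, n
        }
    ' "$@"
}

printf '%s\n' 'a,b,c' '1,"x",2' '1,"x "y" z",2' | list_suspect_lines
```

A reported line may be either one field with inner quotes or several correctly quoted fields; only a look at the data can decide which.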

The awk and sed commands above were run on this input file, built from the sample lines in the question:

$ cat file
"Hi there, we are from XYZ team, we have an "Opportunity" at our organization"
invoice number,invoice date,vendor number,vendor site ID,supplier site CODE,invoice description,invoice currency code,invoice total amount,line number,line amount,line description,account code,business unit,business center,department,issue code,project,task number
1686,2024-03-28,258,9845,NEWYORK,CA Project: Content,USD,538,1,26,,232130,,,,,2915,"Review new applications, and instruct the same.The deposits. Review correspondence applications. Review and applications. Research "Material Included"  and  artwork , and email. Communications with team website. Call, and communications.",230,,,,,295,10
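These commands write the corrected data to standard output and leave the input file untouched. To change the file itself without relying on a sed -i option, one possible approach is to filter into a temporary file and copy it back only if the filter succeeded. This is a sketch under the assumption that mktemp is available; apply_in_place is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: run a filter command on a file and overwrite the file
# with the result only if the filter exits successfully.
# Usage: apply_in_place FILE COMMAND [ARGS...]
apply_in_place() {
    file=$1
    shift
    tmp=$(mktemp) || return 1
    if "$@" < "$file" > "$tmp"; then
        # Writing through cat keeps the permissions of the original file.
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}

# Example call (commented out so that nothing is modified by accident):
# apply_in_place file sed -e ':b' -e 's/"\([^"]*\)"\([^"]*\)"/"\1\2"/g' -e 'tb'
```

Keeping a backup copy of the original file before the first run is still a good idea.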
