
Question

Ramón Wilhelm

Asked: 2022-02-01 04:39:16 +0800 CST

AWK: Insert target terms after source terms from a dictionary in randomly selected lines


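Note: I have already asked a similar question about AWK: "Quick way to insert target words after an source term". I am a beginner in AWK.

This question is about inserting several target terms after their source terms in randomly selected lines.

With this AWK code snippet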

awk '(NR==FNR){a[$1];next}
    FNR in a { gsub(/\<source term\>/,"& target term") }
     1
    ' <(shuf -n 5 -i 1-$(wc -l < file)) file

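I want to insert a target term after a source term in 5 randomly selected lines of file.

For example: I have a bilingual dictionary dict that contains the source terms on the left and the target terms on the right, such as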

apple     : Apfel
banana    : Banane
raspberry : Himbeere

My file consists of the following lines:

I love the Raspberry Pi.
The monkey loves eating a banana.
Who wants an apple pi?
Apple pen... pineapple pen... pen-pineapple-apple-pen!
The banana is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry or strawberry?

Suppose lines 1, 3, 5, 4, 7 are randomly selected for the first word, apple. The output for the word apple would then look like this:

I love the Raspberry Pi.
The monkey loves eating a banana.
Who wants an apple Apfel pi?
Apple Apfel pen... pineapple pen... pen-pineapple-apple-pen!
The banana is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry or strawberry?

Then another 5 random lines (3, 3, 5, 6, 7) would be selected for the word banana:

I love the Raspberry Pi .
The monkey loves eating a banana .
Who wants an apple Apfel pi ?
Apple Apfel pen... pineapple pen... pen-pineapple-apple-pen!
The banana Banane is tasty and healthy .
An apple a day keeps the doctor away .
Which fruit is tastes better: raspberry or strawberry?

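The same is done for all other entries of dict until the last entry has been matched.

I want to select 5 random lines. If these lines contain a complete source term such as apple, I only want to match the whole word (terms such as "pineapple" should be ignored). If a line contains the source term twice, for example apple, then I also want to insert the target term Apfel after each occurrence. Matching should be case-insensitive, so that I can match source terms such as both apple and Apple.

My question: How can I rewrite the snippet above so that it uses the dictionary dict, selects random lines of file, and inserts the target terms after the source terms?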

2 Answers

Ed Morton · Answer 1 · 2022-02-01T15:23:23+08:00

Here is how to use awk to pick 5 random line numbers from an input file (using wc first to count the lines):

$ awk -v numLines="$(wc -l < file)" 'BEGIN{srand(); for (i=1; i<=5; i++) print int(1+rand()*numLines)}'
7
2
88
13
18
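
Note that int(1+rand()*numLines) samples with replacement, so the same line number can come up more than once, whereas the shuf -n 5 call in the question picks distinct lines. If distinct line numbers are wanted, one possible sketch is a partial Fisher-Yates shuffle in plain awk. The function name pick_lines and its optional seed argument are illustrative assumptions, not part of the answer above.

```shell
# pick_lines COUNT TOTAL [SEED]: print COUNT distinct line numbers from 1..TOTAL.
# (pick_lines, COUNT, TOTAL and SEED are hypothetical names for this sketch.)
pick_lines() {
    awk -v count="$1" -v total="$2" -v seed="${3:-}" '
    BEGIN {
        # A fixed seed makes the result repeatable; otherwise seed from the clock.
        if (seed != "") srand(seed); else srand()
        if (count > total) count = total
        for (i = 1; i <= total; i++) pool[i] = i
        # Partial Fisher-Yates shuffle: swap a random remaining element
        # into position i, then print it. No number is printed twice.
        for (i = 1; i <= count; i++) {
            j = i + int(rand() * (total - i + 1))
            tmp = pool[i]; pool[i] = pool[j]; pool[j] = tmp
            print pool[i]
        }
    }'
}

# Example: 5 distinct line numbers for a 7-line file.
pick_lines 5 7
```

Whether repeated line numbers matter depends on the goal: with repeats, a dictionary entry may end up being applied to fewer than 5 distinct lines.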

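Now all you have to do is take my previous answer and, for each "old" string read in the ARGIND==1 block, generate 5 line numbers as shown above. Populate an array that maps each generated line number to the "old" strings associated with it. Then, while reading the final input file, check whether the current line number is in the array and, if it is, loop over the "old" strings stored in the array for that line number and perform the gsub() as in my previous answer.

Using GNU awk for ARGIND, IGNORECASE, word boundaries, arrays of arrays, and the \s shorthand for [[:space:]]: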

$ cat tst.sh
#!/usr/bin/env bash

awk -v numLines=$(wc -l < file) '
    BEGIN {
        FS = "\\s*:\\s*"
        IGNORECASE = 1
        srand()
    }
    ARGIND == 1 {
        old = "\\<" $1 "\\>"
        new = "& " $2
        for (i=1; i<=5; i++) {
            lineNr = int(1+rand()*numLines)
            map[lineNr][old] = new
        }
        next
    }
    FNR in map {
        for ( old in map[FNR] ) {
            new = map[FNR][old]
            gsub(old,new)
        }
    }
    { print }
' dict file

$ ./tst.sh
I love the Raspberry Pi.
The monkey loves eating a banana Banane.
Who wants an apple Apfel pi?
Apple Apfel pen... pineapple pen... pen-pineapple-apple Apfel-pen!
The banana Banane is tasty and healthy.
An apple a day keeps the doctor away.
Which fruit is tastes better: raspberry Himbeere or strawberry?

guest_7 · Answer 2 · 2022-02-01T21:27:13+08:00


GNU sed with extended regular expression mode (-E) and the /e modifier of the s/// command:

n=$(< file wc -l)
sed -E '/\n/ba
  s#^(\S+)\s*:\s*(\S+)$#s/\\<\1\\>/\& \2/Ig#;h'"
  s/.*/shuf -n 5 -i '1-$n'/e;G
  :a
  s/^([0-9]+)(\n.*\n(.*))/\1 \3\2/
  /\n.*\n/!s/\n/ /
  P;D
" dict | sed -f /dev/stdin file

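Generate GNU sed commands from the contents of the dict file.
Store the command in the hold space.
Roll the dice and generate 5 random numbers within the range of the input file's line count.
Append the hold space to the pattern space and generate sed commands that run only on those specific lines.
Apply these generated commands to the input file.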
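
To make the steps above easier to follow, here is a rough sketch that produces the same kind of intermediate script with a plain shell loop instead of sed. It is not the answer's method, only an illustration of what the generated commands look like. The function name gen_script is hypothetical, and the sketch assumes single-word terms without regex metacharacters, as well as the shuf utility used in the answer.

```shell
# gen_script DICT FILE: print one address-prefixed GNU sed command per
# (dictionary entry, random line) pair, e.g.:  3 s/\<apple\>/& Apfel/Ig
# (gen_script is a hypothetical helper name for this sketch.)
gen_script() {
    n=$(wc -l < "$2")                       # number of lines in the input file
    # IFS=' :' splits "apple     : Apfel" into src=apple and tgt=Apfel.
    while IFS=' :' read -r src tgt; do
        # 5 distinct random line numbers per entry, as in the answer.
        for ln in $(shuf -n 5 -i "1-$n"); do
            # \< and \> are word boundaries; I = ignore case, g = every match.
            printf '%s s/\\<%s\\>/& %s/Ig\n' "$ln" "$src" "$tgt"
        done
    done < "$1"
}
```

The output would be applied in the same way as in the answer, by piping it into sed -f /dev/stdin file. Because each command carries a line number as its address, the substitution only fires on the randomly chosen lines.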
