Question

Manfredo

Asked: 2020-02-05 08:56:31 +0800 CST

Convert bibliographic references for use with LaTeX

I have received a long Word document that I have to port to LaTeX. In the document, all citations appear in the classic author-year form. Something like

Lorem ipsum dolor (Sit, 1998) amet, consectetur adipiscing (Slit 2000, Sed and So 2002, Eiusmod et al. 1976).
Tempor incididunt ut labore et dolore magna aliqua (Ut et al. 1312)

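Each citation needs to be replaced by the corresponding citation key, as it appears in the bib reference list. In other words, the text should be translated to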

Lorem ipsum dolor \cite{sit1998} amet, consectetur adipiscing \cite{slit2000,sed2002,eiusmod1976}.
Tempor incididunt ut labore et dolore magna aliqua \cite{ut1312}

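This means:

Extract all strings in parentheses that consist of names and years
Strip from each string the spaces, the second name (everything after the first name), and the capital letters
Use the resulting string to form the new \cite{string}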
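
To make the last two steps concrete for a single entry, here is a minimal sketch. It assumes the entries have already been extracted, one "Author(s) Year" entry per line, and that the year is always the last word. The function name cite_key is hypothetical and only used for illustration.

```shell
# Hypothetical helper: turn one "Author(s) Year" entry per line into a cite key.
cite_key() {
    awk '{
        gsub(/,/, "")                         # drop the comma of the "Author, Year" form
        year = $NF                            # assumption: the last word is the year
        printf "%s%s\n", tolower($1), year    # first word in lower case, then the year
    }'
}

printf '%s\n' 'Sit, 1998' 'Sed and So 2002' 'Eiusmod et al. 1976' | cite_key
```

For the three sample entries this prints sit1998, sed2002 and eiusmod1976, which are the keys used in the desired output above.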

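I know this may be a fairly complex task. I wonder whether someone has already written a script for this specific task. Alternatively, any partial suggestion is also welcome. I am currently working on macOS.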

1 Answer

AdminBee · Answer 1 · 2020-02-11T04:24:48+08:00

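The following awk program should work. It looks for ( ... ) elements in each line and checks whether they fit the "author(s), year" or "author(s)1 year1, author(s)2 year2, ..." pattern. If so, it creates a citation command and replaces the ( ... ) group; otherwise it leaves the group as it is.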

#!/usr/bin/awk -f


# This small function creates an 'authorYYYY'-style string from
# separate author and year fields. We split the "author" field
# additionally at each space in order to strip leading/trailing
# whitespace and further authors.
function contract(author, year)
{
    split(author,auth_fields," ");
    auth=tolower(auth_fields[1]);
    return sprintf("%s%4d",auth,year);
}



# This function checks if two strings correspond to "author name(s)" and
# "year", respectively.
function check_entry(string1, string2)
{
    if (string1 ~ /^ *([[:alpha:].-]+ *)+$/ && string2 ~ /^ *[[:digit:]]{4} *$/) return 1;
    return 0;
}




# This function creates a 'citation' command from a raw element. If the
# raw element does not conform to the reference syntax of 'author, year' or
# 'author1 year1,author2 year2, ...', we should leave it "as is", and return
# a "0" as indicator.
function create_cite(raw_elem)
{
    cite_argument=""

    # Split at ','. The single elements are either name(list) and year,
    # or space-separated name(list)-year statements.
    n_fields=split(raw_elem,sgl_elem,",");

    if (n_fields == 2 && check_entry(sgl_elem[1],sgl_elem[2]))
    {
        cite_argument=contract(sgl_elem[1],sgl_elem[2]);
    }
    else
    {
        for (k=1; k<=n_fields; k++)
        {
            n_subfield=split(sgl_elem[k],subfield," ");

            if (check_entry(subfield[1],subfield[n_subfield]))
            {
                new_elem=contract(subfield[1],subfield[n_subfield]);
                if (cite_argument == "")
                {
                    cite_argument=new_elem;
                }
                else
                {
                    cite_argument=sprintf("%s,%s",cite_argument,new_elem);
                }
            }
            else
            {
                return 0;
            }
        }
    }


    cite=sprintf("\\cite{%s}",cite_argument);
    return cite;
}




# Actual program
# For each line, create a "working copy" so we can replace '(...)' pairs
# already processed with different text (here: 'X ... Y'); otherwise 'sub'
# would always stumble across the same opening parentheses.
# For each '( ... )' found, check if it fits the pattern. If so, we replace
# it with a 'cite' command; otherwise we leave it as it is.

{
    working_copy=$0;

    # Allow for unmatched ')' at the beginning of the line:
    # if a ')' was found before the first '(', mark it as processed
    i=index(working_copy,"(");
    j=index(working_copy,")");
    if (i>0 && j>0 && j<i) sub(/\)/,"Y",working_copy);

    while (i=index(working_copy,"("))
    {
        sub(/\(/,"X",working_copy); # mark this '(' as "already processed"

        j=index(working_copy,")");
        if (!j)
        {
            continue;
        }
        sub(/\)/,"Y",working_copy); # mark this ')', too


        elem=substr(working_copy,i+1,j-i-1);

        replacement=create_cite(elem);
        if (replacement != "0")
        {
            elem="\\(" elem "\\)"
            sub(elem,replacement);
        }

    }
    print $0;
}

Call the program with

~$ awk -f transform_citation.awk input.tex

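Note that the program expects the input to be "reasonably" well-formed, i.e. all parentheses on a line should come in matching pairs (although one closing parenthesis at the start of a line is allowed, and unmatched opening parentheses are ignored).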
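
To find lines that may violate this assumption before running the conversion, a rough pre-check along the following lines could help. This is only a sketch: the function name check_parens is hypothetical, and it merely compares the number of opening and closing parentheses per line, so it is stricter than the program requires (it also flags the tolerated cases) and it does not verify the order of the parentheses.

```shell
# Hypothetical pre-check: print "line number: text" for every line whose
# count of "(" differs from its count of ")".
check_parens() {
    awk '{
        line = $0
        opened = gsub(/\(/, "", line)    # gsub returns the number of replacements
        closed = gsub(/\)/, "", line)
        if (opened != closed) printf "%d: %s\n", FNR, $0
    }' "$@"
}

printf '%s\n' 'ok (Sit, 1998) here' 'broken (Slit 2000' 'no parentheses' | check_parens
```

Called as check_parens input.tex, it lists the lines worth inspecting by hand; lines it reports are not necessarily a problem for the program.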

Also note that some of the syntax above requires GNU awk. To port it to other implementations, replace

if (string1 ~ /^ *([[:alpha:].-]+ *)+$/ && string2 ~ /^ *[[:digit:]]{4} *$/) return 1;

with

if (string1 ~ /^ *([a-zA-Z.-]+ *)+$/ && string2 ~ /^ *[0123456789][0123456789][0123456789][0123456789] *$/) return 1;

and make sure that you have set the collation locale to C.
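
If it is unclear whether the local awk needs the portable variant, a small probe such as the one below may help decide. It is a sketch under the assumption that an awk without interval support treats {4} as literal text rather than raising an error; the function name supports_intervals is hypothetical. Setting LC_ALL=C for the command is one way to force the C locale, since LC_ALL overrides the collation setting.

```shell
# Hypothetical probe: does this awk understand the {4} interval expression?
supports_intervals() {
    if printf '1998\n' | LC_ALL=C awk '/^[0-9]{4}$/ { found = 1 } END { exit !found }'
    then
        echo "interval expressions supported"
    else
        echo "interval expressions not supported, use the portable pattern"
    fi
}

supports_intervals

# Running the conversion itself under the C locale:
# LC_ALL=C awk -f transform_citation.awk input.tex
```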
