
Question

Tim

Asked: 2019-04-20 18:15:53 +0800 CST

Can we search a PDF file for pages that contain several words in no particular order?


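I would like to search a PDF file for all pages that each contain several given words, in no particular order. For example, I want to find all pages that contain both "hello" and "world", in either order.

I am not sure whether pdfgrep can do that.

I am trying to do something similar to how we can search for several words in a book shown in Google Books.

Thanks.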

2 Answers


mosvy · Answer 1 · 2019-04-20T18:23:33+08:00

Yes, if you use the -P option (which makes it use the PCRE engine and Perl-like regular expressions), you can do that with zero-width lookahead assertions.

$ pdfgrep -Pn '(?=.*process)(?=.*preparation)' ~/Str-Cmp.pdf
8:•     If a preparation process is used, the method used shall be declared.
10:Standard, preparation may be an important part of the ordering process. See Annex C for some examples of
38:padding. The preparation processing could move the original numerals (in order of occurrence) to the very
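
To see why the lookaheads make the word order irrelevant, here is a minimal sketch that applies the same pattern with GNU grep -P to a few made-up lines instead of a PDF. The sample text and the function names sample_lines and both_words are assumptions for illustration only.

```shell
# Hypothetical sample text standing in for lines extracted from a PDF.
sample_lines() {
    printf '%s\n' \
        'If a preparation process is used, it shall be declared.' \
        'The process is described in Annex C.' \
        'The process follows the preparation step.' \
        'Nothing relevant here.'
}

# Each (?=.*WORD) is a zero-width lookahead: it checks that WORD occurs
# somewhere on the rest of the line without consuming any characters, so
# both conditions are tested from the same position and order is irrelevant.
both_words() {
    grep -P '(?=.*process)(?=.*preparation)'
}

sample_lines | both_words
```

Only the first and third sample lines are printed, since they are the only ones containing both words.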

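The above only works when both words are on the same line; if the words can appear on different lines of the same page, you can do something like this: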

$ pdfgrep -Pn '^(?s:(?=.*process)(?=.*preparation))' ~/Str-Cmp.pdf
8:ISO/IEC 14651:2007(E)
9:                                                                                                  ISO/IEC 14651:2007(E)
10:ISO/IEC 14651:2007(E)
12:ISO/IEC 14651:2007(E)
...

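The s flag in (?s: means that . will also match newlines. Note that this only prints the first line of each page; you can adjust that with the -A option: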

$ pdfgrep -A4 -Pn '^(?s:(?=.*process)(?=.*preparation))' ~/Str-Cmp.pdf
8:ISO/IEC 14651:2007(E)
8-•     Any specific internal format for intermediate keys used when comparing, nor for the table used. The use of
8-      numeric keys is not mandated either.
8-•     A context-dependent ordering.
8-•     Any particular preparation of character strings prior to comparison.
--
9:                                                                                                  ISO/IEC 14651:2007(E)
...
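
If only the page numbers are needed, the same page-level test can be approximated outside pdfgrep. The following is a sketch of a different technique, not what the commands above do: it assumes the page text has already been extracted with form feeds between pages (as pdftotext from poppler emits), and the function name pages_with_all is made up for this example. Matching is fixed-string and case-sensitive.

```shell
# pages_with_all WORD...  (hypothetical helper; WORDs must not contain spaces)
# Reads text on stdin in which pages are separated by form feed characters
# and prints the 1-based number of every page that contains all WORDs.
pages_with_all() {
    awk -v words="$*" '
        BEGIN { RS = "\f"; n = split(words, w, " ") }
        {
            for (i = 1; i <= n; i++)
                if (index($0, w[i]) == 0) next   # a word is missing: skip this page
            print NR
        }'
}

# Three made-up "pages"; only the second has both words, on different lines.
printf 'hello there\fthe world\nsays hello\fworld only\n' | pages_with_all hello world
```

With a real file this could be used as: pdftotext file.pdf - | pages_with_all process preparation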

A rough wrapper script that prints the lines matching any of the patterns, taken from pages that match all of the patterns in any order:

usage: pdfgrepa [options] files ... -- patterns ...

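#! /bin/sh
# r1: lookaheads that must all match on a page; r2: alternation used to pick lines
r1= r2=
for a; do
        if [ "$r2" ]; then
                # after --: every argument is a pattern
                r1="$r1(?=.*$a)"; r2="$r2|$a"
        else
                case $a in
                # -- ends the options/files; seed r2 so grep keeps the -- page separators
                --)     r2='(?=^--$)';;
                # before --: rotate options and files to the end of "$@"
                *)      set -- "$@" "$a";;
                esac
        fi
        shift
done
# print whole matching pages (-A10000), then keep only lines matching any pattern
pdfgrep -A10000 -Pn "(?s:$r1)" "$@" | grep -P --color "$r2"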

$ pdfgrepa ~/Str-Cmp.pdf -i -- obtains process preparation
37- the strings after preparation are identical, and the end result (as the user would normally see it) could be
37- collation process applying the same rules. This kind of indeterminacy is undesirable.
37-one obtains after this preparation the following strings:
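
To make the wrapper easier to follow, this sketch isolates how the two regular expressions are assembled from the words given after --. The function name build_patterns is hypothetical, and the code only prints the patterns; it does not call pdfgrep.

```shell
# build_patterns WORD...  (hypothetical helper)
# Prints two lines: the page filter handed to pdfgrep, then the line
# filter handed to grep.
build_patterns() {
    r1=''
    r2='(?=^--$)'            # keeps the "--" separators that -A prints between pages
    for w in "$@"; do
        r1="$r1(?=.*$w)"     # a page must contain every word (lookaheads are ANDed)
        r2="$r2|$w"          # a line is shown if it contains any of the words
    done
    printf '%s\n' "(?s:$r1)" "$r2"
}

build_patterns obtains process preparation
```

The words are pasted into the regular expressions unescaped, so any regex metacharacters in them are interpreted as such. Also note that in the wrapper an option like -i reaches pdfgrep only, not the final grep.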

user1133275 · Answer 2 · 2019-04-20T18:45:25+08:00


pdfgrep -nP 'hello.{1,99}world|world.{1,99}hello' a.pdf

https://pdfgrep.org/doc.html
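
The alternation above spells out both orders explicitly. As a rough illustration with GNU grep -P on made-up lines (the sample text and the function names sample_text and either_order are assumptions), it matches only when the two words share a line and are separated by 1 to 99 characters:

```shell
# Hypothetical sample lines.
sample_text() {
    printf '%s\n' \
        'hello big world' \
        'world, hello' \
        'helloworld' \
        'hello' \
        'world'
}

# One alternative per ordering; .{1,99} requires at least one character
# between the two words, and . does not cross line boundaries here.
either_order() {
    grep -P 'hello.{1,99}world|world.{1,99}hello'
}

sample_text | either_order
```

Only the first two sample lines match. Every additional word multiplies the number of orderings that have to be listed (three words need six alternatives), which is where the lookahead form scales better.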
