Question

Ashark

Asked: 2023-05-05 07:43:05 +0800 CST

How to extract the table of contents from an fb2 book?

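I have a book in fb2 format. I want to print its table of contents, with the names and numbers of the "Parts", "Chapters", "Episodes", and so on.

Is there a way to do this from the terminal? There is a similar question, but for the epub format.

I know that fb2 is an XML format. But is there a tool that extracts only the TOC? The entries are inside the <section>, <title>, and <subtitle> tags.

If not, I suppose an XSL file could be made based on the official FB2_to_txt.xsl file. Or perhaps ebook-convert can do this?

The book I am working with has the following structure: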

<?xml version="1.0" encoding="utf8"?>
<FictionBook xmlns:l="http://www.w3.org/1999/xlink" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <description>
    <title-info>
      <genre>fiction</genre>
      <author>
        <first-name>John</first-name>
        <last-name>Doe</last-name>
      </author>
      <book-title>Fiction Book</book-title>
      <annotation>
        <p>Hello</p>
      </annotation>
      <keywords>john, doe, fiction</keywords>
      <date value="2011-07-18">18.07.2011</date>
      <coverpage></coverpage>
      <lang>en</lang>
    </title-info>
    <document-info>
      <author>
        <first-name></first-name>
        <last-name></last-name>
        <nickname></nickname>
      </author>
      <program-used>Fb2 Gem</program-used>
      <date value="2011-07-18">18.07.2011</date>
      <src-url></src-url>
      <src-ocr></src-ocr>
      <id></id>
      <version>1.0</version>
    </document-info>
    <publish-info>
    </publish-info>
  </description>
  <body>
    <title>
      <p>John Doe</p>
      <empty-line/>
      <p>Fiction Book</p>
    </title>
    <section>
      <title>
        <p>Part 1</p>
        <p>Some name of Part 1</p>
      </title>
      <section>
        <title>
          <p>Chapter 1</p>
          <p>Some name of Chapter 1</p>
        </title>
        <subtitle>Episode 1</subtitle>
        <p>Line one of the first episode</p>
        <p>Line two of the first episode</p>
        <p>Line three of the first episode</p>
        <subtitle>Episode 2</subtitle>
        <p>Line one of the second episode</p>
        <p>Line two of the second episode</p>
        <p>Line three of the second episode</p>
      </section>
    </section>
    <section>
      <title>
        <p>Part 2</p>
        <p>Some name of Part 2</p>
      </title>
      <section>
        <title>
          <p>Chapter 3</p>
          <p>Some name of Chapter 3</p>
        </title>
        <subtitle>Episode 3</subtitle>
        <p>Line one of the third episode</p>
        <p>Line two of the third episode</p>
        <p>Line three of the third episode</p>
        <subtitle>Episode 4</subtitle>
        <p>Line one of the fourth episode</p>
        <p>Line two of the fourth episode</p>
        <p>Line three of the fourth episode</p>
      </section>
    </section>
  </body>
</FictionBook>

I would like to get the following output:

Part 1
Some name of Part 1
Chapter 1
Some name of Chapter 1
Episode 1
Episode 2
Part 2
Some name of Part 2
Chapter 3
Some name of Chapter 3
Episode 3
Episode 4

3 Answers

Kusalananda · Answer 1 · 2023-05-05T15:57:28+08:00

Using xmlstarlet:

xmlstarlet select --template \
    --value-of '//_:section/_:title/_:p | //_:subtitle' \
    -nl file.xml

Or, using short options,

xmlstarlet sel -t \
    -v '//_:section/_:title/_:p | //_:subtitle' \
    -n file.xml

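The XPath query used here extracts the value of each p node under a title node under a section node, along with the values of all subtitle nodes.

The _: prefix before each node name in the expression is an anonymous placeholder for the namespace identifier that the document uses.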
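
To illustrate why some form of namespace prefix is needed, here is a minimal Python sketch using only the standard library's xml.etree.ElementTree. It is not part of the xmlstarlet solution; the cut-down SAMPLE document and the prefix name f are assumptions made for the example. The unqualified path finds nothing, while the path bound to the FictionBook namespace finds the section titles.

```python
import xml.etree.ElementTree as ET

FB2_NS = "http://www.gribuser.ru/xml/fictionbook/2.0"

# Cut-down stand-in for the book in the question (illustrative only).
SAMPLE = """<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <body>
    <title><p>Fiction Book</p></title>
    <section>
      <title><p>Part 1</p></title>
      <section>
        <title><p>Chapter 1</p></title>
        <subtitle>Episode 1</subtitle>
        <p>Line one of the first episode</p>
      </section>
    </section>
  </body>
</FictionBook>"""

root = ET.fromstring(SAMPLE)

# Without a namespace the path matches nothing, because every element
# in the document lives in the FictionBook default namespace.
unqualified = root.findall(".//section/title/p")

# Binding a prefix to the namespace URI makes the same path match.
# The prefix "f" is an arbitrary choice for this sketch.
ns = {"f": FB2_NS}
qualified = [p.text for p in root.findall(".//f:section/f:title/f:p", ns)]

print(len(unqualified))
print(qualified)
```

The f: prefix plays the same role here as _: does in the xmlstarlet expression, except that in Python the namespace URI has to be spelled out.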

Given your example document, the output of either of the two xmlstarlet commands above would be

Part 1
Some name of Part 1
Chapter 1
Some name of Chapter 1
Episode 1
Episode 2
Part 2
Some name of Part 2
Chapter 3
Some name of Chapter 3
Episode 3
Episode 4

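Should you also want the book title, remove the _:section restriction from the expression (this makes the p nodes of the book title match as well).

Another way to get the titles and subtitles of each section (avoiding the book title), which arguably looks neater since it shows that the subtitles are taken from the sections rather than from just anywhere, is to first restrict the match to the sections and then get the data from those: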

xmlstarlet select --template \
    --match '//_:section' \
    --value-of '_:title/_:p | _:subtitle' \
    -nl file.xml
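
For comparison, the same section-first idea can be sketched in Python with the standard library. This is an illustration under assumptions, not a replacement for the command above: the SAMPLE document is a cut-down stand-in for the book, the helper names (q, text_of, walk_section, section_toc) are made up for the example, and the depth tracking is an optional extra that the xmlstarlet command does not do.

```python
import xml.etree.ElementTree as ET

FB2_NS = "http://www.gribuser.ru/xml/fictionbook/2.0"

# Cut-down stand-in for the book in the question (illustrative only).
SAMPLE = """<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <body>
    <title><p>Fiction Book</p></title>
    <section>
      <title><p>Part 1</p></title>
      <section>
        <title><p>Chapter 1</p></title>
        <subtitle>Episode 1</subtitle>
        <p>Line one of the first episode</p>
      </section>
    </section>
  </body>
</FictionBook>"""


def q(name):
    """Return the namespace-qualified tag name that ElementTree uses."""
    return "{%s}%s" % (FB2_NS, name)


def text_of(node):
    """Concatenate all text inside a node and trim surrounding whitespace."""
    return "".join(node.itertext()).strip()


def walk_section(section, depth, out):
    """Collect (depth, text) pairs from one section and its nested sections."""
    for child in section:
        if child.tag == q("title"):
            out.extend((depth, text_of(p)) for p in child.findall(q("p")))
        elif child.tag == q("subtitle"):
            # Subtitles are listed one level below their enclosing section.
            out.append((depth + 1, text_of(child)))
        elif child.tag == q("section"):
            walk_section(child, depth + 1, out)


def section_toc(xml_text):
    """Start from the top-level sections of each body, skipping the book title."""
    root = ET.fromstring(xml_text)
    out = []
    for body in root.findall(q("body")):
        for section in body.findall(q("section")):
            walk_section(section, 0, out)
    return out


for depth, entry in section_toc(SAMPLE):
    print("  " * depth + entry)
```

Because the walk starts at the sections, the book title is never visited, which mirrors the effect of matching //_:section first.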

Gilles Quénot · Answer 2 · 2023-05-05T22:10:58+08:00

Using the XPath 3-aware FOSS (GPLv3) command-line tool xidel:

XPath 2, constructing a sequence:

xidel -e '(//section/title/p, //subtitle)'  file.xml

XPath 1:

xidel -e '//section/title/p | //subtitle'  file.xml

Part 1
Some name of Part 1
Chapter 1
Some name of Chapter 1
Episode 1
Episode 2
Part 2
Some name of Part 2
Chapter 3
Some name of Chapter 3
Episode 3
Episode 4

xidel is a Swiss Army knife for querying XML/HTML/JSON. It is smart enough to handle the default namespace on its own.

Michael Kay · Answer 3 · 2023-05-05T15:37:17+08:00

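It seems to me that the output consists of the result of the XPath expression (//title/p | //subtitle). So you just need to find a tool suitable for your environment that can execute that XPath expression and display the result.

For some suggested command-line tools, see https://www.baeldung.com/linux/evaluate-xpath . There is also Saxon's Gizmo tool (my company's product).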
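
If Python happens to be the available environment, the following is a rough sketch of evaluating the equivalent of that expression with the standard library only. ElementTree's limited XPath support has no union operator, so the sketch emulates (//title/p | //subtitle) with a single pass in document order. The SAMPLE document and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

FB2_NS = "http://www.gribuser.ru/xml/fictionbook/2.0"

# Cut-down stand-in for the book in the question (illustrative only).
SAMPLE = """<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0">
  <body>
    <title><p>Fiction Book</p></title>
    <section>
      <title><p>Part 1</p></title>
      <section>
        <title><p>Chapter 1</p></title>
        <subtitle>Episode 1</subtitle>
        <p>Line one of the first episode</p>
      </section>
    </section>
  </body>
</FictionBook>"""


def union_title_p_and_subtitle(xml_text):
    """Emulate (//title/p | //subtitle) in document order."""
    root = ET.fromstring(xml_text)
    title_tag = "{%s}title" % FB2_NS
    p_tag = "{%s}p" % FB2_NS
    subtitle_tag = "{%s}subtitle" % FB2_NS

    # ElementTree nodes have no parent pointer, so build a lookup table.
    parent_of = {child: node for node in root.iter() for child in node}

    results = []
    for node in root.iter():  # root.iter() walks the tree in document order
        is_title_p = (
            node.tag == p_tag
            and node in parent_of
            and parent_of[node].tag == title_tag
        )
        if is_title_p or node.tag == subtitle_tag:
            results.append("".join(node.itertext()).strip())
    return results


for entry in union_title_p_and_subtitle(SAMPLE):
    print(entry)
```

Note that the expression as written has no section restriction, so on this sample it also returns the book title ("Fiction Book") ahead of the section entries.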
