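I have a list of items, each of which spans multiple lines. The markup that separates the items is unique (each item is an HTML <li>), and the text I am interested in only ever appears inside a single tagged paragraph (HTML <p>). I want to turn the list into a TSV with the fields in this order:
- Date
- Name
- URL
- Summary
In every item I have looked at, both the URL and the name appear twice, so I picked the first URL and the second name, since that seemed simplest to me. The summary may contain visual aid tags (e.g. <strong>), so I matched it with a negative lookahead, whereas the date should have no inner tags, so I matched it with a negated character class.
The first 2 items are: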
<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">On
Contact: Race and America's long war </a>
</p>
<p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">
<font color="#000080">
<img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/>
</font>
</a>
</p>
<p style="margin-bottom: 0in">On the show, Chris Hedges discusses
America's inner and outer wars and its nexus with capitalism and
empire with Professor of Social and Cultural Analysis and History at
New York University Nikhil Pal Singh. The internal violence in the
United...
</p>
<p style="margin-bottom: 0in">Feb 27, 2022 10:36</p>
<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">
<font color="#000080">
<img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/>
</font>
</a>
</p>
<p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">On
Contact: George Washington and the legacy of white supremacy </a></strong>
</p>
<p style="margin-bottom: 0in">On the show, Chris Hedges discusses
George Washington, the fallible human being and one of the principal
architects of the United States, with author Nathaniel Philbrick. As
America fractures into ideologically hostile camps, it colors how
we...
</p>
<p style="margin-bottom: 0in">Feb 25, 2022 09:09
</p>
<li>[...]
The regex I tried is
<li>.*<a href="([^"]+)".*alt="On Contact: ([^"]+)".*<p[^>]*>((?:.(?!<\/p>))+)<\/p><p[^>]*>([^<]+)<
and, if it worked, each match would be replaced with
$4\t$2\t$1\t$3
I want the regex to work in Notepad++.
Thanks for your help.
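Update 1
The test string I used later has more list items and has display tags added inside the summaries (e.g. <strong>). Although this is not consistent with the title, I had to remove those tags because they interfere with creating the TSV, and I figured I might as well remove the line breaks in the same pass (deleting [\t\r\n]), which results in: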
<li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">OnContact: Race and America's long war </a></p><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_1dff87941f1c724a.jpg" name="Image1" alt="On Contact: Race and America's long war" align="bottom" width="280" height="157" border="1"/> </font></a></p><p style="margin-bottom: 0in">On the show, Chris Hedges discussesAmerica's inner and outer wars and its nexus with capitalism and <strong>empire</strong> with Professor of Social and Cultural Analysis and History atNew York University Nikhil Pal Singh. The internal violence in theUnited... </p><p style="margin-bottom: 0in">Feb 27, 2022 10:36</p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_198feb67032166ff.png" name="Image3" alt="On Contact: George Washington and the legacy of white supremacy" align="bottom" width="280" height="157" border="1"/> </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">OnContact: George Washington and the legacy of white supremacy </a></strong></p><p style="margin-bottom: 0in">On the show, <span class="host">Chris Hedges</span> discusses George Washington, the fallible human being and one of the principalarchitects of the United States, with author Nathaniel Philbrick. AsAmerica fractures into ideologically hostile camps, it colors howwe... 
</p><p style="margin-bottom: 0in">Feb 25, 2022 09:09 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_e46c470920b1171d.jpg" name="Image4" alt="On Contact: Oppenheimer & the bomb culture" align="bottom" width="420" height="236" border="1"/> </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/549103-oppenheimer-bomb-culture-bird/">OnContact: Oppenheimer & the bomb culture </a></strong></p><p style="margin-bottom: 0in">On the show, Chris Hedges discusses J.Robert Oppenheimer and the making of the bomb with author <span class="author">Kai Bird.J. Robert Oppenheimer</span>, “the father of the atomic bomb,”was by the end of World War II one of the most celebrated men inAmerica.... </p><p style="margin-bottom: 0in">Feb 20, 2022 06:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_15449064d00f77f3.jpg" name="Image149" alt="On Contact – War with Iran? Stephen Kinzer" align="bottom" width="420" height="236" border="1"/> </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469859-war-iran-stephen-kinzer/">OnContact – War with Iran? Stephen Kinzer </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to journalistand author, Stephen Kinzer, on efforts by Saudi Arabia and Washington to cripple Iran’s economy, inevitably putting Saudi Arabia, its Gulf allies and Washington on a collision course with the <em>Islamic</em>... 
</p><p style="margin-bottom: 0in">Sep 29, 2019 07:10 </p><li><p style="margin-bottom: 0in"><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/"> <font color="#000080"> <img src="rt.com-on_contact-220405-no_blurb_html_b82502a96022a758.png" name="Image150" alt="The future of the Amazon rain forest – Sonia Bone Guajajara" align="bottom" width="280" height="157" border="1"/> </font></a></p><p style="margin-bottom: 0in"><strong><a href="https://www.rt.com/shows/on-contact/469339-future-amazon-rain-forest/">Thefuture of the Amazon rain forest – Sonia Bone Guajajara </a></strong></p><p style="margin-bottom: 0in">Host Chris Hedges talks to Sonia BoneGuajajara, leader of 300 indigenous ethnic groups in Brazil, aboutthe future of the Amazon rain forest, its people, climate change,and the competing goals of agrobusiness, multinational corporations,and the... </p><p style="margin-bottom: 0in">Sep 22, 2019 07:15 </p></ul>
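One side effect of deleting [\t\r\n] outright is visible in the flattened text above: words that were separated only by a line break are glued together (OnContact, discussesAmerica's). A minimal sketch in Python of the difference between deleting those characters and replacing each run of them with one space; the input is two lines of the first summary, and the variable names are only illustrative:

```python
import re

# Two lines of the first item's summary, as they appear in the source HTML.
wrapped = "On the show, Chris Hedges discusses\nAmerica's inner and outer wars"

# Deleting the characters outright, as in the update, glues neighbouring words.
deleted = re.sub(r"[\t\r\n]", "", wrapped)

# Replacing each run of them with a single space keeps the word boundary.
spaced = re.sub(r"[\t\r\n]+", " ", wrapped)

print(deleted)
print(spaced)
```

In Notepad++ the equivalent would be to search for [\t\r\n]+ in regular expression mode and replace it with a single space.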
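Your regex contains some errors that keep it from matching the text:
- \/ ==> / (the slash does not need to be escaped)
- Replace every .* with the non-greedy .*?
- (?:.(?!</p>))+ should be (?:(?!</p>).)+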
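The greedy-versus-lazy and lookahead-placement points can be checked outside Notepad++. A minimal sketch in Python, assuming its re module treats these constructs the same way as Notepad++'s engine for this input; the short html string and its u1/u2 values are made up for illustration:

```python
import re

# Made-up, minimal stand-in for two list items (not the real data).
html = '<li><a href="u1">A</a><p>abc</p><li><a href="u2">B</a><p>def</p>'

# Greedy .* runs to the end of the text and backtracks to the LAST href,
# so both items are swallowed by one match.
greedy = re.findall(r'<li>.*<a href="([^"]+)"', html)

# Lazy .*? stops at the first href after each <li>.
lazy = re.findall(r'<li>.*?<a href="([^"]+)"', html)

# Lookahead placed AFTER the dot: the character right before </p> can never
# be consumed, so the closing </p> is never reached and nothing matches.
lookahead_after = re.findall(r'<p>((?:.(?!</p>))+)</p>', html)

# Lookahead placed BEFORE the dot: every character up to </p> is consumed.
lookahead_before = re.findall(r'<p>((?:(?!</p>).)+)</p>', html)

print(greedy, lazy, lookahead_after, lookahead_before)
```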
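Moreover, the 2 <li> items in the sample text do not have the same structure: the first one has its image in the second <p>, while the second one has its image in the first <p>, so the capture groups do not capture the same data.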
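You can see the regex here.
I have changed your regex a little, assuming the wanted paragraphs do not contain any tags; it works for your example:
Demo and explanation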
Running it in Notepad++:
Find what: <li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)".*?<p[^>]*>((?:(?![<>]).)+?)</p>.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>
Replace with: $4\n$2\n$1\n$3\n\n
Check ". matches newline".
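For readers who want to try the pattern outside Notepad++, here is a rough Python equivalent. It is a sketch under assumptions: re.DOTALL stands in for the ". matches newline" option, the html string is an abridged copy of the first two items from the question (attributes and summaries shortened, image file names replaced by placeholders), and the fields are joined with tabs as the question asked rather than with the line breaks used in the replacement above.

```python
import re

# Abridged copy of the first two <li> items from the question.
html = """<li><p><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">On
Contact: Race and America's long war </a>
</p>
<p><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">
<img src="a.jpg" alt="On Contact: Race and America's long war"/>
</a>
</p>
<p>On the show, Chris Hedges discusses
America's inner and outer wars
</p>
<p>Feb 27, 2022 10:36</p>
<li><p><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">
<img src="b.png" alt="On Contact: George Washington and the legacy of white supremacy"/>
</a>
</p>
<p><strong><a href="https://www.rt.com/shows/on-contact/550319-george-washington-genocidal-colonist/">On
Contact: George Washington and the legacy of white supremacy </a></strong>
</p>
<p>On the show, Chris Hedges discusses
George Washington
</p>
<p>Feb 25, 2022 09:09
</p>
"""

# Same pattern as in the answer; groups are 1=URL, 2=name, 3=summary, 4=date.
pattern = re.compile(
    r'<li>.*?<a href="([^"]+)".*?alt="On Contact: ([^"]+)"'
    r'.*?<p[^>]*>((?:(?![<>]).)+?)</p>'
    r'.*?<p[^>]*>([a-zA-Z]{3} \d\d?, \d{4} \d\d?:\d\d)\s*</p>',
    re.DOTALL,
)

# Reorder to date, name, URL, summary and collapse whitespace in the summary.
rows = [
    (m.group(4), m.group(2), m.group(1), " ".join(m.group(3).split()))
    for m in pattern.finditer(html)
]
tsv = "\n".join("\t".join(row) for row in rows)
print(tsv)
```

Because the pattern requires the alt text to begin with "On Contact: ", items in the longer test string whose alt text starts differently would not be matched by it.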
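I like to break the problem down and try to optimize away any .* or .*? that I find. Note that the chance of breakage is much higher if the structure of the HTML changes. I am also a fan of regex engines that support the /x flag, because it lets me add whitespace and comments to help everything fit into my brain. This is what I came up with, with comments to help explain what each part is doing: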
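You can see what this does to your original text here.
The same regex with the comment lines and whitespace stripped can also be found here; it should drop straight into Notepad++ or any PCRE2-compatible tool you have.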
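As an illustration of the /x style only (this is not the pattern from this answer), here is a sketch in Python, where re.VERBOSE plays the role of /x. It lays out the regex from the previous answer with one comment per part; the name ITEM and the abridged item string are assumptions made for the example.

```python
import re

# In verbose mode unescaped whitespace is ignored, so literal spaces are
# written as "\ " and everything after an unescaped # is a comment.
ITEM = re.compile(r'''
    <li>                                # start of one list item
    .*? <a\ href="([^"]+)"              # group 1: first URL in the item
    .*? alt="On\ Contact:\ ([^"]+)"     # group 2: name, from the image alt text
    .*? <p[^>]*>                        # opening tag of a paragraph ...
        ((?:(?![<>]).)+?)               # group 3: ... with no tags inside (summary)
    </p>
    .*? <p[^>]*>                        # opening tag of the date paragraph
        ([a-zA-Z]{3}\ \d\d?,\ \d{4}\ \d\d?:\d\d)   # group 4: date and time
    \s* </p>
''', re.VERBOSE | re.DOTALL)

# Abridged version of the first item from the question.
item = (
    '<li><p><a href="https://www.rt.com/shows/on-contact/550756-america-long-war-race/">'
    '<img alt="On Contact: Race and America\'s long war"/></a></p>'
    '<p>On the show, Chris Hedges discusses America\'s inner and outer wars</p>'
    '<p>Feb 27, 2022 10:36</p>'
)

m = ITEM.search(item)
# Emit the fields in the order the question asked for: date, name, URL, summary.
line = "\t".join(m.group(4, 2, 1, 3))
print(line)
```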