
Question

bill999

Asked: 2024-08-08 03:16:09 +0800 CST

How to create a data frame from a website scraped with rvest, preserving the nested structure of the data


Suppose I have used read_html_live() from the rvest package to extract some HTML that looks like this:

books <- minimal_html('
  <div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author">Author 1</div>
      <div class="author">Author 2</div>
      <div class="author">Author 3</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 2</div>
      <div class="year">2022</div>
      <div class="author">Author 4</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 3</div>
      <div class="year">1845</div>
      <div class="author">Author 5</div>
      <div class="author">Author 6</div>
      <div class="author">Author 7</div>
      <div class="author">Author 8</div>
    </div>    
  </div>')

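I would like to use the rvest package to create a data frame (a tibble is fine too) containing the information above. I want it organized at the author level, so that each row contains an author, the book title, and the year.

If I only cared about the first author, this would be simple. For example: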

data0 <- books %>% html_elements(".book")
title <- data0 %>% html_element(".booktitle") %>% html_text2()
year <- data0 %>% html_element(".year") %>% html_text2()
author1 <- data0 %>% html_element(".author") %>% html_text2()
data <- data.frame(title, year, author1)

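However, I actually want to extract all of the authors, which are children of each book. The data frame would then have eight rows, one per author. For example, row 8 would contain Book 3, 1845, and Author 8. How can I do this?

Here is a rough idea, but I am looking for a simpler solution: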

data0 <- books %>% html_elements(".book") 
title <- data0 %>% html_element(".booktitle") %>% html_text2()
year <- data0 %>% html_element(".year") %>% html_text2()

authors <- data0 %>% html_element(".author")

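Then loop over the three elements of authors and save each one to its own data frame. Then associate each author data frame with the relevant title and year, and somehow reshape the result into a long data frame.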

2 Answers


score 3 · Answer 1 · 2024-08-08T03:42:19+08:00

Here is one approach that loops over the book nodes with lapply():

library(rvest)
library(dplyr, warn.conflicts = FALSE)
books <- minimal_html('
  <div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author">Author 1</div>
      <div class="author">Author 2</div>
      <div class="author">Author 3</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 2</div>
      <div class="year">2022</div>
      <div class="author">Author 4</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 3</div>
      <div class="year">1845</div>
      <div class="author">Author 5</div>
      <div class="author">Author 6</div>
      <div class="author">Author 7</div>
      <div class="author">Author 8</div>
    </div>
  </div>')

data0 <- books %>%
  html_elements(".book") |>
  lapply(\(x) {
    tibble(
      title = x |> html_element(".booktitle") |> html_text2(),
      year = x |> html_element(".year") |> html_text2(),
      authors = x |> html_elements(".author") |> html_text2(),
    )
  }) |>
  bind_rows()

data0
#> # A tibble: 8 × 3
#>   title  year  authors 
#>   <chr>  <chr> <chr>   
#> 1 Book 1 1999  Author 1
#> 2 Book 1 1999  Author 2
#> 3 Book 1 1999  Author 3
#> 4 Book 2 2022  Author 4
#> 5 Book 3 1845  Author 5
#> 6 Book 3 1845  Author 6
#> 7 Book 3 1845  Author 7
#> 8 Book 3 1845  Author 8
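
A variation on the same idea, offered as a minimal sketch and not as part of the original answer: keep the authors in a list-column first, so the nesting stays visible, and expand to one row per author afterwards. It assumes the tidyr package is available for unnest_longer(); the HTML is a shortened copy of the example, and the names book_nodes, nested, and author_level are illustrative.

```r
library(rvest)
library(tibble)
library(tidyr)

# Shortened copy of the example HTML (two books, three authors in total)
books <- minimal_html('
  <div>
    <div class="book">
      <div class="booktitle">Book 1</div>
      <div class="year">1999</div>
      <div class="author">Author 1</div>
      <div class="author">Author 2</div>
    </div>
    <div class="book">
      <div class="booktitle">Book 2</div>
      <div class="year">2022</div>
      <div class="author">Author 3</div>
    </div>
  </div>')

book_nodes <- html_elements(books, ".book")

# One row per book; `author` is a list-column holding that book's authors
nested <- tibble(
  title  = book_nodes |> html_element(".booktitle") |> html_text2(),
  year   = book_nodes |> html_element(".year") |> html_text2(),
  author = lapply(book_nodes, \(node) node |> html_elements(".author") |> html_text2())
)

# Expand the list-column: one row per author, title and year repeated
author_level <- unnest_longer(nested, author)
```

Here html_element() returns exactly one match per book node, while html_elements() returns every match, which is why the authors are collected per node inside lapply().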

score 3 · Answer 2 · 2024-08-08T03:48:23+08:00

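This puts the class attributes and text into a name-value dataset in long format. A book identifier (book) is added to the output data frame to make grouped operations (such as converting to wide format) easier: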

library(rvest)
library(purrr)

book <- html_elements(books, xpath = "//div[@class='book']") 

data <- map_dfr(seq_along(book), \(i) {
  b <- book[[i]]
  children <- html_children(b)
  data.frame(book = i,
             name = children |> html_attrs() |> unlist(use.names = FALSE),
             value = html_text2(children))
})
#    book      name    value
# 1     1 booktitle   Book 1
# 2     1      year     1999
# 3     1    author Author 1
# 4     1    author Author 2
# 5     1    author Author 3
# 6     2 booktitle   Book 2
# 7     2      year     2022
# 8     2    author Author 4
# 9     3 booktitle   Book 3
# 10    3      year     1845
# 11    3    author Author 5
# 12    3    author Author 6
# 13    3    author Author 7
# 14    3    author Author 8

For example,

library(tidyr)

pivot_wider(data, id_cols = book, values_fn = toString)
#    book booktitle year  author                             
# 1     1 Book 1    1999  Author 1, Author 2, Author 3          
# 2     2 Book 2    2022  Author 4                              
# 3     3 Book 3    1845  Author 5, Author 6, Author 7, Author 8
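
The same long name-value data can also be reshaped into the author-level layout the question asks for (one row per author, with title and year repeated). The following is a rough sketch under stated assumptions, not part of the original answer: it assumes dplyr is available in addition to tidyr, it rebuilds data by hand from the output printed above so that it runs on its own, and the names book_info, authors, and author_level are illustrative.

```r
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

# Long name-value data, copied from the output printed above
data <- data.frame(
  book  = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  name  = c("booktitle", "year", "author", "author", "author",
            "booktitle", "year", "author",
            "booktitle", "year", "author", "author", "author", "author"),
  value = c("Book 1", "1999", "Author 1", "Author 2", "Author 3",
            "Book 2", "2022", "Author 4",
            "Book 3", "1845", "Author 5", "Author 6", "Author 7", "Author 8")
)

# One row per book: the fields that occur exactly once per book
book_info <- data |>
  filter(name != "author") |>
  pivot_wider(names_from = name, values_from = value)

# One row per author, keyed by the book identifier
authors <- data |>
  filter(name == "author") |>
  select(book, author = value)

# Attach title and year to every author row
author_level <- left_join(authors, book_info, by = "book")
```

Splitting the once-per-book fields from the repeated author field keeps the join key explicit, at the cost of one extra step compared with collapsing the authors into a single string.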
