Question

JontroPothon

Asked: 2025-02-13 03:17:56 +0800 CST

Separating an alphanumeric string with tidyr's separate_wider_regex

I have the following data:

id <- c("case1", "case19", "case88", "case77")
vec <- c("One_20 (19)",
         "tWo_20 (290)",
         "Three_38 (399)",
         NA)

df <- data.frame(id, vec)

> df
      id            vec
1  case1    One_20 (19)
2 case19   tWo_20 (290)
3 case88 Three_38 (399)
4 case77           <NA>

I want to split the vec column into two variables, txt and num. I would prefer to do this with tidyr:

df |> tidyr::separate_wider_regex(vec, 
                                   c(txt = "[A-Za-z]+", num = "\\d+"),
                                   too_few = "align_start")
# A tibble: 4 × 3
  id     txt   num  
  <chr>  <chr> <chr>
1 case1  One   NA   
2 case19 tWo   NA   
3 case88 Three NA   
4 case77 NA    NA

However, this is not what I want. I expect the following:

      id      txt num
1  case1   One_20  19
2 case19   tWo_20 290
3 case88 Three_38 399
4 case77     <NA>  NA

I have made a mistake in the regex part. How can I correct it so that I get the expected table as output?

4 Answers

Friede · Answer 1 · 2025-02-13T03:29:25+08:00

A base R approach using sub():

cbind(df['id'], {
  l = strsplit(sub('^(.*) \\((.*)\\)$', '\\1 \\2', df$vec), ' ')
  lapply(l, `length<-`, max(lengths(l))) |>
    do.call(what = 'rbind')
  }) |> setNames(c('id', 'txt', 'num'))

      id      txt  num
1  case1   One_20   19
2 case19   tWo_20  290
3 case88 Three_38  399
4 case77     <NA> <NA>

5

ThomasIsCoding · Answer 2 · 2025-02-13T03:49:33+08:00

Best Answer

Try:

> df %>%
+     separate_wider_regex(vec,
+         c(txt = "\\w+", "\\s+\\(", num = "\\d+","\\)"),
+         too_few = "align_start"
+     )
# A tibble: 4 × 3
  id     txt      num  
  <chr>  <chr>    <chr>
1 case1  One_20   19
2 case19 tWo_20   290
3 case88 Three_38 399
4 case77 NA       NA
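
A short annotated sketch of why this pattern succeeds where the one in the question does not. As far as I understand the tidyr documentation, the patterns are matched one after another against the whole string, and unnamed patterns are matched but left out of the result. In the question, "[A-Za-z]+" stops at the underscore, so "\\d+" has nothing to match. The conversion of num to integer at the end is my own assumption about the desired column type, not part of the answer.

```r
library(tidyr)

df <- data.frame(
  id  = c("case1", "case19", "case88", "case77"),
  vec = c("One_20 (19)", "tWo_20 (290)", "Three_38 (399)", NA)
)

res <- separate_wider_regex(
  df, vec,
  patterns = c(
    txt = "\\w+",     # letters, digits and underscore, e.g. "One_20"
    "\\s+\\(",        # unnamed: whitespace and "(" are matched, then dropped
    num = "\\d+",     # the digits inside the parentheses
    "\\)"             # unnamed: the closing ")"
  ),
  too_few = "align_start"
)

# Assumption: a numeric column is wanted rather than character
res$num <- as.integer(res$num)
res
```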

3

jpsmith · Answer 3 · 2025-02-13T05:33:36+08:00

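I'm not always the best with regex, so I try to avoid it where I can. For anyone with similar data, a non-regex approach is separate_wider_delim. This splits the "text_number" part from the "(number)" part at the space, and readr::parse_number then extracts the numeric value into num: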

df %>%
  separate_wider_delim(vec, " ", names = c("txt", "num")) %>%
  mutate(num = readr::parse_number(num))

#   id     txt        num
#   <chr>  <chr>    <dbl>
# 1 case1  One_20      19
# 2 case19 tWo_20     290
# 3 case88 Three_38   399
# 4 case77 NA          NA

You can also replace parse_number with another method of your choice, e.g. mutate(num = as.numeric(gsub("\\(|\\)", "", num))).
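
For completeness, here is how the gsub() variant might look as a full pipeline. This is only a sketch that rebuilds the question's df and assumes tidyr and dplyr are installed; it drops the readr dependency by stripping the parentheses by hand.

```r
library(tidyr)
library(dplyr)

df <- data.frame(
  id  = c("case1", "case19", "case88", "case77"),
  vec = c("One_20 (19)", "tWo_20 (290)", "Three_38 (399)", NA)
)

res <- df %>%
  # Split at the single space into "One_20" and "(19)"
  separate_wider_delim(vec, " ", names = c("txt", "num")) %>%
  # Remove "(" and ")" and convert what is left to a number
  mutate(num = as.numeric(gsub("\\(|\\)", "", num)))

res
```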

0

jtatria · Answer 4 · 2025-02-13T06:24:39+08:00

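As long as your regular expression is well constructed, you don't need any external packages or fancy one-liners.

For your particular case, this pattern works:

rx <- "([A-Za-z]+_[0-9]{2}) \\(([0-9]+)\\)"

The second group captures only the digits, so the parentheses are not carried over into num. You can then use the pattern directly to assign the needed columns in df, using sub:

df$txt <- sub( rx, "\\1", df$vec )
df$num <- sub( rx, "\\2", df$vec )

Or, if you want to avoid running the regex more than once, use regmatches/regexec and pull out each group:

m <- regmatches( df$vec, regexec( rx, df$vec ) )
df$txt <- vapply( m, function( x ) x[2], character(1) )  # first capture group
df$num <- vapply( m, function( x ) x[3], character(1) )  # second capture group; NA where nothing matched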
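
A related base R option, not part of the answer above, is utils::strcapture(), which applies a pattern once and converts each capture group to the type given in a prototype data frame. The sketch below assumes the same data as in the question and uses a slightly looser pattern ("\\w+" instead of the fixed two-digit suffix); the integer type for num is my assumption.

```r
df <- data.frame(
  id  = c("case1", "case19", "case88", "case77"),
  vec = c("One_20 (19)", "tWo_20 (290)", "Three_38 (399)", NA)
)

# One capture group per column of the prototype; rows that do not
# match (including NA) come back as NA in every column.
parts <- strcapture(
  pattern = "^(\\w+) \\((\\d+)\\)$",
  x       = df$vec,
  proto   = data.frame(txt = character(), num = integer())
)

out <- cbind(df["id"], parts)
out
```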

0
