How to split a for loop into 3 separate dataframes?

Question

Simon

Asked: 2024-05-23 23:59:20 +0800 CST

Processing search terms in text data with Polars


I have a Python script that loads search terms from a JSON file and processes a Pandas DataFrame, adding new columns that indicate whether certain terms appear in the text data. I would like to change the script to use Polars instead of Pandas, and possibly drop the dependency on JSON. Here is my original code:

```python
import pandas as pd
import json


class SearchTermLoader:
    def __init__(self, json_file):
        self.json_file = json_file

    def load_terms(self):
        with open(self.json_file, 'r') as f:
            data = json.load(f)

        terms = {}
        for phase_name, phase_data in data.items():
            terms[phase_name] = (
                phase_data.get('words', []),
                phase_data.get('exact_phrases', [])
            )
        return terms


class DataFrameProcessor:
    def __init__(self, df: pd.DataFrame, col_name: str) -> None:
        self.df = df
        self.col_name = col_name

    def add_contains_columns(self, search_terms):
        columns_to_add = ["type1", "type2"]
        for column in columns_to_add:
            self.df[column] = self.df[self.col_name].apply(
                lambda text: any(
                    term in text
                    for term in search_terms.get(column, ([], []))[0]
                    + search_terms.get(column, ([], []))[1]
                )
            )
        return self.df


# Example usage
data = {'text_column': ['The apple is red', 'I like bananas', 'Cherries are tasty']}
df = pd.DataFrame(data)

term_loader = SearchTermLoader('word_list.json')
search_terms = term_loader.load_terms()

processor = DataFrameProcessor(df, 'text_column')
new_df = processor.add_contains_columns(search_terms)

new_df
```

Here is an example of the JSON file:

```json
{
  "type1": {
    "words": ["apple", "tasty"],
    "exact_phrases": ["soccer ball"]
  },
  "type2": {
    "words": ["banana"],
    "exact_phrases": ["red apple"]
  }
}
```

I know I can use the `.str.contains()` function, but I want to use it to match specific words and exact phrases. Could you give me some guidance on how to get started?

1 Answer


jqurious · Answer 1 · 2024-05-24T00:44:12+08:00

For non-regex matching, `.str.contains_any()` may be a better option.

It looks like you want to concatenate the two lists:

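import polars as pl

word_list = pl.read_json("word_list.json")

# For older versions without struct "*" expansion, name the fields explicitly:
# type1 = pl.concat_list(
#    pl.col("type1").struct.field("words", "exact_phrases")
# )

# Merge each struct's "words" and "exact_phrases" lists into a single list
word_list = word_list.select(
   type1 = pl.concat_list(pl.col("type1").struct["*"]),
   type2 = pl.concat_list(pl.col("type2").struct["*"])
)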

shape: (1, 2)
┌───────────────────────────────────┬─────────────────────────┐
│ type1                             ┆ type2                   │
│ ---                               ┆ ---                     │
│ list[str]                         ┆ list[str]               │
╞═══════════════════════════════════╪═════════════════════════╡
│ ["apple", "tasty", "soccer ball"] ┆ ["banana", "red apple"] │
└───────────────────────────────────┴─────────────────────────┘

Then you can `pl.concat()` them onto your frame and run `.str.contains_any()`:

new_df = pl.concat([df, word_list], how="horizontal")

new_df.with_columns(
   type1 = pl.col("text_column").str.contains_any(pl.col("type1").flatten()),
   type2 = pl.col("text_column").str.contains_any(pl.col("type2").flatten())
)

shape: (3, 3)
┌─────────────────────────────┬───────┬───────┐
│ text_column                 ┆ type1 ┆ type2 │
│ ---                         ┆ ---   ┆ ---   │
│ str                         ┆ bool  ┆ bool  │
╞═════════════════════════════╪═══════╪═══════╡
│ The apple is red            ┆ true  ┆ false │
│ I like bananas              ┆ false ┆ true  │
│ Cherries are tasty          ┆ true  ┆ false │
└─────────────────────────────┴───────┴───────┘
