Question

bzm3r

Asked: 2024-09-11 00:16:29 +0800 CST

With Polars, how do I efficiently perform an "over" operation that collects items into a list?

As a simple example, consider the following, which uses group_by:

import polars as pl

df = pl.DataFrame(
    [pl.Series("id", ["a", "b", "a"]), pl.Series("x", [0, 1, 2])]
)
print(df.group_by("id").agg(pl.col("x")))

# shape: (2, 2)
# ┌─────┬───────────┐
# │ id  ┆ x         │
# │ --- ┆ ---       │
# │ str ┆ list[i64] │
# ╞═════╪═══════════╡
# │ b   ┆ [1]       │
# │ a   ┆ [0, 2]    │
# └─────┴───────────┘

But if we use over, we get:

import polars as pl

df = pl.DataFrame(
    [pl.Series("id", ["a", "b", "a"]), pl.Series("x", [0, 1, 2])]
)
print(df.with_columns(pl.col("x").over("id")))
# shape: (3, 2)
# ┌─────┬─────┐
# │ id  ┆ x   │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a   ┆ 0   │
# │ b   ┆ 1   │
# │ a   ┆ 2   │
# └─────┴─────┘

How do we get the group_by result using over? Well, by passing mapping_strategy="join".

A slightly more complicated example, meant to show why we might want to use over instead of group_by:

import polars as pl

# the smallest normal positive value a Float32 can encode is about 1.18e-38;
# 1e-41 and 1e-42 are stored as subnormals with very little precision
# therefore, as far as we are concerned,
# 1e-41 and 1e-42 should be indistinguishable
# in other words, we do not want to use "other" as an id column
# but we do want to preserve other!
df = pl.DataFrame(
    [
        pl.Series("id", ["a", "b", "a"]),
        pl.Series("other", [1e-41, 1e-16, 1e-42], dtype=pl.Float32()),
        pl.Series("x", [0, 1, 2]),
    ]
)
print(df.group_by("id").agg(pl.col("x"), pl.col("other")).explode("other"))

# shape: (3, 3)
# ┌─────┬───────────┬────────────┐
# │ id  ┆ x         ┆ other      │
# │ --- ┆ ---       ┆ ---        │
# │ str ┆ list[i64] ┆ f32        │
# ╞═════╪═══════════╪════════════╡
# │ a   ┆ [0, 2]    ┆ 9.9997e-42 │
# │ a   ┆ [0, 2]    ┆ 1.0005e-42 │
# │ b   ┆ [1]       ┆ 1.0000e-16 │
# └─────┴───────────┴────────────┘

Now, using over:

import polars as pl

# the smallest normal positive value a Float32 can encode is about 1.18e-38;
# 1e-41 and 1e-42 are stored as subnormals with very little precision
# therefore, as far as we are concerned,
# 1e-41 and 1e-42 should be indistinguishable
# in other words, we do not want to use "other" as an id column
# but we do want to preserve other!
df = pl.DataFrame(
    [
        pl.Series("id", ["a", "b", "a"]),
        pl.Series("other", [1e-41, 1e-16, 1e-42], dtype=pl.Float32()),
        pl.Series("x", [0, 1, 2]),
    ]
)
print(df.with_columns(pl.col("x").over(["id"], mapping_strategy="join")))

# shape: (3, 3)
# ┌─────┬────────────┬───────────┐
# │ id  ┆ other      ┆ x         │
# │ --- ┆ ---        ┆ ---       │
# │ str ┆ f32        ┆ list[i64] │
# ╞═════╪════════════╪═══════════╡
# │ a   ┆ 9.9997e-42 ┆ [0, 2]    │
# │ b   ┆ 1.0000e-16 ┆ [1]       │
# │ a   ┆ 1.0005e-42 ┆ [0, 2]    │
# └─────┴────────────┴───────────┘

The problem is that mapping_strategy="join" is slow. So this means I should do a group_by first and then a join:

import polars as pl
import polars.selectors as cs

# the smallest normal positive value a Float32 can encode is about 1.18e-38;
# 1e-41 and 1e-42 are stored as subnormals with very little precision
# therefore, as far as we are concerned,
# 1e-41 and 1e-42 should be indistinguishable
# in other words, we do not want to use "other" as an id column
# but we do want to preserve other!
df = pl.DataFrame(
    [
        pl.Series("id", ["a", "b", "a"]),
        pl.Series("other", [1e-41, 1e-16, 1e-42], dtype=pl.Float32()),
        pl.Series("x", [0, 1, 2]),
    ]
)


print(
    df.select(cs.exclude("x")).join(
        df.group_by("id").agg("x"),
        on="id",
        # we expect there to be multiple "id"s on the left, matching
        # a single "id" on the right
        validate="m:1",
    )
)

# shape: (3, 3)
# ┌─────┬────────────┬───────────┐
# │ id  ┆ other      ┆ x         │
# │ --- ┆ ---        ┆ ---       │
# │ str ┆ f32        ┆ list[i64] │
# ╞═════╪════════════╪═══════════╡
# │ a   ┆ 9.9997e-42 ┆ [0, 2]    │
# │ b   ┆ 1.0000e-16 ┆ [1]       │
# │ a   ┆ 1.0005e-42 ┆ [0, 2]    │
# └─────┴────────────┴───────────┘

But maybe I am missing something else about over?

1 Answer

orlp · Answer 1 · 2024-09-11T05:36:43+08:00

Currently Polars does not distinguish between scalars and Series: any Series of length 1 is treated as a scalar and is broadcast when combined with other Series.

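We are actively working on distinguishing the two concepts better, and once we have, I expect list aggregation with Expr.implode() in an over context to work the way you expect. That is, the following should solve your problem, but currently does not:

>>> df.with_columns(pl.col.x.implode().over("id"))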

For now, I recommend a normal aggregation + join:

>>> df.drop("x").join(df.group_by("id").agg("x"), on="id")
shape: (3, 3)
┌─────┬────────────┬───────────┐
│ id  ┆ other      ┆ x         │
│ --- ┆ ---        ┆ ---       │
│ str ┆ f32        ┆ list[i64] │
╞═════╪════════════╪═══════════╡
│ a   ┆ 9.9997e-42 ┆ [0, 2]    │
│ b   ┆ 1.0000e-16 ┆ [1]       │
│ a   ┆ 1.0005e-42 ┆ [0, 2]    │
└─────┴────────────┴───────────┘
