
Kevin

Asked: 2024-09-27 09:17:40 +0800 CST

Polars dataframe: how to efficiently aggregate over many non-disjoint groups

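I have a dataframe with columns x, y, c_1, c_2, ..., c_K, where K is somewhat large (K ≈ 1000 or 2000).

Each column c_i is a boolean column, and I want to compute an aggregate f(x, y) over the rows where c_i is True. (For example, f(x, y) = x.sum() * y.sum().)

One way to do this is: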

ds.select([
    f(pl.col("x").filter(pl.col(f"c_{i+1}")), pl.col("y").filter(pl.col(f"c_{i+1}")))
    for i in range(K)
])

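In my problem, K is large, and the query above seems inefficient (the filtering is done twice for each c_i).

What is the recommended / most efficient / most elegant way to achieve this?

Edit.

Here is a runnable example (code at the bottom), along with some timings corresponding to @Hericks's answer below. TL;DR: Method 1, the repeated filter, is currently the best.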

#	Method	Wall time
1	Repeated filter	409 ms
2	`pl.concat`	29.6 s (≈70x slower)
2*	`pl.concat`, lazy	1.27 s (3x slower)
3	Unpivot (melt) with agg	1 min 17 s
3*	Unpivot (melt) with agg, lazy	1 min 17 s (same as 3)

import polars as pl
import polars.selectors as cs
import numpy as np
rng = np.random.default_rng()

def f(x,y):
    return x.sum() * y.sum()

N = 2_000_000
K = 1000
dat = dict()
dat["x"] = rng.standard_normal(N)  # use the same Generator as for the c_i columns
dat["y"] = rng.standard_normal(N)
for i in range(K):
    dat[f"c_{i+1}"] = rng.choice(2, N).astype(np.bool_)

tmpds = pl.DataFrame(dat)


## Method 1
tmpds.select([
    f(
        pl.col("x").filter(pl.col(f"c_{i+1}")),
        pl.col("y").filter(pl.col(f"c_{i+1}")))
    .alias(f"f_{i+1}") for i in range(K)
])

## Method 2
pl.concat([
    tmpds.filter(pl.col(f"c_{i+1}")).select(f(pl.col("x"), pl.col("y")).alias(f"f_{i+1}"))
    for i in range(K)
], how="horizontal")

## Method 2*
pl.concat([
    tmpds.lazy().filter(pl.col(f"c_{i+1}")).select(f(pl.col("x"), pl.col("y")).alias(f"f_{i+1}")).collect()
    for i in range(K)
], how="horizontal")

## Method 3
(
    tmpds
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable")
    .agg(
        f(pl.col("x"), pl.col("y"))
    )
)

## Method 3*
(
    tmpds
    .lazy()
    .unpivot(on=cs.starts_with("c"), index=["x", "y"])
    .filter("value")
    .group_by("variable", maintain_order=True)
    .agg(
        f(pl.col("x"), pl.col("y"))
    )
    .collect()
)
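The question does not say how the wall times were measured. As a rough sketch, methods like the ones above could be compared outside a notebook with a small helper such as the following; `wall_time` is a hypothetical name introduced here, not part of Polars, and the summed range is only a stand-in workload.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def wall_time(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stand-in workload; in practice, wrap one of the methods in a lambda instead.
result, seconds = wall_time(lambda: sum(range(1_000_000)))
print(f"{seconds:.3f} s")
```

A single run is noisy, so repeating each measurement a few times and taking the minimum or median gives a fairer comparison.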
