As a simple example, consider the following, using group_by:
import polars as pl
df = pl.DataFrame(
[pl.Series("id", ["a", "b", "a"]), pl.Series("x", [0, 1, 2])]
)
print(df.group_by("id").agg(pl.col("x")))
# shape: (2, 2)
# ┌─────┬───────────┐
# │ id ┆ x │
# │ --- ┆ --- │
# │ str ┆ list[i64] │
# ╞═════╪═══════════╡
# │ b ┆ [1] │
# │ a ┆ [0, 2] │
# └─────┴───────────┘
But if we use over, we get:
import polars as pl
df = pl.DataFrame(
[pl.Series("id", ["a", "b", "a"]), pl.Series("x", [0, 1, 2])]
)
print(df.with_columns(pl.col("x").over("id")))
# shape: (3, 2)
# ┌─────┬─────┐
# │ id ┆ x │
# │ --- ┆ --- │
# │ str ┆ i64 │
# ╞═════╪═════╡
# │ a ┆ 0 │
# │ b ┆ 1 │
# │ a ┆ 2 │
# └─────┴─────┘
How can we get the group_by result using over? Well, by using mapping_strategy="join".
A slightly more complex example, meant to show why we might want to use over rather than group_by:
import polars as pl
# the smallest normal value a Float32 can encode is about 1.18e-38
# therefore, as far as we are concerned,
# 1e-41 and 1e-42 should be indistinguishable
# in other words, we do not want to use "other" as an id column
# but we do want to preserve other!
df = pl.DataFrame(
[
pl.Series("id", ["a", "b", "a"]),
pl.Series("other", [1e-41, 1e-16, 1e-42], dtype=pl.Float32()),
pl.Series("x", [0, 1, 2]),
]
)
print(df.group_by("id").agg(pl.col("x"), pl.col("other")).explode("other"))
# shape: (3, 3)
# ┌─────┬───────────┬────────────┐
# │ id ┆ x ┆ other │
# │ --- ┆ --- ┆ --- │
# │ str ┆ list[i64] ┆ f32 │
# ╞═════╪═══════════╪════════════╡
# │ a ┆ [0, 2] ┆ 9.9997e-42 │
# │ a ┆ [0, 2] ┆ 1.0005e-42 │
# │ b ┆ [1] ┆ 1.0000e-16 │
# └─────┴───────────┴────────────┘
Now, using over:
import polars as pl
# the smallest normal value a Float32 can encode is about 1.18e-38
# therefore, as far as we are concerned,
# 1e-41 and 1e-42 should be indistinguishable
# in other words, we do not want to use "other" as an id column
# but we do want to preserve other!
df = pl.DataFrame(
[
pl.Series("id", ["a", "b", "a"]),
pl.Series("other", [1e-41, 1e-16, 1e-42], dtype=pl.Float32()),
pl.Series("x", [0, 1, 2]),
]
)
print(df.with_columns(pl.col("x").over(["id"], mapping_strategy="join")))
# shape: (3, 3)
# ┌─────┬────────────┬───────────┐
# │ id ┆ other ┆ x │
# │ --- ┆ --- ┆ --- │
# │ str ┆ f32 ┆ list[i64] │
# ╞═════╪════════════╪═══════════╡
# │ a ┆ 9.9997e-42 ┆ [0, 2] │
# │ b ┆ 1.0000e-16 ┆ [1] │
# │ a ┆ 1.0005e-42 ┆ [0, 2] │
# └─────┴────────────┴───────────┘
The problem is that mapping_strategy="join" is slow. So that means I should do a group_by followed by a join:
import polars as pl
import polars.selectors as cs
# the smallest normal value a Float32 can encode is about 1.18e-38
# therefore, as far as we are concerned,
# 1e-41 and 1e-42 should be indistinguishable
# in other words, we do not want to use "other" as an id column
# but we do want to preserve other!
df = pl.DataFrame(
[
pl.Series("id", ["a", "b", "a"]),
pl.Series("other", [1e-41, 1e-16, 1e-42], dtype=pl.Float32()),
pl.Series("x", [0, 1, 2]),
]
)
print(
df.select(cs.exclude("x")).join(
df.group_by("id").agg("x"),
on="id",
# we expect there to be multiple "id"s on the left, matching
# a single "id" on the right
validate="m:1",
)
)
# shape: (3, 3)
# ┌─────┬────────────┬───────────┐
# │ id ┆ other ┆ x │
# │ --- ┆ --- ┆ --- │
# │ str ┆ f32 ┆ list[i64] │
# ╞═════╪════════════╪═══════════╡
# │ a ┆ 9.9997e-42 ┆ [0, 2] │
# │ b ┆ 1.0000e-16 ┆ [1] │
# │ a ┆ 1.0005e-42 ┆ [0, 2] │
# └─────┴────────────┴───────────┘
But maybe I am missing something else about over?
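Currently, Polars does not distinguish between Scalars and Series: any Series of length 1 is treated as a scalar and is broadcast when combined with other series. We are actively working on distinguishing these two concepts better, and once we have done that, I expect list aggregation with Expr.implode() in an over context to work the way you expect. That is, the following should solve your problem, but currently does not:
For now, I would suggest doing a normal aggregation + join: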