Question

Quinten

Asked: 2025-02-11 17:57:15 +0800 CST

Select the first and last row per group in a Polars DataFrame

I'm trying to use a Polars DataFrame to select the first and last row of each group. Here is a simple example that selects the first row per group:

import polars as pl

df = pl.DataFrame(
    {
        "a": [1, 2, 2, 3, 4, 5],
        "b": [0.5, 0.5, 4, 10, 14, 13],
        "c": [True, True, True, False, False, True],
        "d": ["Apple", "Apple", "Apple", "Banana", "Banana", "Banana"],
    }
)
result = df.group_by("d", maintain_order=True).first()
print(result)

Output:

shape: (2, 4)
┌────────┬─────┬──────┬───────┐
│ d      ┆ a   ┆ b    ┆ c     │
│ ---    ┆ --- ┆ ---  ┆ ---   │
│ str    ┆ i64 ┆ f64  ┆ bool  │
╞════════╪═════╪══════╪═══════╡
│ Apple  ┆ 1   ┆ 0.5  ┆ true  │
│ Banana ┆ 3   ┆ 10.0 ┆ false │
└────────┴─────┴──────┴───────┘

This works well, and we can use .last to get the last row the same way. But how can we combine the two in a single group_by?

3 Answers

mozway · Answer 1 · 2025-02-11T18:00:38+08:00

As columns

You can use agg. You have to add a suffix (or prefix) to tell the column names apart:

result = (df.group_by('d', maintain_order=True)
            .agg(pl.all().first().name.suffix('_first'),
                 pl.all().last().name.suffix('_last'))
         )

Output:

┌────────┬─────────┬─────────┬─────────┬────────┬────────┬────────┐
│ d      ┆ a_first ┆ b_first ┆ c_first ┆ a_last ┆ b_last ┆ c_last │
│ ---    ┆ ---     ┆ ---     ┆ ---     ┆ ---    ┆ ---    ┆ ---    │
│ str    ┆ i64     ┆ f64     ┆ bool    ┆ i64    ┆ f64    ┆ bool   │
╞════════╪═════════╪═════════╪═════════╪════════╪════════╪════════╡
│ Apple  ┆ 1       ┆ 0.5     ┆ true    ┆ 2      ┆ 4.0    ┆ true   │
│ Banana ┆ 3       ┆ 10.0    ┆ false   ┆ 5      ┆ 13.0   ┆ true   │
└────────┴─────────┴─────────┴─────────┴────────┴────────┴────────┘

As rows

If you want multiple rows instead, then you need concat:

g = df.group_by('d', maintain_order=True)

result = pl.concat([g.first(), g.last()]).sort(by='d', maintain_order=True)

Output:

┌────────┬─────┬──────┬───────┐
│ d      ┆ a   ┆ b    ┆ c     │
│ ---    ┆ --- ┆ ---  ┆ ---   │
│ str    ┆ i64 ┆ f64  ┆ bool  │
╞════════╪═════╪══════╪═══════╡
│ Apple  ┆ 1   ┆ 0.5  ┆ true  │
│ Apple  ┆ 2   ┆ 4.0  ┆ true  │
│ Banana ┆ 3   ┆ 10.0 ┆ false │
│ Banana ┆ 5   ┆ 13.0 ┆ true  │
└────────┴─────┴──────┴───────┘

Or use filter with int_range + over:

result = df.filter((pl.int_range(pl.len()).over('d') == 0)
                  |(pl.int_range(pl.len(), 0, -1).over('d') == 1)
                  )

Output:

┌─────┬──────┬───────┬────────┐
│ a   ┆ b    ┆ c     ┆ d      │
│ --- ┆ ---  ┆ ---   ┆ ---    │
│ i64 ┆ f64  ┆ bool  ┆ str    │
╞═════╪══════╪═══════╪════════╡
│ 1   ┆ 0.5  ┆ true  ┆ Apple  │
│ 2   ┆ 4.0  ┆ true  ┆ Apple  │
│ 3   ┆ 10.0 ┆ false ┆ Banana │
│ 5   ┆ 13.0 ┆ true  ┆ Banana │
└─────┴──────┴───────┴────────┘

Hericks · Answer 2 · 2025-02-11T20:18:56+08:00

@mozway's solutions work well! For completeness, I'd also like to share two solutions that rely on pl.Expr.gather.

In the select context

df.select(
    pl.all().gather([0, -1]).over("d", mapping_strategy="explode")
)

In the group-by context

(
    df
    .group_by("d", maintain_order=True)
    .agg(
        pl.all().gather([0, -1])
    )
    .explode(pl.exclude("d"))
)

Performance considerations

I also did some preliminary timing of the approaches (on the tiny example dataset).

Method	Time (mean ± std. dev. of 7 runs, 1,000 loops each)
`group_by` + `concat`	452 μs ± 7.34 μs per loop
`filter`	396 μs ± 10.2 μs per loop
`group_by` + `gather`	255 μs ± 4.09 μs per loop
`select` + `gather`	172 μs ± 1.29 μs per loop

jqurious · Answer 3 · 2025-02-11T21:57:47+08:00

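There are dedicated methods for flagging the first and last occurrence of a value, is_first_distinct and is_last_distinct.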

df.filter(
    pl.any_horizontal(
        pl.col("d").is_first_distinct(),
        pl.col("d").is_last_distinct()
    )
)

shape: (4, 4)
┌─────┬──────┬───────┬────────┐
│ a   ┆ b    ┆ c     ┆ d      │
│ --- ┆ ---  ┆ ---   ┆ ---    │
│ i64 ┆ f64  ┆ bool  ┆ str    │
╞═════╪══════╪═══════╪════════╡
│ 1   ┆ 0.5  ┆ true  ┆ Apple  │
│ 2   ┆ 4.0  ┆ true  ┆ Apple  │
│ 3   ┆ 10.0 ┆ false ┆ Banana │
│ 5   ┆ 13.0 ┆ true  ┆ Banana │
└─────┴──────┴───────┴────────┘

If the group identifier spans multiple columns, you can use a struct.

pl.struct("c", "d").is_first_distinct()
