Questions asked by Phil-ZXX - coding

Phil-ZXX

Asked: 2025-04-14 23:57:58 +0800 CST

Group-by column in polars DataFrame

6

I have the following DataFrame:

import polars as pl

df = pl.DataFrame({
    'ID': [1, 1, 5, 5, 7, 7, 7],
    'YEAR': [2025, 2025, 2023, 2024, 2020, 2021, 2021]
})
shape: (7, 2)
┌─────┬──────┐
│ ID  ┆ YEAR │
│ --- ┆ ---  │
│ i64 ┆ i64  │
╞═════╪══════╡
│ 1   ┆ 2025 │
│ 1   ┆ 2025 │
│ 5   ┆ 2023 │
│ 5   ┆ 2024 │
│ 7   ┆ 2020 │
│ 7   ┆ 2021 │
│ 7   ┆ 2021 │
└─────┴──────┘

I now want to get the number of unique years per ID, i.e.

shape: (7, 3)
┌─────┬──────┬──────────────┐
│ ID  ┆ YEAR ┆ UNIQUE_YEARS │
│ --- ┆ ---  ┆ ---          │
│ i64 ┆ i64  ┆ u32          │
╞═════╪══════╪══════════════╡
│ 1   ┆ 2025 ┆ 1            │
│ 1   ┆ 2025 ┆ 1            │
│ 5   ┆ 2023 ┆ 2            │
│ 5   ┆ 2024 ┆ 2            │
│ 7   ┆ 2020 ┆ 2            │
│ 7   ┆ 2021 ┆ 2            │
│ 7   ┆ 2021 ┆ 2            │
└─────┴──────┴──────────────┘

I tried df.with_columns(pl.col('YEAR').over('ID').alias('UNIQUE_YEARS')), but the result is wrong. So I came up with

df.join(df.group_by('ID').agg(pl.col('YEAR').unique().len().alias('UNIQUE_YEARS')), on='ID', how='left')

This does give the correct result! But it looks a bit clunky, and I wonder whether there is a more natural way to do this using with_columns and over?

Phil-ZXX

Asked: 2024-11-27 00:11:28 +0800 CST

Sort Polars Dataframe columns based on row data

4

I have this data:

import polars as pl

pl.DataFrame({
    'region': ['EU', 'ASIA', 'AMER', 'Year'],
    'Share': [99, 6, -30, 2020],
    'Ration': [70, 4, -10, 2019],
    'Lots': [70, 4, -10, 2018],
    'Stake': [80, 5, -20, 2021],
})
# shape: (4, 5)
# ┌────────┬───────┬────────┬──────┬───────┐
# │ region ┆ Share ┆ Ration ┆ Lots ┆ Stake │
# │ ---    ┆ ---   ┆ ---    ┆ ---  ┆ ---   │
# │ str    ┆ i64   ┆ i64    ┆ i64  ┆ i64   │
# ╞════════╪═══════╪════════╪══════╪═══════╡
# │ EU     ┆ 99    ┆ 70     ┆ 70   ┆ 80    │
# │ ASIA   ┆ 6     ┆ 4      ┆ 4    ┆ 5     │
# │ AMER   ┆ -30   ┆ -10    ┆ -10  ┆ -20   │
# │ Year   ┆ 2020  ┆ 2019   ┆ 2018 ┆ 2021  │
# └────────┴───────┴────────┴──────┴───────┘

I want to sort the columns based on the Year row, while keeping the region column first. So ideally I am looking for this:

shape: (4, 5)
┌────────┬──────┬────────┬───────┬───────┐
│ region ┆ Lots ┆ Ration ┆ Share ┆ Stake │
│ ---    ┆ ---  ┆ ---    ┆ ---   ┆ ---   │
│ str    ┆ i64  ┆ i64    ┆ i64   ┆ i64   │
╞════════╪══════╪════════╪═══════╪═══════╡
│ EU     ┆ 70   ┆ 70     ┆ 99    ┆ 80    │
│ ASIA   ┆ 4    ┆ 4      ┆ 6     ┆ 5     │
│ AMER   ┆ -10  ┆ -10    ┆ -30   ┆ -20   │
│ Year   ┆ 2018 ┆ 2019   ┆ 2020  ┆ 2021  │
└────────┴──────┴────────┴───────┴───────┘

How can I achieve this? I tried using the polars sort function, but could not get it to do what I need.

Phil-ZXX

Asked: 2024-10-13 04:29:48 +0800 CST

Polars Pivot treats null as 0 when summing

7

I have this code:

import polars as pl

pl.DataFrame({
    'label':   ['AA', 'CC', 'BB', 'AA', 'CC'],
    'account': ['EU', 'US', 'US', 'EU', 'EU'],
    'qty':     [1.5,  43.2, None, None, 18.9]})\
  .pivot('account', index='label', aggregate_function='sum')

which gives

shape: (3, 3)
┌───────┬──────┬──────┐
│ label ┆ EU   ┆ US   │
│ ---   ┆ ---  ┆ ---  │
│ str   ┆ f64  ┆ f64  │
╞═══════╪══════╪══════╡
│ AA    ┆ 1.5  ┆ null │
│ CC    ┆ 18.9 ┆ 43.2 │
│ BB    ┆ null ┆ 0.0  │
└───────┴──────┴──────┘

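Now, whenever there is any null value in the original data, I want the pivot table to show null in the corresponding cell. However, AA-EU shows 1.5 (but should be null), and BB-US shows 0.0 (but should also be null).

I tried using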

aggregate_function=lambda col: pl.when(col.has_nulls())\
                                 .then(pl.lit(None, dtype=pl.Float64))\
                                 .otherwise(pl.sum(col))

but this fails with AttributeError: 'function' object has no attribute '_pyexpr'.

How can I fix this?

Phil-ZXX

Asked: 2024-09-12 19:17:14 +0800 CST

Sum columns with the same name (or "key") in a polars DataFrame

7

I have this code

import polars as pl

pl.DataFrame({
    'id': ['CHECK.US1', 'CHECK.US2', 'CHECK.CA9'],
    'libor.M2': [99, 332, 934],
    'libor.Y5': [11, -10, 904],
    'estr.M2':  [99, 271, 741],
    'estr.Y3':  [-8, -24, 183],
    'estr.Y5':  [88, 771, 455]
})

which gives

┌───────────┬──────────┬──────────┬─────────┬─────────┬─────────┐
│ id        ┆ libor.M2 ┆ libor.Y5 ┆ estr.M2 ┆ estr.Y3 ┆ estr.Y5 │
│ ---       ┆ ---      ┆ ---      ┆ ---     ┆ ---     ┆ ---     │
│ str       ┆ i64      ┆ i64      ┆ i64     ┆ i64     ┆ i64     │
╞═══════════╪══════════╪══════════╪═════════╪═════════╪═════════╡
│ CHECK.US1 ┆ 99       ┆ 11       ┆ 99      ┆ -8      ┆ 88      │
│ CHECK.US2 ┆ 332      ┆ -10      ┆ 271     ┆ -24     ┆ 771     │
│ CHECK.CA9 ┆ 934      ┆ 904      ┆ 741     ┆ 183     ┆ 455     │
└───────────┴──────────┴──────────┴─────────┴─────────┴─────────┘

What I want to do now is rename the columns to shorter names, such as

┌───────────┬──────┬──────┬─────┬─────┬─────┐
│ id        ┆ M2   ┆ Y5   ┆ M2  ┆ Y3  ┆ Y5  │
│ ---       ┆ ---  ┆ ---  ┆ --- ┆ --- ┆ --- │
│ str       ┆ i64  ┆ i64  ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪══════╪══════╪═════╪═════╪═════╡
or
┌───────────┬──────┬──────┬─────┬─────┬─────┐
│ id        ┆ libor┆ libor┆ estr┆ estr┆ estr│
│ ---       ┆ ---  ┆ ---  ┆ --- ┆ --- ┆ --- │
│ str       ┆ i64  ┆ i64  ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪══════╪══════╪═════╪═════╪═════╡

and then collapse (= sum) over the columns with the same name, so that I get, for example,

┌───────────┬──────┬──────┬──────┐
│ id        ┆ M2   ┆ Y5   ┆ Y3   │
│ ---       ┆ ---  ┆ ---  ┆ ---  │
│ str       ┆ i64  ┆ i64  ┆ i64  │
╞═══════════╪══════╪══════╪══════╡
│ CHECK.US1 ┆ 198  ┆ 99   ┆ -8   │
│ CHECK.US2 ┆ 603  ┆ 761  ┆ -24  │
│ CHECK.CA9 ┆ 1675 ┆ 1359 ┆ 183  │
└───────────┴──────┴──────┴──────┘

I first tried renaming them, but got polars.exceptions.DuplicateError: the name 'M2' is duplicate.

Is there any way to achieve what I want to do?

Edit: I also tried something like

rename_func = lambda col: col.split('.')[-1]
new_cols = set([rename_func(c) for c in df.columns])

df.with_columns([
  pl.sum_horizontal(pl.all().map(rename_func) == c).alias(c) for c in new_cols
])

but it does not quite work.

Phil-ZXX

Asked: 2024-09-11 21:17:40 +0800 CST

Polars pl.col(field).name.map_fields applies to all struct columns (not the one specified)

7

I have this code:

import polars as pl

cols = ['Delta', 'Qty']

metrics = {'CHECK.US': {'Delta': {'ABC': 1, 'DEF': 2}, 'Qty': {'GHIJ': 3, 'TT': 4}},
           'CHECK.NA': {},
           'CHECK.FR': {'Delta': {'QQQ': 7, 'ABC': 6}, 'Qty': {'SS': 9, 'TT': 5}}
          }

df = pl.DataFrame([{col: v.get(col) for col in cols} for v in metrics.values()])\
       .insert_column(0, pl.Series('key', metrics.keys()))\
       .with_columns([pl.col(col).name.map_fields(lambda x: f'{col} ({x})') for col in cols])

Now, df.unnest('Qty') correctly gives all columns in the format Qty (xxx):

shape: (3, 5)
┌──────────┬────────────┬────────────┬──────────┬──────────┐
│ key      ┆ Delta      ┆ Qty (GHIJ) ┆ Qty (TT) ┆ Qty (SS) │
│ ---      ┆ ---        ┆ ---        ┆ ---      ┆ ---      │
│ str      ┆ struct[3]  ┆ i64        ┆ i64      ┆ i64      │
╞══════════╪════════════╪════════════╪══════════╪══════════╡
│ CHECK.US ┆ {1,2,null} ┆ 3          ┆ 4        ┆ null     │
│ CHECK.NA ┆ null       ┆ null       ┆ null     ┆ null     │
│ CHECK.FR ┆ {6,null,7} ┆ null       ┆ 5        ┆ 9        │
└──────────┴────────────┴────────────┴──────────┴──────────┘

However, when I do the same with df.unnest('Delta'), it incorrectly returns columns named Qty (xxx):

shape: (3, 5)
┌──────────┬───────────┬───────────┬───────────┬────────────┐
│ key      ┆ Qty (ABC) ┆ Qty (DEF) ┆ Qty (QQQ) ┆ Qty        │
│ ---      ┆ ---       ┆ ---       ┆ ---       ┆ ---        │
│ str      ┆ i64       ┆ i64       ┆ i64       ┆ struct[3]  │
╞══════════╪═══════════╪═══════════╪═══════════╪════════════╡
│ CHECK.US ┆ 1         ┆ 2         ┆ null      ┆ {3,4,null} │
│ CHECK.NA ┆ null      ┆ null      ┆ null      ┆ null       │
│ CHECK.FR ┆ 6         ┆ null      ┆ 7         ┆ {null,5,9} │
└──────────┴───────────┴───────────┴───────────┴────────────┘

The values look correct; only the column names are wrong.

Am I using pl.col(col).name.map_fields(...) incorrectly? How can I fix the code so that the output becomes the following?

shape: (3, 5)
┌──────────┬─────────────┬─────────────┬─────────────┬────────────┐
│ key      ┆ Delta (ABC) ┆ Delta (DEF) ┆ Delta (QQQ) ┆ Qty        │
│ ---      ┆ ---         ┆ ---         ┆ ---         ┆ ---        │
│ str      ┆ i64         ┆ i64         ┆ i64         ┆ struct[3]  │
╞══════════╪═════════════╪═════════════╪═════════════╪════════════╡
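A likely explanation, assuming the renaming function is only invoked after the list comprehension has finished, is Python's late binding of closures: every lambda looks up col when it is called, and by then col holds the last value, 'Qty'. The pure-Python sketch below reproduces the effect without polars; late and bound are illustrative names.

```python
cols = ['Delta', 'Qty']

# Each lambda reads `col` at call time, so all of them see the final value 'Qty'
late = [lambda x: f'{col} ({x})' for col in cols]

# A default argument captures the current value of `col` when the lambda is created
bound = [lambda x, col=col: f'{col} ({x})' for col in cols]

print(late[0]('ABC'))   # Qty (ABC)
print(bound[0]('ABC'))  # Delta (ABC)
```

If that is indeed the cause, passing lambda x, col=col: f'{col} ({x})' to name.map_fields in the original code should give each struct its own prefix.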

Phil-ZXX

Asked: 2024-09-05 04:49:48 +0800 CST

Convert a float/int column in a polars DataFrame to string using a format specifier

8

I have this code:

import polars as pl
df = pl.DataFrame({'size': [34.2399, 1232.22, -479.1]})
df.with_columns(pl.format('{:,.2f}', pl.col('size')))

But it fails:

ValueError - Traceback, line 3
      2 df = pl.DataFrame({'size': [34.2399, 1232.22, -479.1]})
----> 3 df.with_columns(pl.format('{:,.2f}', pl.col('size')))

File polars\functions\as_datatype.py:718, in format(f_string, *args)
    717     msg = "number of placeholders should equal the number of arguments"
--> 718     raise ValueError(msg)

ValueError: number of placeholders should equal the number of arguments

How can I format a float or int column using a format specifier like '{:,.2f}'?

Phil-ZXX

Asked: 2024-09-03 22:07:15 +0800 CST

Explode polars rows on multiple columns, but with different logic

7

I have this code, which splits a product column into a list and then uses explode to expand it:

import polars as pl
import datetime as dt
from dateutil.relativedelta import relativedelta

def get_3_month_splits(product: str) -> list[str]:
    front, start_dt, total_m = product.rsplit('.', 2)
    start_dt = dt.datetime.strptime(start_dt, '%Y%m')
    total_m  = int(total_m)
    return [f'{front}.{(start_dt+relativedelta(months=m)).strftime("%Y%m")}.3' for m in range(0, total_m, 3)]

df = pl.DataFrame({
    'product':    ['CHECK.GB.202403.12', 'CHECK.DE.202506.6', 'CASH.US.202509.12'],
    'qty':        [10, -20, 50],
    'price_paid': [1400, -3300, 900],
})

print(df.with_columns(pl.col('product').map_elements(get_3_month_splits, return_dtype=pl.List(str))).explode('product'))

Currently this gives

shape: (10, 3)
┌───────────────────┬─────┬────────────┐
│ product           ┆ qty ┆ price_paid │
│ ---               ┆ --- ┆ ---        │
│ str               ┆ i64 ┆ i64        │
╞═══════════════════╪═════╪════════════╡
│ CHECK.GB.202403.3 ┆ 10  ┆ 1400       │
│ CHECK.GB.202406.3 ┆ 10  ┆ 1400       │
│ CHECK.GB.202409.3 ┆ 10  ┆ 1400       │
│ CHECK.GB.202412.3 ┆ 10  ┆ 1400       │
│ CHECK.DE.202506.3 ┆ -20 ┆ -3300      │
│ CHECK.DE.202509.3 ┆ -20 ┆ -3300      │
│ CASH.US.202509.3  ┆ 50  ┆ 900        │
│ CASH.US.202512.3  ┆ 50  ┆ 900        │
│ CASH.US.202603.3  ┆ 50  ┆ 900        │
│ CASH.US.202606.3  ┆ 50  ┆ 900        │
└───────────────────┴─────┴────────────┘

However, I want to keep the total price paid unchanged. So after splitting the rows into several "sub-categories", I want to change the table to:

shape: (10, 3)
┌───────────────────┬─────┬────────────┐
│ product           ┆ qty ┆ price_paid │
│ ---               ┆ --- ┆ ---        │
│ str               ┆ i64 ┆ i64        │
╞═══════════════════╪═════╪════════════╡
│ CHECK.GB.202403.3 ┆ 10  ┆ 1400       │
│ CHECK.GB.202406.3 ┆ 10  ┆ 0          │
│ CHECK.GB.202409.3 ┆ 10  ┆ 0          │
│ CHECK.GB.202412.3 ┆ 10  ┆ 0          │
│ CHECK.DE.202506.3 ┆ -20 ┆ -3300      │
│ CHECK.DE.202509.3 ┆ -20 ┆ 0          │
│ CASH.US.202509.3  ┆ 50  ┆ 900        │
│ CASH.US.202512.3  ┆ 50  ┆ 0          │
│ CASH.US.202603.3  ┆ 50  ┆ 0          │
│ CASH.US.202606.3  ┆ 50  ┆ 0          │
└───────────────────┴─────┴────────────┘

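That is, price_paid is kept only in the first exploded row, so the total price I paid stays the same. qty can stay as it is.

I tried, for example, with_columns(price_arr=pl.col('product').cast(pl.List(pl.Float64))), but could not add anything to the first element of the list. Or with_columns(price_arr=pl.col(['product', 'price_paid']).map_elements(price_func)), but it seems map_elements cannot be used with pl.col([...]).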

Phil-ZXX

Asked: 2024-08-20 23:33:20 +0800 CST

Polars Dataframe full-join (outer) on multiple columns without suffix

7

I have this code:

import polars as pl

df1 = pl.DataFrame({
    'type':   ['A', 'O', 'B', 'O'],
    'origin': ['EU', 'US', 'US', 'EU'],
    'qty1':   [343,11,22,-5]
})

df2 = pl.DataFrame({
    'type':   ['A', 'O', 'B', 'S'],
    'origin': ['EU', 'US', 'US', 'AS'],
    'qty2':   [-200,-12,-25,8]
})

df1.join(df2, on=['type', 'origin'], how='full')

which gives

┌──────┬────────┬──────┬────────────┬──────────────┬──────┐
│ type ┆ origin ┆ qty1 ┆ type_right ┆ origin_right ┆ qty2 │
│ ---  ┆ ---    ┆ ---  ┆ ---        ┆ ---          ┆ ---  │
│ str  ┆ str    ┆ i64  ┆ str        ┆ str          ┆ i64  │
╞══════╪════════╪══════╪════════════╪══════════════╪══════╡
│ A    ┆ EU     ┆ 343  ┆ A          ┆ EU           ┆ -200 │
│ O    ┆ US     ┆ 11   ┆ O          ┆ US           ┆ -12  │
│ B    ┆ US     ┆ 22   ┆ B          ┆ US           ┆ -25  │
│ null ┆ null   ┆ null ┆ S          ┆ AS           ┆ 8    │
│ O    ┆ EU     ┆ -5   ┆ null       ┆ null         ┆ null │
└──────┴────────┴──────┴────────────┴──────────────┴──────┘

But the output I want is this:

┌──────┬────────┬──────┬──────┐
│ type ┆ origin ┆ qty1 ┆ qty2 │
│ ---  ┆ ---    ┆ ---  ┆ ---  │
│ str  ┆ str    ┆ i64  ┆ i64  │
╞══════╪════════╪══════╪══════╡
│ A    ┆ EU     ┆ 343  ┆ -200 │
│ O    ┆ US     ┆ 11   ┆ -12  │
│ B    ┆ US     ┆ 22   ┆ -25  │
│ S    ┆ AS     ┆ null ┆ 8    │
│ O    ┆ EU     ┆ -5   ┆ null │
└──────┴────────┴──────┴──────┘

I tried passing suffix='' via df1.join(df2, on=['type', 'origin'], how='full', suffix=''), but this raises an error:

DuplicateError: unable to hstack, column with name "type" already exists

How can I achieve this?

Phil-ZXX

Asked: 2024-08-20 20:54:52 +0800 CST

Polars split column and get the n-th (or last) element

7

I have the following code and output.

Code.

import polars as pl

df = pl.DataFrame({
    'type': ['A', 'O', 'B', 'O'],
    'id':   ['CASH', 'ORB.A123', 'CHECK', 'OTC.BV32']
})

df.with_columns(sub_id=pl.when(pl.col('type') == 'O').then(pl.col('id').str.split('.')).otherwise(None))

Output.

shape: (4, 3)
┌──────┬──────────┬─────────────────┐
│ type ┆ id       ┆ sub_id          │
│ ---  ┆ ---      ┆ ---             │
│ str  ┆ str      ┆ list[str]       │
╞══════╪══════════╪═════════════════╡
│ A    ┆ CASH     ┆ null            │
│ O    ┆ ORB.A123 ┆ ["ORB", "A123"] │
│ B    ┆ CHECK    ┆ null            │
│ O    ┆ OTC.BV32 ┆ ["OTC", "BV32"] │
└──────┴──────────┴─────────────────┘

Now, how can I extract the n-th element (or, in this case, the last element) of each list?

In particular, the expected output is as follows.

shape: (4, 3)
┌──────┬──────────┬────────────┐
│ type ┆ id       ┆ sub_id     │
│ ---  ┆ ---      ┆ ---        │
│ str  ┆ str      ┆ str        │
╞══════╪══════════╪════════════╡
│ A    ┆ CASH     ┆ null       │
│ O    ┆ ORB.A123 ┆ "A123"     │
│ B    ┆ CHECK    ┆ null       │
│ O    ┆ OTC.BV32 ┆ "BV32"     │
└──────┴──────────┴────────────┘

Phil-ZXX

Asked: 2024-08-02 00:56:20 +0800 CST

Using polars when-then-otherwise on multiple output columns at once

12

Suppose I have this DataFrame

import polars as pl

df = pl.DataFrame({
    'item':         ['CASH', 'CHECK', 'DEBT', 'CHECK', 'CREDIT', 'CASH'],
    'quantity':     [100, -20, 0, 10, 0, 0],
    'value':        [99, 47, None, 90, None, 120],
    'value_other':  [97, 57, None, 91, None, 110],
    'value_other2': [94, 37, None, 93, None, 115],
})

┌────────┬──────────┬───────┬─────────────┬──────────────┐
│ item   ┆ quantity ┆ value ┆ value_other ┆ value_other2 │
│ ---    ┆ ---      ┆ ---   ┆ ---         ┆ ---          │
│ str    ┆ i64      ┆ i64   ┆ i64         ┆ i64          │
╞════════╪══════════╪═══════╪═════════════╪══════════════╡
│ CASH   ┆ 100      ┆ 99    ┆ 97          ┆ 94           │
│ CHECK  ┆ -20      ┆ 47    ┆ 57          ┆ 37           │
│ DEBT   ┆ 0        ┆ null  ┆ null        ┆ null         │
│ CHECK  ┆ 10       ┆ 90    ┆ 91          ┆ 93           │
│ CREDIT ┆ 0        ┆ null  ┆ null        ┆ null         │
│ CASH   ┆ 0        ┆ 120   ┆ 110         ┆ 115          │
└────────┴──────────┴───────┴─────────────┴──────────────┘

Now I want to set all value columns to 0 for all rows where value is null and quantity == 0.

Right now I have this solution

cols = ['value', 'value_other', 'value_other2']
df   = df.with_columns([
    pl.when(pl.col('value').is_null() & (pl.col('quantity') == 0))
    .then(0)
    .otherwise(pl.col(col))
    .alias(col)
    for col in cols
])

which correctly gives

┌────────┬──────────┬───────┬─────────────┬──────────────┐
│ item   ┆ quantity ┆ value ┆ value_other ┆ value_other2 │
│ ---    ┆ ---      ┆ ---   ┆ ---         ┆ ---          │
│ str    ┆ i64      ┆ i64   ┆ i64         ┆ i64          │
╞════════╪══════════╪═══════╪═════════════╪══════════════╡
│ CASH   ┆ 100      ┆ 99    ┆ 97          ┆ 94           │
│ CHECK  ┆ -20      ┆ 47    ┆ 57          ┆ 37           │
│ DEBT   ┆ 0        ┆ 0     ┆ 0           ┆ 0            │
│ CHECK  ┆ 10       ┆ 90    ┆ 91          ┆ 93           │
│ CREDIT ┆ 0        ┆ 0     ┆ 0           ┆ 0            │
│ CASH   ┆ 0        ┆ 120   ┆ 110         ┆ 115          │
└────────┴──────────┴───────┴─────────────┴──────────────┘

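However, I feel this is very inefficient, because my when condition is evaluated for every value column. Is there a way to achieve this using only polars internal functions and without a native for loop?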

Phil-ZXX

Asked: 2024-07-31 21:33:07 +0800 CST

Multiply a polars column of numeric type with an object type (which supports __mul__)

8

I have the following code.

import polars as pl

class Summary:
    def __init__(self, value: float, origin: str):
        self.value  = value
        self.origin = origin

    def __repr__(self) -> str:
        return f'Summary({self.value},{self.origin})'

    def __mul__(self, x: float | int) -> 'Summary':
        return Summary(self.value * x, self.origin)

    def __rmul__(self, x: float | int) -> 'Summary':
        return self * x

mapping = {
    'CASH':  Summary( 1, 'E'),
    'ITEM':  Summary(-9, 'A'),
    'CHECK': Summary(46, 'A'),
}

df = pl.DataFrame({'quantity': [7, 4, 10], 'type': mapping.keys(), 'summary': mapping.values()})

The DataFrame df looks as follows.

shape: (3, 3)
┌──────────┬───────┬───────────────┐
│ quantity ┆ type  ┆ summary       │
│ ---      ┆ ---   ┆ ---           │
│ i64      ┆ str   ┆ object        │
╞══════════╪═══════╪═══════════════╡
│ 7        ┆ CASH  ┆ Summary(1,E)  │
│ 4        ┆ ITEM  ┆ Summary(-9,A) │
│ 10       ┆ CHECK ┆ Summary(46,A) │
└──────────┴───────┴───────────────┘

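In particular, the summary column contains objects of the Summary class, which supports multiplication. Now I want to multiply this column with the quantity column.

However, the naive approach raises an error.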

df.with_columns(pl.col('quantity').mul(pl.col('summary')).alias('qty_summary'))

SchemaError: failed to determine supertype of i64 and object

Is there a way to multiply these columns?

Phil-ZXX

Asked: 2024-07-31 20:18:14 +0800 CST

Create a polars DataFrame from a dictionary (keys and values each as their own column)

7

I have the following code

import polars as pl

mapping = {
    'CASH':  {'qty':  1, 'origin': 'E'},
    'ITEM':  {'qty': -9, 'origin': 'A'},
    'CHECK': {'qty': 46, 'origin': 'A'},
}

df = pl.DataFrame([{'type': k} | v for k, v in mapping.items()])\
         .with_columns(pl.struct(['qty', 'origin']).alias('mapping'))\
         .select(pl.col(['type', 'mapping']))

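So the keys of the dictionary should become a new column named type, and the values of the dictionary should go into their own column mapping. My implementation above works, and df looks like this: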

shape: (3, 2)
┌───────┬───────────┐
│ type  ┆ mapping   │
│ ---   ┆ ---       │
│ str   ┆ struct[2] │
╞═══════╪═══════════╡
│ CASH  ┆ {1,"E"}   │
│ ITEM  ┆ {-9,"A"}  │
│ CHECK ┆ {46,"A"}  │
└───────┴───────────┘

But my implementation is long and does not look efficient. Is there a more idiomatic polars way to create this DataFrame?
