Questions tagged [python-polars] - Page 1

Ghost

Asked: 2025-03-25 01:51:43 +0800 CST

Polars DataFrame from nested dictionaries as columns

8

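I have a dictionary of nested columns, each keyed by its index. When I try to convert it to a Polars DataFrame, it picks up the column names and values correctly, but each column ends up with a single element, a struct of the column's elements, instead of being "expanded" into a series.

For example, suppose I have: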

d = {'col1': {'0':'A','1':'B','2':'C'}, 'col2': {'0':1,'1':2,'2':3}}

Then, when I run pl.DataFrame(d) or pl.from_dict(d), I get:

col1           col2
---            ---
struct[3]      struct[3]
{"A","B","C"}  {1,2,3}

instead of a regular DataFrame.

Any idea how to fix this?

Thanks in advance!

apostofes

Asked: 2024-11-06 13:47:43 +0800 CST

join_where with starts_with in Polars

7

I have two DataFrames,

df = pl.DataFrame({'url': ['https//abc.com', 'https//abcd.com', 'https//abcd.com/aaa', 'https//abc.com/abcd']})

conditions_df = pl.DataFrame({'url': ['https//abc.com', 'https//abcd.com', 'https//abcd.com/aaa', 'https//abc.com/aaa'], 'category': [['a'], ['b'], ['c'], ['d']]})

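Now I want a df that assigns a category to each row of the first df, based on the first entry in the second df whose url it starts with, i.e. the output should be,

url	category
https//abc.com	['a']
https//abcd.com	['b']
https//abcd.com/aaa	['b'] - this one starts with https//abcd.com, which is the first match
https//abc.com/abcd	['a'] - this one starts with https//abc.com, which is the first match

The code that currently works is this,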

def add_category_column(df: pl.DataFrame, conditions_df) -> pl.DataFrame:
    
    # Initialize the category column with empty lists
    df = df.with_columns(pl.Series("category", [[] for _ in range(len(df))], dtype=pl.List(pl.String)))
    
    # Apply the conditions to populate the category column
    for row in conditions_df.iter_rows():
        url_start, category = row
        df = df.with_columns(
            pl.when(
                (pl.col("url").str.starts_with(url_start)) & (pl.col("category").list.len() == 0)
            )
            .then(pl.lit(category))
            .otherwise(pl.col("category"))
            .alias("category")
        )
    
    return df

But is there a way to achieve the same result without the for loop? Can we use join_where here? In my attempts, join_where did not work with starts_with.

bzm3r

Asked: 2024-10-31 01:25:59 +0800 CST

How to flatten the elements of a list-of-lists column so that it becomes a column of lists?

7

Consider the following example:

import polars as pl

pl.DataFrame(pl.Series("x", ["1, 0", "2,3", "5 4"])).with_columns(
    pl.col("x").str.split(",").list.eval(pl.element().str.split(" "))
)

shape: (3, 1)
┌────────────────────┐
│ x                  │
│ ---                │
│ list[list[str]]    │
╞════════════════════╡
│ [["1"], ["", "0"]] │
│ [["2"], ["3"]]     │
│ [["5", "4"]]       │
└────────────────────┘

I want to flatten the elements of the column so that they are plain lists rather than nested lists. How can I do that?

Kazdegotepu

Asked: 2024-10-11 18:13:29 +0800 CST

Nested polars.col() [duplicate]

foo	foo_count	bar	bar_count	baz	baz_count	largest
1	23	4	43	5	64	baz
2	45	6	45	1	43	bar
3	234	9	453	15	231	baz
4	55	2	67	3	94	foo

foo	foo_count	bar	bar_count	baz	baz_count	largest	largest_count
1	23	4	43	5	64	baz	64
2	45	6	45	1	43	bar	45
3	234	9	453	15	231	baz	231
4	55	2	67	3	94	foo	4

How to create a float sequence in Polars of type List[f64]

8

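I have a Polars List[f64] column "a". I want to create a new List[f64] column "b" that, for each row, holds the sequence from the minimum to the maximum of the list in column "a", in steps of 0.5 (inclusive). So for a row whose list in column "a" is [0.0, 3.0, 2.0, 6.0, 2.0], the value in column "b" should be [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0].

Here is my solution, but it has a bug.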

df = df.with_columns(
        pl.col("a").list.eval(
            pl.arange(pl.element().min(), pl.element().max(), 1)
            .append(pl.arange(pl.element().min(), pl.element().max(), 1) + 0.5)
            .append(pl.element().max())
            .append(pl.element().max() - 0.5)
            .unique()
            .sort(),
            parallel=True,
        )
        .alias("b")
    )

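It fails on the edge case where the list in column "a" contains only one unique value. Since Polars seems to only have an integer arange() function, when I create the second list and add 0.5, a list with a single unique value ends up with 2 values in the output: the value actually seen and that value minus 0.5.

Here is some toy data. Column "a" contains the lists whose minimum and maximum values define the bounds of the sequence in column "b".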

pl.DataFrame([
    pl.Series('a', [[4.0, 5.0, 3.0, 7.0, 0.0, 1.0, 6.0, 2.0], [2.0, 4.0, 3.0, 0.0, 1.0], [1.0, 2.0, 3.0, 0.0, 4.0], [1.0, 3.0, 2.0, 0.0], [1.0, 0.0]], dtype=pl.List(pl.Float64)),        
    pl.Series('b', [[0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0], [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0], [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0], [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0], [0.0, 0.5, 1.0]], dtype=pl.List(pl.Float64))
])

Speed matters a lot here, which is why I'm rewriting this in Polars. Thanks.

gillesa

Asked: 2024-10-06 20:56:51 +0800 CST

Polars cut labels from a pandas perspective

8

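I'm migrating some code from pandas to Polars. I tried to use cut, but the Polars version differs (there is no bins argument, so I have to compute the breakpoints myself).

But I still don't understand the result of labels in Polars.

I have to pass more labels than I want in order to get the same result as pandas.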

import numpy as np
import pandas as pd
import polars as pl

# Example Polars DataFrame
data = {
    "value": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}
df_pl = pl.DataFrame(data)

# Convert to a pandas DataFrame to get the breakpoints
df_pd = df_pl.to_pandas()

# Use retbins to get the breakpoints (from pandas)
df_pd["cut_label_pd"], breakpoints = pd.cut(df_pd["value"], 4, labels=["low", "medium", "hight", "very high"], retbins=True)
print(pl.from_pandas(df_pd))

shape: (10, 2)
┌───────┬──────────────┐
│ value ┆ cut_label_pd │
│ ---   ┆ ---          │
│ i64   ┆ cat          │
╞═══════╪══════════════╡
│ 1     ┆ low          │
│ 2     ┆ low          │
│ 3     ┆ low          │
│ 4     ┆ medium       │
│ 5     ┆ medium       │
│ 6     ┆ hight        │
│ 7     ┆ hight        │
│ 8     ┆ very high    │
│ 9     ┆ very high    │
│ 10    ┆ very high    │
└───────┴──────────────┘

print(breakpoints)
# [ 0.991  3.25   5.5    7.75  10.   ]

Is there a better way? (Note the values in labels for the Polars cut.)

# Cut in polars
labels = ["don't use it", "low", "medium", "hight", "very high", "don't use it too"] 
df_pl = df_pl.with_columns(
    pl.col("value").cut(breaks=breakpoints, labels=labels).alias("cut_label_pl")
)

print(df_pl)

shape: (10, 2)
┌───────┬──────────────┐
│ value ┆ cut_label_pl │
│ ---   ┆ ---          │
│ i64   ┆ cat          │
╞═══════╪══════════════╡
│ 1     ┆ low          │
│ 2     ┆ low          │
│ 3     ┆ low          │
│ 4     ┆ medium       │
│ 5     ┆ medium       │
│ 6     ┆ hight        │
│ 7     ┆ hight        │
│ 8     ┆ very high    │
│ 9     ┆ very high    │
│ 10    ┆ very high    │
└───────┴──────────────┘

usdn

Asked: 2024-09-17 06:40:08 +0800 CST

Rolling mode in Polars

7

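I have a DataFrame about 100M rows long containing IDs in different groups. Some of them are wrong (denoted by 99). I'm trying to correct them with a rolling mode window, similar to the code example below. Is there a better way to do this, since rolling_map() is very slow?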

import polars as pl
from scipy import stats

def dummy(input):
    return stats.mode(input)[0]

df = pl.DataFrame({'group': [10, 10, 10, 10, 10, 10, 10, 20, 20, 20, 20],
                   'id': [1, 1, 99, 1, 1, 2, 2, 3, 3, 99, 3]})

df.with_columns(pl.col('id')
                 .rolling_map(function=dummy,
                              window_size=3,
                              min_periods=1,
                              center=True)
                 .over('group')
                 .alias('id_mode'))

shape: (11, 3)
╭───────┬─────┬─────────╮
│ group ┆  id ┆ id_mode │
│   i64 ┆ i64 ┆     i64 │
╞═══════╪═════╪═════════╡
│    10 ┆   1 ┆       1 │
│    10 ┆   1 ┆       1 │
│    10 ┆  99 ┆       1 │
│    10 ┆   1 ┆       1 │
│    10 ┆   1 ┆       1 │
│    10 ┆   2 ┆       2 │
│    10 ┆   2 ┆       2 │
│    20 ┆   3 ┆       3 │
│    20 ┆   3 ┆       3 │
│    20 ┆  99 ┆       3 │
│    20 ┆   3 ┆       3 │
╰───────┴─────┴─────────╯

Connor Elliott

Asked: 2024-09-13 04:11:14 +0800 CST

Polars query optimization: string operations on Categorical columns

5

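Do string operations on a Categorical column first cast the entire column to String and then perform the operation, or do they operate directly on the (potentially much smaller) categorical dictionary where possible?

For example df.filter(pl.col('my_category').cast(pl.String).str.contains(...)) (and friends such as str.starts_with(...)) or df.with_columns(pl.col('my_category').cast(pl.String).str.replace(...).cast(pl.Categorical))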

Phil-ZXX

Asked: 2024-09-11 21:17:40 +0800 CST

Polars pl.col(field).name.map_fields applies to all struct columns (not the one specified)

7

I have this code:

import polars as pl

cols = ['Delta', 'Qty']

metrics = {'CHECK.US': {'Delta': {'ABC': 1, 'DEF': 2}, 'Qty': {'GHIJ': 3, 'TT': 4}},
           'CHECK.NA': {},
           'CHECK.FR': {'Delta': {'QQQ': 7, 'ABC': 6}, 'Qty': {'SS': 9, 'TT': 5}}
          }

df = pl.DataFrame([{col: v.get(col) for col in cols} for v in metrics.values()])\
       .insert_column(0, pl.Series('key', metrics.keys()))\
       .with_columns([pl.col(col).name.map_fields(lambda x: f'{col} ({x})') for col in cols])

Now, df.unnest('Qty') correctly gives all columns in the form Qty (xxx):

shape: (3, 5)
┌──────────┬────────────┬────────────┬──────────┬──────────┐
│ key      ┆ Delta      ┆ Qty (GHIJ) ┆ Qty (TT) ┆ Qty (SS) │
│ ---      ┆ ---        ┆ ---        ┆ ---      ┆ ---      │
│ str      ┆ struct[3]  ┆ i64        ┆ i64      ┆ i64      │
╞══════════╪════════════╪════════════╪══════════╪══════════╡
│ CHECK.US ┆ {1,2,null} ┆ 3          ┆ 4        ┆ null     │
│ CHECK.NA ┆ null       ┆ null       ┆ null     ┆ null     │
│ CHECK.FR ┆ {6,null,7} ┆ null       ┆ 5        ┆ 9        │
└──────────┴────────────┴────────────┴──────────┴──────────┘

However, when I do the same with df.unnest('Delta'), it incorrectly returns the columns as Qty (xxx):

shape: (3, 5)
┌──────────┬───────────┬───────────┬───────────┬────────────┐
│ key      ┆ Qty (ABC) ┆ Qty (DEF) ┆ Qty (QQQ) ┆ Qty        │
│ ---      ┆ ---       ┆ ---       ┆ ---       ┆ ---        │
│ str      ┆ i64       ┆ i64       ┆ i64       ┆ struct[3]  │
╞══════════╪═══════════╪═══════════╪═══════════╪════════════╡
│ CHECK.US ┆ 1         ┆ 2         ┆ null      ┆ {3,4,null} │
│ CHECK.NA ┆ null      ┆ null      ┆ null      ┆ null       │
│ CHECK.FR ┆ 6         ┆ null      ┆ 7         ┆ {null,5,9} │
└──────────┴───────────┴───────────┴───────────┴────────────┘

The values look right, only the column names are wrong.

Am I using pl.col(col).name.map_fields(...) incorrectly? How can I fix the code so that the output becomes:

shape: (3, 5)
┌──────────┬─────────────┬─────────────┬─────────────┬────────────┐
│ key      ┆ Delta (ABC) ┆ Delta (DEF) ┆ Delta (QQQ) ┆ Qty        │
│ ---      ┆ ---         ┆ ---         ┆ ---         ┆ ---        │
│ str      ┆ i64         ┆ i64         ┆ i64         ┆ struct[3]  │
╞══════════╪═════════════╪═════════════╪═════════════╪════════════╡

?

user432299

Asked: 2024-09-11 01:57:42 +0800 CST

Python Polars: sample N-1 with replacement by group ID

6

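I'm working on a bootstrap project and need to sample M=N-1 observations with replacement, where N is the number of unique observations in a given group (defined by group_id). I need to figure out how to do this in Polars. Any solutions?

Here is an example showing what I want to accomplish: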

# Have:
water_data = {
    'group_id': [1,1,1,1,2,2,2,3,3,3,4,4,4,4,5,5,5],
    'obs_id_within_group': [1,2,3,4,1,2,3,1,2,3,1,2,3,4,1,2,3],
    'N': [4,4,4,4,3,3,3,3,3,3,4,4,4,4,3,3,3],
    'M': [3,3,3,3,2,2,2,2,2,2,3,3,3,3,2,2,2],
    'water_gallons': [12,23,21,11,10,10,10,23,24,25,27,30,17,12,11,14,20],
    'water_source': ['lake','lake','pond','river','lake','glacier','glacier','lake','pond','river','lake','lake','pond','river','river','lake','glacier'],
    'water_acidity': [3,4,5,1,2,4,3,2,3,3,4,6,7,8,8,3,1]
}
df=pl.DataFrame(water_data)
print(df)

# Want to randomly sample with replacement to:
sampled_water_data = {
    'group_id':            [1,1,1,2,2,3,3,4,4,4,5,5],
    'obs_id_within_group': [1,2,2,3,3,3,2,4,1,1,2,1],
    'N': [4,4,4,3,3,3,3,4,4,4,3,3],
    'M': [3,3,3,2,2,2,2,3,3,3,2,2],
    'water_gallons': [12,23,23,10,10,25,24,12,27,27,14,11],
    'water_source': ['lake','lake','lake','glacier','glacier','river','pond','river','lake','lake','lake','river'],
    'water_acidity': [3,4,4,3,3,3,3,8,4,4,5,8]
}
df_sampled=pl.DataFrame(sampled_water_data)
print(df_sampled)

I'm not sure how to draw a specific number from each group.
