Questions asked by ignoring_gravity

ignoring_gravity

Asked: 2025-01-09 18:07:38 +0800 CST

python: DuckDBPyRelation from a Python dict?

6

In Polars / pandas / PyArrow, I can instantiate an object from a dict, e.g.

In [12]: pl.DataFrame({'a': [1,2,3], 'b': [4,5,6]})
Out[12]:
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

Is there a way to do this in DuckDB, without going via pandas / pyarrow / etc.?
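One route that avoids pandas and PyArrow might be to render the dict as a SQL string in DuckDB's `select * from values ... df(a, b)` form and hand that string to `duckdb.sql`. The sketch below only builds the string, using the standard library. The helper name `dict_to_values_sql` is hypothetical, and it assumes finite int/float values and simple column names, since it does no quoting or escaping.

```python
def dict_to_values_sql(data, alias="df"):
    """Render a column-oriented dict as a `select * from values ...` query string.

    Hypothetical helper: only int/float values are accepted, so that no
    quoting or escaping is needed.
    """
    columns = list(data)
    if not columns:
        raise ValueError("need at least one column")
    lengths = {len(values) for values in data.values()}
    if len(lengths) != 1 or lengths == {0}:
        raise ValueError("all columns must have the same non-zero length")
    for values in data.values():
        for value in values:
            # bool is a subclass of int, so reject it explicitly
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                raise TypeError(f"unsupported value: {value!r}")
    # Transpose columns into rows: {'a': [1, 2], 'b': [4, 5]} -> (1, 4), (2, 5)
    rows = zip(*data.values())
    rendered_rows = ", ".join(
        "(" + ", ".join(repr(value) for value in row) + ")" for row in rows
    )
    return f"select * from values {rendered_rows} {alias}({', '.join(columns)})"


query = dict_to_values_sql({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(query)  # select * from values (1, 4), (2, 5), (3, 6) df(a, b)
```

The resulting string could then be passed to `duckdb.sql(query)`. For strings, nulls, or nested values a different approach would be needed.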

ignoring_gravity

Asked: 2024-12-29 03:09:13 +0800 CST

"n_unique" aggregation using the DuckDB relational API

7

Say I have

import duckdb

rel = duckdb.sql('select * from values (1, 4), (1, 5), (2, 6) df(a, b)')
rel

Out[3]: 
┌───────┬───────┐
│   a   │   b   │
│ int32 │ int32 │
├───────┼───────┤
│     1 │     4 │
│     1 │     5 │
│     2 │     6 │
└───────┴───────┘

I can group by a and find the mean of 'b' by doing:

rel.aggregate(
    [duckdb.FunctionExpression('mean', duckdb.ColumnExpression('b'))],
    group_expr='a',
)

┌─────────┐
│ mean(b) │
│ double  │
├─────────┤
│     4.5 │
│     6.0 │
└─────────┘

which works wonderfully.

Is there a similar way to create an "n_unique" aggregation? I'm looking for something like

rel.aggregate(
    [duckdb.FunctionExpression('count_distinct', duckdb.ColumnExpression('b'))],
    group_expr='a',
)

but that doesn't exist. Is there something that does?

ignoring_gravity

Asked: 2024-11-11 02:55:19 +0800 CST

Rolling sum using DuckDB's Python relational API

8

Say I have

data = {'id': [1, 1, 1, 2, 2, 2],
 'd': [1, 2, 3, 1, 2, 3],
 'sales': [1, 4, 2, 3, 1, 2]}

I'd like to compute a rolling sum with a window of 2, partitioned by 'id' and ordered by 'd'.

Using SQL I can do:

duckdb.sql("""
select *, sum(sales) over w as rolling_sales
from df
window w as (partition by id order by d rows between 1 preceding and current row)
""")
Out[21]:
┌───────┬───────┬───────┬───────────────┐
│  id   │   d   │ sales │ rolling_sales │
│ int64 │ int64 │ int64 │    int128     │
├───────┼───────┼───────┼───────────────┤
│     1 │     1 │     1 │             1 │
│     1 │     2 │     4 │             5 │
│     1 │     3 │     2 │             6 │
│     2 │     1 │     3 │             3 │
│     2 │     2 │     1 │             4 │
│     2 │     3 │     2 │             3 │
└───────┴───────┴───────┴───────────────┘

This works well, but how can I do it using the Python relational API?

I've got as far as

rel = duckdb.sql('select * from df')
rel.sum(
    'sales',
    projected_columns='*',
    window_spec='over (partition by id order by d rows between 1 preceding and current row)'
)

which gives

┌───────────────────────────────────────────────────────────────────────────────────────┐
│ sum(sales) OVER (PARTITION BY id ORDER BY d ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) │
│                                        int128                                         │
├───────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                     3 │
│                                                                                     4 │
│                                                                                     3 │
│                                                                                     1 │
│                                                                                     5 │
│                                                                                     6 │
└───────────────────────────────────────────────────────────────────────────────────────┘

This is close, but not quite right: how do I get the last column to be named rolling_sales?

ignoring_gravity

Asked: 2024-10-21 23:38:33 +0800 CST

Replace all values in an array according to a mapping

5

Say I have:

import pyarrow as pa

arr = pa.array([1, 3, 2, 2, 1, 3])

I'd like to replace the values according to {1: 'one', 2: 'two', 3: 'three'} and end up with:

<pyarrow.lib.LargeStringArray object at 0x7f8dd0b3c820>
[
  "one",
  "three",
  "two",
  "two",
  "one",
  "three"
]

I can do this by going via Polars:

In [19]: pl.from_arrow(arr).replace_strict({1: 'one', 2: 'two', 3: 'three'}, return_dtype=pl.String).to_arrow()
Out[19]:
<pyarrow.lib.LargeStringArray object at 0x7f8dd0b3c820>
[
  "one",
  "three",
  "two",
  "two",
  "one",
  "three"
]

Is there a way to do it with just PyArrow?
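One idea that might carry over to PyArrow is a two-step "look up the position, then gather" approach: find each value's position among the mapping keys, then take the mapping values at those positions. `pyarrow.compute` has functions named `index_in` and `take`. The plain-Python functions below are stand-ins that borrow those names to model the idea, not the PyArrow implementations, and whether chaining the real ones yields exactly the array type shown above is not verified here.

```python
def index_in(values, value_set):
    """Position of each value within value_set, or None if absent."""
    positions = {key: i for i, key in enumerate(value_set)}
    return [positions.get(value) for value in values]


def take(values, indices):
    """Gather values at the given positions; None indices stay None."""
    return [None if i is None else values[i] for i in indices]


mapping = {1: 'one', 2: 'two', 3: 'three'}
arr = [1, 3, 2, 2, 1, 3]

# Step 1: where does each element sit among the mapping keys?
indices = index_in(arr, list(mapping.keys()))
# Step 2: gather the mapped values at those positions.
result = take(list(mapping.values()), indices)
print(result)  # ['one', 'three', 'two', 'two', 'one', 'three']
```

In this sketch, values missing from the mapping come out as None rather than raising, which differs from the strict behaviour of `replace_strict`.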

ignoring_gravity

Asked: 2024-09-11 16:07:12 +0800 CST

pyarrow chunkedarray: get items at given indices

5

Say I have

In [3]: import pyarrow as pa

In [4]: ca = pa.chunked_array([[1,2,3], [4,5,6]])

I'd like to extract the elements at indices [1, 4, 2] and end up with

<pyarrow.lib.Int64Array object at 0x7f6eb43c2d40>
[
  2,
  5,
  3
]

as if I were doing NumPy-style indexing.
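To make the indexing semantics concrete, here is a plain-Python sketch of how a logical index can be resolved to a (chunk, offset) pair using cumulative chunk lengths. The function name `take_from_chunks` is hypothetical and the chunks are modelled as plain lists. PyArrow's `ChunkedArray` does have a `take` method, which may be the direct route; its result is chunked, so a `combine_chunks()` call might be needed to get a plain array.

```python
from bisect import bisect_right
from itertools import accumulate


def take_from_chunks(chunks, indices):
    """Gather elements by logical index from a list of chunks."""
    # Cumulative end offsets: chunks of length 3 and 3 -> [3, 6]
    ends = list(accumulate(len(chunk) for chunk in chunks))
    total = ends[-1] if ends else 0
    out = []
    for index in indices:
        if not 0 <= index < total:
            raise IndexError(f"index {index} out of bounds for length {total}")
        # First chunk whose end offset is strictly greater than the index
        chunk_number = bisect_right(ends, index)
        chunk_start = ends[chunk_number - 1] if chunk_number else 0
        out.append(chunks[chunk_number][index - chunk_start])
    return out


chunks = [[1, 2, 3], [4, 5, 6]]
print(take_from_chunks(chunks, [1, 4, 2]))  # [2, 5, 3]
```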

ignoring_gravity

Asked: 2024-09-05 21:17:09 +0800 CST

Minimum periods in a rolling mean

9

Say I have:

data = {
    'id': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'd': [1,2,3,0,1,2,3],
    'sales': [5,1,3,4,1,2,3],
}

I'd like to add a column with the rolling mean of 'sales', with a window size of 2 and min_periods=2, partitioned by 'id'.

In Polars, I can do:

import polars as pl

df = pl.DataFrame(data)
df.with_columns(sales_rolling = pl.col('sales').rolling_mean(2).over('id'))

shape: (7, 4)
┌─────┬─────┬───────┬───────────────┐
│ id  ┆ d   ┆ sales ┆ sales_rolling │
│ --- ┆ --- ┆ ---   ┆ ---           │
│ str ┆ i64 ┆ i64   ┆ f64           │
╞═════╪═════╪═══════╪═══════════════╡
│ a   ┆ 1   ┆ 5     ┆ null          │
│ a   ┆ 2   ┆ 1     ┆ 3.0           │
│ a   ┆ 3   ┆ 3     ┆ 2.0           │
│ b   ┆ 0   ┆ 4     ┆ null          │
│ b   ┆ 1   ┆ 1     ┆ 2.5           │
│ b   ┆ 2   ┆ 2     ┆ 1.5           │
│ b   ┆ 3   ┆ 3     ┆ 2.5           │
└─────┴─────┴───────┴───────────────┘

What's the DuckDB equivalent? I've tried

import duckdb

duckdb.sql("""
    select
        *,
        mean(sales) over (
            partition by id 
            order by d
            range between 1 preceding and 0 following
        ) as sales_rolling 
    from df
""").sort('id', 'd')

but get

┌─────────┬───────┬───────┬───────────────┐
│   id    │   d   │ sales │ sales_rolling │
│ varchar │ int64 │ int64 │    double     │
├─────────┼───────┼───────┼───────────────┤
│ a       │     1 │     5 │           5.0 │
│ a       │     2 │     1 │           3.0 │
│ a       │     3 │     3 │           2.0 │
│ b       │     0 │     4 │           4.0 │
│ b       │     1 │     1 │           2.5 │
│ b       │     2 │     2 │           1.5 │
│ b       │     3 │     3 │           2.5 │
└─────────┴───────┴───────┴───────────────┘

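This is very close, but duckdb still calculates the rolling mean when there's only a single value in the window. How can I replicate the min_periods=2 (default) behaviour from Polars?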
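A common SQL pattern for a minimum-periods rule is to guard the aggregate with a windowed count: emit the mean only when the window holds at least 2 non-null values, and NULL otherwise. The sketch below shows that pattern using `sqlite3` from the Python standard library as a stand-in engine, not DuckDB. It uses `avg` instead of `mean` and a `rows` frame instead of the `range` frame above. The same `case when count(...) over w >= 2` guard is expected to carry over to DuckDB, but that is an assumption not verified here.

```python
import sqlite3

data = {
    'id': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'd': [1, 2, 3, 0, 1, 2, 3],
    'sales': [5, 1, 3, 4, 1, 2, 3],
}

# Load the dict into an in-memory SQLite table (stand-in for `df`).
con = sqlite3.connect(':memory:')
con.execute('create table df (id text, d integer, sales integer)')
con.executemany(
    'insert into df values (?, ?, ?)',
    zip(data['id'], data['d'], data['sales']),
)

rows = con.execute("""
    select
        id, d, sales,
        -- only emit the mean when the window holds at least 2 values
        case when count(sales) over w >= 2 then avg(sales) over w end
            as sales_rolling
    from df
    window w as (
        partition by id
        order by d
        rows between 1 preceding and current row
    )
    order by id, d
""").fetchall()

for row in rows:
    print(row)
```

With this guard, the first row of each 'id' partition comes back as None, since its window contains a single value.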

ignoring_gravity

Asked: 2024-08-24 17:39:07 +0800 CST

How to zip together two PyArrow arrays?

6

In Polars, I can use zip_with to take values from s1 or s2 according to a mask:

In [1]: import polars as pl

In [2]: import pyarrow as pa

In [3]: import pyarrow.compute as pc

In [4]: s1 = pl.Series([1,2,3])

In [5]: mask = pl.Series([True, False, False])

In [6]: s2 = pl.Series([4, 5, 6])

In [7]: s1.zip_with(mask, s2)
Out[7]:
shape: (3,)
Series: '' [i64]
[
        1
        5
        6
]

How can I do this with PyArrow? I've tried pyarrow.compute.replace_with_mask, but it doesn't do the same thing:

In [10]: import pyarrow.compute as pc

In [11]: import pyarrow as pa

In [12]: a1 = pa.array([1,2,3])

In [13]: mask = pa.array([True, False, False])

In [14]: a2 = pa.array([4,5,6])

In [15]: pc.replace_with_mask(a1, pc.invert(mask), a2)
Out[15]:
<pyarrow.lib.Int64Array object at 0x7f69d411afe0>
[
  1,
  4,
  5
]

How can I replicate zip_with in PyArrow?
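The two results differ because the operations have different semantics: zip_with selects position by position, while replace_with_mask fills each true mask slot with the next unused replacement value. The plain-Python models below are stand-ins written for illustration, not the library implementations, and they reproduce both outputs shown above. For the positional behaviour, `pyarrow.compute.if_else(mask, a1, a2)` may be the closest PyArrow equivalent.

```python
def zip_with(mask, if_true, if_false):
    """Positional select: if_true[i] where mask[i] is true, else if_false[i]."""
    return [t if m else f for m, t, f in zip(mask, if_true, if_false)]


def replace_with_mask(values, mask, replacements):
    """Sequential replace: each true mask slot consumes the NEXT replacement."""
    next_replacement = iter(replacements)
    return [next(next_replacement) if m else v for v, m in zip(values, mask)]


a1 = [1, 2, 3]
mask = [True, False, False]
a2 = [4, 5, 6]

print(zip_with(mask, a1, a2))  # [1, 5, 6]
print(replace_with_mask(a1, [not m for m in mask], a2))  # [1, 4, 5]
```

In the second call the replacements 4 and 5 are consumed in order, which is why 5 and 6 never line up with their original positions.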

ignoring_gravity

Asked: 2024-07-15 18:05:17 +0800 CST

How to append a string to each element of a chunked array?

5

Say I have

In [22]: import pyarrow as pa

In [23]: t = pa.table({'a': ['one', 'two', 'three']})

I'd like to append '_frobenius' to each element of 'a'.

Expected output:

pyarrow.Table
a: string
----
a: [["one_frobenius","two_frobenius","three_frobenius"]]

ignoring_gravity

Asked: 2024-03-16 15:45:23 +0800 CST

Why does `.rename(columns={'b': 'b'}, copy=False)` followed by an inplace method not update the original dataframe?

7

Here's my example:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [1,2,3], 'b': [4,5,6]})

In [3]: df1 = df.rename(columns={'b': 'b'}, copy=False)

In [4]: df1.isetitem(1, [7,8,9])

In [5]: df
Out[5]:
   a  b
0  1  4
1  2  5
2  3  6

In [6]: df1
Out[6]:
   a  b
0  1  7
1  2  8
2  3  9

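If df1 was derived from df with copy=False, then I'd have expected an in-place modification of df1 to also affect df. But it doesn't. Why?

I'm using pandas version 2.2.1, without any options (e.g. copy-on-write) enabled.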
