Question

Padix Key

Asked: 2024-10-30 18:25:13 +0800 CST

How to sum values according to a second array of indices in a vectorized way

Suppose I have an array of values

values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

and an array of indices

indices = np.array([0,1,0,2,2])

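Is there a vectorized way to sum the values that share the same index in indices? In other words, I am looking for a vectorized way to compute sums from this snippet: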

sums = np.zeros(np.max(indices)+1)
for index, value in zip(indices, values):
    sums[index] += value

Bonus points if the solution allows values (and therefore sums) to be multidimensional.
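For reference, the multidimensional case could look like the sketch below. It assumes that grouping happens along the first axis of values and that the trailing axes are simply carried along. The two-column values array is made-up illustration data, and np.add.at is only one possible way to do this.

```python
import numpy as np

# Hypothetical 2-D data: one row per entry in `indices`
values = np.array([[0.0, 10.0],
                   [1.0, 11.0],
                   [2.0, 12.0],
                   [3.0, 13.0],
                   [4.0, 14.0]])
indices = np.array([0, 1, 0, 2, 2])

# One output row per possible index; trailing dimensions match `values`
sums = np.zeros((indices.max() + 1,) + values.shape[1:], dtype=values.dtype)

# Unbuffered in-place add: rows that share an index are accumulated
np.add.at(sums, indices, values)

print(sums)
```

Each row of sums then holds the column-wise total of all rows of values that share that index.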

Edit: I benchmarked the posted solutions:

import numpy as np
import time
import pandas as pd


values = np.arange(1_000_000, dtype=float)
rng = np.random.default_rng(0)
indices = rng.integers(0, 1000, size=1_000_000)


N = 100


now = time.time_ns()
for _ in range(N):
    sums = np.bincount(indices, weights=values, minlength=1000)
print(f"np.bincount: {(time.time_ns() - now) * 1e-6 / N:.3f} ms")


now = time.time_ns()
for _ in range(N):
    sums = np.zeros(1 + np.amax(indices), dtype=values.dtype)
    np.add.at(sums, indices, values)
print(f"np.add.at: {(time.time_ns() - now) * 1e-6 / N:.3f} ms")


now = time.time_ns()
for _ in range(N):
    pd.Series(values).groupby(indices).sum().values
print(f"pd.groupby: {(time.time_ns() - now) * 1e-6 / N:.3f} ms")


now = time.time_ns()
for _ in range(N):
    sums = np.zeros(np.max(indices)+1)
    for index, value in zip(indices, values):
        sums[index] += value
print(f"Loop: {(time.time_ns() - now) * 1e-6 / N:.3f} ms")

Results:

np.bincount: 1.129 ms
np.add.at: 0.763 ms
pd.groupby: 5.215 ms
Loop: 196.633 ms
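Before reading too much into the timings, it may be worth confirming that the approaches return the same sums. The following is a small check on a reduced problem size. The array sizes and the names via_bincount, via_add_at and via_loop are chosen here for illustration and are not part of the original benchmark.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 100
values = rng.random(10_000)
indices = rng.integers(0, n_bins, size=10_000)

# Approach 1: bincount with weights
via_bincount = np.bincount(indices, weights=values, minlength=n_bins)

# Approach 2: unbuffered in-place add
via_add_at = np.zeros(n_bins, dtype=values.dtype)
np.add.at(via_add_at, indices, values)

# Reference: the plain Python loop from the question
via_loop = np.zeros(n_bins)
for index, value in zip(indices, values):
    via_loop[index] += value

# allclose rather than exact equality, to tolerate floating-point rounding
print(np.allclose(via_bincount, via_loop), np.allclose(via_add_at, via_loop))
```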

3 Answers

tbhaxor · Answer 1 · 2024-10-30T18:48:16+08:00

You can treat the indices as bins and pass the values as weights to np.bincount.

values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
indices = np.array([0,1,0,2,2])

sums = np.bincount(indices, weights=values)
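As a quick illustration of what this returns, here is the same call with its output, plus the minlength argument, which pads the result when a fixed number of bins is wanted. The value 5 for minlength is an arbitrary choice for this example. Note that np.bincount expects non-negative integer indices and one-dimensional weights, so this approach does not directly cover the multidimensional case.

```python
import numpy as np

values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
indices = np.array([0, 1, 0, 2, 2])

# Each bin receives the sum of the weights that fall into it
sums = np.bincount(indices, weights=values)
print(sums)    # [2. 1. 7.]

# minlength guarantees at least that many bins; unused bins stay at 0
padded = np.bincount(indices, weights=values, minlength=5)
print(padded)  # [2. 1. 7. 0. 0.]
```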

3

PaulS · Answer 2 · 2024-10-30T20:14:59+08:00

Best Answer

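Another possible solution:

First, create an array of zeros, b, whose length is one more than the largest value in indices (this equals the number of unique indices when they run from 0 without gaps).
Then, use np.add.at to accumulate each element of values into the position of b given by the corresponding element of indices.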

b = np.zeros(1 + np.amax(indices), dtype=values.dtype)
np.add.at(b, indices, values)

Output:

array([2., 1., 7.])
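It may help to see why np.add.at is used here instead of the more obvious b[indices] += values. With fancy indexing, the in-place add is buffered, so positions that appear more than once in indices are written only once instead of being accumulated. The sketch below contrasts the two; the names buffered and unbuffered are used only for this illustration.

```python
import numpy as np

values = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
indices = np.array([0, 1, 0, 2, 2])

# Buffered: repeated indices overwrite each other instead of accumulating
buffered = np.zeros(1 + np.amax(indices), dtype=values.dtype)
buffered[indices] += values

# Unbuffered: every (index, value) pair is applied in turn
unbuffered = np.zeros(1 + np.amax(indices), dtype=values.dtype)
np.add.at(unbuffered, indices, values)

print(buffered)    # misses contributions from repeated indices
print(unbuffered)  # [2. 1. 7.]
```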

3

ThomasIsCoding · Answer 3 · 2024-10-30T20:38:45+08:00

You can try pandas groupby:

import pandas as pd

pd.Series(values).groupby(indices).sum().values

which gives

array([2., 1., 7.])
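One difference from the NumPy approaches is worth noting: groupby returns only the groups that actually occur, so an index value that never appears produces no entry rather than a zero. The sketch below uses made-up data with a gap at index 1 and shows one way to fill it with reindex. The variable names grouped and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which index 1 never occurs
values = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 2])

grouped = pd.Series(values).groupby(indices).sum()
print(grouped.to_numpy())  # [1. 5.]  (no entry for index 1)

# Reindex over the full range so that missing groups become 0.0
filled = grouped.reindex(np.arange(indices.max() + 1), fill_value=0.0)
print(filled.to_numpy())   # [1. 0. 5.]
```

After reindexing, the result lines up position by position with what np.bincount(indices, weights=values) returns for the same input.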

1
