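Okay, I'm using the for loop shown below to transform this data into the data further down, using XOR accumulation. With the number of entries I have (830401 rows), this is very, very slow. Is there any way to speed up this kind of accumulation in pandas, or to do it in numpy and then write the result back to the numpy array itself?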
In [122]: acctable[0:20]
Out[122]:
        what  dx1  dx2  dx3  dx4  dx5  dx6  dx7  dx8  dx9
0          4    2   10    8    0    5    7    1   13   11
1          4    0    0    0    0    0    0    0    0    0
2          6    0    0    0    0    0    0    0    0    0
3         14    0    0    0    0    0    0    0    0    0
4         12    0    0    0    0    0    0    0    8    0
5          4    0    0    0    0    0    0    0    0    0
6          1    0    0    0    0    0    0    0    0    0
...      ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
830477    15    0    0    0    0    0    0    0    0    0
830478     3    0    0    0    0    0    0    0    0    0
830479    11    0    0    0    0    0    0    0    0    0
830480     9    0    0    0    0    0    0    0    0    0
830481    11    0    0    0    0    0    0    0    0    0
[830482 rows x 10 columns]
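Here is what I've tried. It can actually take a full minute, and I have larger datasets to work with, so any shortcut or better approach would be very helpful: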
# Update: Instead of all 800k values of 'what', I put the first 6 numbers in rstr so you can see how I'm XOR accumulating. You should be able to copy/paste the first 6 rows of the data with pd.read_clipboard() and assign them to acctable.
In [121]: rstr
Out[121]: array([ 4, 4, 12, 14, 6, 4], dtype=int8)
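import numpy as np

dt = np.int8
end = -1  # drop the last accumulated value so the result fits the row width
rstr = np.array(acctable.loc[:5, 'what'], dtype=dt)
for x in range(4):
    # Prepend a 'what' value to row x, XOR-accumulate, and store the result as row x+1
    wuttr = np.bitwise_xor.accumulate(np.r_[[rstr[-(x+1)]], acctable.loc[x, 'what':]], dtype=dt)
    acctable.loc[x+1, "what":] = wuttr[:end]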
After:
In [122]: acctable[0:20]
Out[122]:
        what  dx1  dx2  dx3  dx4  dx5  dx6  dx7  dx8  dx9
0          4    2   10    8    0    5    7    1   13   11
1          4    0    2    8    0    0    5    2    3   14
2          6    2    2    0    8    8    8   13   15   12
3         14    8   10    8    8    0    8    0   13    2
4         12    2   10    0    8    0    0    8    8    5
5          4    8   10    0    0    8    8    8    0    8
6          1    5   13    7    7    7   15    7   15   15
...      ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
830477    15   15    7    0    0    5    9   14   10    3
830478     3   12    3    4    4    4    1    8    6   12
830479    11    8    4    7    3    7    3    2   10   12
830480     9    2   10   14    9   10   13   14   12    6
830481    11    2    0   10    4   13    7   10    4    8
[830482 rows x 10 columns]
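It's a simple accumulation, but you need the previous row before you can continue accumulating, and the only way I've found to do that is a for loop. Also, the 'rstr' variable is actually the 'what' column.
Thanks!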
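One possible way around the row-by-row loop, offered as a sketch and not as anything taken from the post: in the "After" table every entry depends only on the column to its left, new[r+1, j+1] = new[r, j] ^ new[r+1, j]. That means the work can be done one column at a time, with each step vectorized over all rows, so the Python loop runs 9 times instead of 830k times. The function name `xor_accumulate_by_column` and the `sample` array are hypothetical; `sample` is just the first seven rows shown above.

```python
import numpy as np


def xor_accumulate_by_column(vals):
    """Fill columns 1.. of rows 1.. using out[r+1, j+1] = out[r, j] ^ out[r+1, j].

    Row 0 and column 0 (the 'what' column) are treated as given seeds.
    Works on a copy, so the input array is left unchanged.
    """
    out = np.array(vals, copy=True)
    for j in range(out.shape[1] - 1):
        # Column j is already final here, so column j+1 can be computed
        # for every row at once.
        out[1:, j + 1] = out[:-1, j] ^ out[1:, j]
    return out


# First seven rows of the "before" table (hypothetical sample input).
sample = np.array(
    [
        [4, 2, 10, 8, 0, 5, 7, 1, 13, 11],
        [4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [6, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [14, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [12, 0, 0, 0, 0, 0, 0, 0, 8, 0],
        [4, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    ],
    dtype=np.int8,
)

result = xor_accumulate_by_column(sample)
# result[1] is [4, 0, 2, 8, 0, 0, 5, 2, 3, 14], matching row 1 of the "After" table
```

For a DataFrame holding a single integer dtype, the result could presumably be written back with something like `acctable.iloc[:, :] = xor_accumulate_by_column(acctable.to_numpy())`; treat that line as an assumption to verify on your own data.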
I got this result from an AI, but it only works for the first row:
what_arr = acctable['what'].to_numpy().reshape(-1) # Reshape to ensure 1D array
# Modified XOR accumulation:
all_what_arr = np.concatenate([[what_arr[0]], what_arr[1:]])
cumulative_xor = np.bitwise_xor.accumulate(all_what_arr)
shifted_xor = cumulative_xor[1:].reshape(-1, 1)
acctable.iloc[1:, 1:] = shifted_xor ^ acctable.iloc[1:, 1:]
In [171]: acctable
Out[171]:
   what  dx1  dx2  dx3  dx4  dx5  dx6  dx7  dx8  dx9
0     4    2   10    8    0    5    7    1   13   11
1     6    0    2    8    0    0    5    2    3   14
2    14    4    6    6   12   14   12   11   11   10
3    12    2   10    0   10   10    8    8   15    8
4     4   12   14    4    4   14    4   12    4   11
Here are the timeit values. You can see that Andrej's modification and the use of njit are a major factor in the speedup!
In [262]: import timeit
...:
...: setup = """
...: import numpy as np
...: import pandas as pd
...: from numba import njit
...:
...:
...: def do_work_no_njit(df):
...:     dt = np.int8
...:     end = -1
...:     rstr = np.array(df.loc[:, 0], dtype=dt)
...:     for x in range(len(df)):
...:         wuttr = np.bitwise_xor.accumulate(np.r_[[rstr[-(x+1)]], df.loc[x, 0:]], dtype=dt)
...:         df.loc[x+1, 0:] = wuttr[:end]
...:
...: @njit
...: def do_work(vals):
...:     for row in range(vals.shape[0] - 1):
...:         for i in range(vals.shape[1] - 1):
...:             vals[row + 1, i + 1] = vals[row, i] ^ vals[row + 1, i]
...:
...: # Replace with your DataFrame creation code
...: df = pd.DataFrame(np.random.randint(0, 15, size=(1000000, 10)), dtype=np.int8) # Example DataFrame
...: """
...:
...: stmt = """
...: do_work(df.values)
...: """
...:
...: stmtnonjit = """
...: do_work_no_njit(df.copy())
...: """
...:
...: number = 1 # Adjust the number of repetitions as needed
...:
...: time = timeit.timeit(stmtnonjit, setup, number=number)
...: print(f"Average time per execution no njit: {time / number:.4f} seconds")
...:
...: time = timeit.timeit(stmt, setup, number=number)
...: print(f"Average time per execution with njit and optimized code by Andrej: {time / number:.4f} seconds")
...:
Average time per execution no njit: 73.3801 seconds
Average time per execution with njit and optimized code by Andrej: 0.0442 seconds
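A timing comparison is only meaningful if both versions compute the same thing, so a quick equivalence check on a small random array may be worth running before benchmarking. The sketch below is an assumption-laden illustration, not code from the post: `xor_rows_accumulate` restates the row-by-row `np.bitwise_xor.accumulate` idea on a plain NumPy array (seeding each row with its own first-column value), and `xor_elementwise` is the same double loop as `do_work` without the `@njit` decorator so that it runs without numba. Both function names are hypothetical.

```python
import numpy as np


def xor_rows_accumulate(vals):
    """Row-by-row version: row r+1 is an XOR prefix scan over row r,
    seeded with row r+1's own first-column value."""
    out = vals.copy()
    for r in range(out.shape[0] - 1):
        seed = out[r + 1, 0]
        scan = np.bitwise_xor.accumulate(np.r_[seed, out[r]])
        out[r + 1] = scan[:-1]  # drop the last value so the width matches
    return out


def xor_elementwise(vals):
    """Same double loop as do_work, written in plain Python (no numba)."""
    out = vals.copy()
    for row in range(out.shape[0] - 1):
        for i in range(out.shape[1] - 1):
            out[row + 1, i + 1] = out[row, i] ^ out[row + 1, i]
    return out


# Small random input with the same value range and dtype as the benchmark.
rng = np.random.default_rng(0)
sample = rng.integers(0, 15, size=(200, 10)).astype(np.int8)

results_match = np.array_equal(xor_rows_accumulate(sample), xor_elementwise(sample))
```

If `results_match` is true on a few random inputs, the faster version can be swapped in with more confidence.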