
Question

Whitebeard13

Asked: 2025-04-04 18:22:33 +0800 CST

Group by and apply multiple custom functions on multiple columns in python pandas


Consider the following example data frame:

id  date        hrz tenor   1       2       3       4
AAA 16/03/2010  2   6m      0.54    0.54    0.78    0.19
AAA 30/03/2010  2   6m      0.05    0.67    0.20    0.03
AAA 13/04/2010  2   6m      0.64    0.32    0.13    0.20
AAA 27/04/2010  2   6m      0.99    0.53    0.38    0.97
AAA 11/05/2010  2   6m      0.46    0.90    0.11    0.14
AAA 25/05/2010  2   6m      0.41    0.06    0.96    0.31
AAA 08/06/2010  2   6m      0.19    0.73    0.58    0.80
AAA 22/06/2010  2   6m      0.40    0.95    0.14    0.56
AAA 06/07/2010  2   6m      0.22    0.74    0.85    0.94
AAA 20/07/2010  2   6m      0.34    0.17    0.03    0.77
AAA 03/08/2010  2   6m      0.13    0.32    0.39    0.95
AAA 16/03/2010  2   1y      0.54    0.54    0.78    0.19
AAA 30/03/2010  2   1y      0.05    0.67    0.20    0.03
AAA 13/04/2010  2   1y      0.64    0.32    0.13    0.20
AAA 27/04/2010  2   1y      0.99    0.53    0.38    0.97
AAA 11/05/2010  2   1y      0.46    0.90    0.11    0.14
AAA 25/05/2010  2   1y      0.41    0.06    0.96    0.31
AAA 08/06/2010  2   1y      0.19    0.73    0.58    0.80
AAA 22/06/2010  2   1y      0.40    0.95    0.14    0.56
AAA 06/07/2010  2   1y      0.22    0.74    0.85    0.94
AAA 20/07/2010  2   1y      0.34    0.17    0.03    0.77
AAA 03/08/2010  2   1y      0.13    0.32    0.39    0.95
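
For anyone who wants to run the snippets on this page, here is one way the sample could be loaded. It is only a sketch: the names `raw`, `six_m` and `one_y` are made up for illustration, and it relies on the observation that the 1y rows in the sample repeat the 6m values exactly.

```python
import io

import pandas as pd

# The eleven 6m rows from the sample table (whitespace separated).
raw = """id date hrz tenor 1 2 3 4
AAA 16/03/2010 2 6m 0.54 0.54 0.78 0.19
AAA 30/03/2010 2 6m 0.05 0.67 0.20 0.03
AAA 13/04/2010 2 6m 0.64 0.32 0.13 0.20
AAA 27/04/2010 2 6m 0.99 0.53 0.38 0.97
AAA 11/05/2010 2 6m 0.46 0.90 0.11 0.14
AAA 25/05/2010 2 6m 0.41 0.06 0.96 0.31
AAA 08/06/2010 2 6m 0.19 0.73 0.58 0.80
AAA 22/06/2010 2 6m 0.40 0.95 0.14 0.56
AAA 06/07/2010 2 6m 0.22 0.74 0.85 0.94
AAA 20/07/2010 2 6m 0.34 0.17 0.03 0.77
AAA 03/08/2010 2 6m 0.13 0.32 0.39 0.95
"""

six_m = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# In the sample, the 1y block carries the same numbers as the 6m block.
one_y = six_m.assign(tenor="1y")

df = pd.concat([six_m, one_y], ignore_index=True)

# Dates in the sample are day/month/year.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

print(df.shape)
print(df.head())
```

Note that the value columns end up with the string labels '1' to '4', since they come from a text header.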

How can I group by the variables id, hrz and tenor and apply the following custom functions across the dates?

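import numpy as np
import scipy.stats


def ks_test(x):
    # Kolmogorov-Smirnov statistic of the sample against the uniform distribution on [0, 1]
    return scipy.stats.kstest(np.sort(x), 'uniform')[0]


def cvm_test(x):
    # Cramér-von Mises statistic W^2 against the uniform distribution on [0, 1]
    n = len(x)
    i = np.arange(1, n + 1)
    x = np.sort(x)
    w2 = (1 / (12 * n)) + np.sum((x - ((2 * i - 1) / (2 * n))) ** 2)
    return w2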
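
As a quick sanity check, the two functions can be called directly on a single column before any grouping is involved. The sketch below uses the eleven values of column 1 from the sample; the variable names `col1`, `ks_value` and `cvm_value` are only illustrative. The answers further down report roughly 0.278182 and 0.220803 for this column.

```python
import numpy as np
import scipy.stats


def ks_test(x):
    # Kolmogorov-Smirnov statistic against the uniform distribution on [0, 1]
    return scipy.stats.kstest(np.sort(x), 'uniform')[0]


def cvm_test(x):
    # Cramér-von Mises statistic W^2 against the uniform distribution on [0, 1]
    n = len(x)
    i = np.arange(1, n + 1)
    x = np.sort(x)
    return (1 / (12 * n)) + np.sum((x - ((2 * i - 1) / (2 * n))) ** 2)


# Column 1 of the sample for one (id, hrz, tenor) group.
col1 = np.array([0.54, 0.05, 0.64, 0.99, 0.46, 0.41,
                 0.19, 0.40, 0.22, 0.34, 0.13])

ks_value = ks_test(col1)
cvm_value = cvm_test(col1)

print(ks_value, cvm_value)
```

Each function reduces a one-dimensional array to a single number, which is what makes them usable as aggregation functions in a groupby.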

The desired output is the following data frame (the figures are for illustration only):

id   hrz    tenor   test        1       2       3       4
AAA  2      6m      ks_test     0.04    0.06    0.02    0.03
AAA  2      6m      cvm_test    0.09    0.17    0.03    0.05
AAA  2      1y      ks_test     0.04    0.06    0.02    0.03
AAA  2      1y      cvm_test    0.09    0.17    0.03    0.05

2 Answers


jezrael · Answer 1 · 2025-04-04T18:30:43+08:00

Use GroupBy.agg with DataFrame.stack to reshape the last level of the MultiIndex in the columns:

cols = ['id','hrz', 'tenor']
out = (df.groupby(cols)[df.columns.difference(cols + ['date'], sort=False)]
        .agg([ks_test, cvm_test])
        .rename_axis([None, 'test'], axis=1)
        .stack(future_stack=True)
        .reset_index())

print (out)
    id  hrz tenor      test         1         2         3         4
0  AAA    2    1y   ks_test  0.278182  0.166364  0.254545  0.224545
1  AAA    2    1y  cvm_test  0.220803  0.044730  0.158839  0.118321
2  AAA    2    6m   ks_test  0.278182  0.166364  0.254545  0.224545
3  AAA    2    6m  cvm_test  0.220803  0.044730  0.158839  0.118321

How it works:

print (df.groupby(cols)[df.columns.difference(cols +['date'], sort=False)]
        .agg([ks_test, cvm_test]))

                      1                   2                  3            \
                ks_test  cvm_test   ks_test cvm_test   ks_test  cvm_test   
id  hrz tenor                                                              
AAA 2   1y     0.278182  0.220803  0.166364  0.04473  0.254545  0.158839   
        6m     0.278182  0.220803  0.166364  0.04473  0.254545  0.158839   

                      4            
                ks_test  cvm_test  
id  hrz tenor                      
AAA 2   1y     0.224545  0.118321  
        6m     0.224545  0.118321

mozway · Answer 2 · 2025-04-04T18:30:03+08:00

You can build the groupby object once, apply each of your functions with groupby.agg, and concat the outputs:

group = ['id', 'hrz', 'tenor']
cols = df.columns.difference(group+['date'])

g = df.groupby(group)[cols]

out = (pd.concat({'ks_test': g.agg(ks_test),
                  'cvm_test': g.agg(cvm_test),
                 }, names=['test'])
       .sort_index(level=group, kind='stable', sort_remaining=False)
       .reset_index()
      )

Output:

       test   id  hrz tenor         1         2         3         4
0   ks_test  AAA    2    1y  0.278182  0.166364  0.254545  0.224545
1  cvm_test  AAA    2    1y  0.220803  0.044730  0.158839  0.118321
2   ks_test  AAA    2    6m  0.278182  0.166364  0.254545  0.224545
3  cvm_test  AAA    2    6m  0.220803  0.044730  0.158839  0.118321

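Alternatively, pass all your functions at once and stack (in this case the names in the test column are taken from the function names):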

group = ['id', 'hrz', 'tenor']
cols = df.columns.difference(group+['date'])

out = (
    df.groupby(group)[cols]
    .agg([ks_test, cvm_test])
    .rename_axis([None, 'test'], axis=1)
    .stack()
    .reset_index()
)

Output:

    id  hrz tenor      test         1         2         3         4
0  AAA    2    1y   ks_test  0.278182  0.166364  0.254545  0.224545
1  AAA    2    1y  cvm_test  0.220803  0.044730  0.158839  0.118321
2  AAA    2    6m   ks_test  0.278182  0.166364  0.254545  0.224545
3  AAA    2    6m  cvm_test  0.220803  0.044730  0.158839  0.118321
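
As a rough check that the two variants agree, they can be run side by side on a small frame and compared after aligning column and row order. This is a sketch only: `toy`, `spread` and `midpoint` are hypothetical stand-ins for the real data and for ks_test / cvm_test.

```python
import pandas as pd

toy = pd.DataFrame({
    "key": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 5.0, 7.0],
    "y": [2.0, 4.0, 6.0, 8.0],
})


def spread(s):
    # Hypothetical aggregation: range of the group
    return s.max() - s.min()


def midpoint(s):
    # Hypothetical aggregation: centre of the group's range
    return (s.max() + s.min()) / 2


g = toy.groupby("key")[["x", "y"]]

# Variant 1: one agg call per function, concatenated with the
# dictionary keys becoming the outer 'test' index level.
via_concat = (
    pd.concat({"spread": g.agg(spread), "midpoint": g.agg(midpoint)},
              names=["test"])
    .sort_index(level=["key"], kind="stable", sort_remaining=False)
    .reset_index()
)

# Variant 2: all functions in one agg call, then stack the function level.
via_stack = (
    g.agg([spread, midpoint])
    .rename_axis([None, "test"], axis=1)
    .stack()
    .reset_index()
)

# Bring both results to the same column and row order before comparing.
order = ["key", "test", "x", "y"]
a = via_concat[order].sort_values(["key", "test"]).reset_index(drop=True)
b = via_stack[order].sort_values(["key", "test"]).reset_index(drop=True)

same = a.values.tolist() == b.values.tolist()
print(same)
```

The only visible difference between the two results is the position of the test column: first in the concat variant, after the grouping keys in the stack variant.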
