I am using Polars, and I have a dataset in which one column holds lists of strings. To see what it looks like:
import pandas as pd
import polars as pl
list_of_lists = [
    ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
    ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
    ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
    ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
]
pd_df = pd.DataFrame({'lol': list_of_lists})
This gives:
lol
0 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
1 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
2 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
3 ['base', 'base.current base', 'base.current base.inventories - total', 'ABCD']
I want to hash each list. My idea is to convert each list to a string and then hash that string. I can do this by going through Pandas:
pd_df = pd.DataFrame({'lol': list_of_lists}).astype({'lol':str})
pl_df_1 = pl.DataFrame(pd_df)
pl_df_1.with_columns(pl.col('lol')
.hash(seed=140)
.name.suffix('_hashed')
)
This gives:
lol lol_hashed
str u64
"['base', 'base.current base', … 14283628883798345624
"['base', 'base.current base', … 14283628883798345624
"['base', 'base.current base', … 14283628883798345624
"['base', 'base.current base', … 14283628883798345624
But if I try to do something similar directly in Polars, I get an error:
pl_df_2 = pl.DataFrame({'lol': list_of_lists})
pl_df_2.with_columns(pl.col('lol') # <== inserting .cast(pl.String) here still gives the error
.hash(seed=140)
.name.suffix('_hashed')
)
This gives:
# PanicException: Hashing a list with a non-numeric inner type not supported.
# Got dtype: List(String)
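I would prefer to stay within Polars, so is it possible to convert the list column to a string, or is there a better way in Polars to achieve the same result?
Update:
Based on the accepted answer, I experimented further.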
list_of_lists = [
['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
['base', 'base.current base', 'base.current base.inventories - total', 'DEFG'],
['base', 'base.current base', 'base.current base.inventories - total', 'ABCD'],
['base', 'base.current base', 'base.current base.inventories - total', 'HIJK'],
'(bobbyJoe460)',
'bobby, Joe (xx866e)',
137642039575
]
pl_df_1 = pl.DataFrame({'lol': list_of_lists}, strict=False) # <==== allow mixed types in column
pl_df_1.with_columns(pl.col('lol')
.cast(pl.Categorical) # <==== cast to Categorical
.hash(seed=140)
.name.suffix('_hashed')
)
This gives:
lol lol_hashed
str u64
"["base", "base.current base", … 11231070086490249882
"["base", "base.current base", … 6519339301964281776
"["base", "base.current base", … 11231070086490249882
"["base", "base.current base", … 14549859594875138034
"(bobbyJoe460)" 1954884316252525743
"bobby, Joe (xx866e)" 4241414284122449899
"137642039575" 6383308039250228053
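The PanicException is a bug and can be reported.
.list.join() can be used to create a "single string" that can be hashed. You can also cast to a Categorical, which can be hashed as well.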