I have a Parquet file that is being written to by a continuous loop like this:
def process_data(self):
    # ... other code ...
    with pq.ParquetWriter(self.destination_file, schema) as writer:
        with tqdm(total=total_rows, desc="Processing nodes") as pbar:
            for i in range(0, total_rows, self.batch_size):
                # ... processing code ...
                # Create a table from the batched data
                batch_table = pa.Table.from_arrays(
                    [
                        pa.array(node_ids),
                        pa.array(mut_positions),
                        pa.array(new_6mers),
                        pa.array(context_embeddings),
                        pa.array(nonmutation_contexts),
                    ],
                    schema=schema
                )
                # Write the batch table
                writer.write_table(batch_table)
                # ...
                pbar.update(len(batch_indices))
This loop was cut short because the computer shut down abruptly mid-run.
Now, when I try to read the file with pq.read_table, I (expectedly) get an error:
pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from 'data/processed/data_with_embeddings.parquet'. Is this a 'parquet' file?: Could not open Parquet input source 'data/processed/data_with_embeddings.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
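As a sanity check that this really is a truncated-footer situation (rather than some other corruption), I believe a valid Parquet file should both start and end with the 4-byte magic `PAR1`. Something like the sketch below (the helper name is mine, just for illustration) should show the header magic present but the footer magic missing, which matches the "magic bytes not found in footer" part of the error:

```python
import os

def parquet_footer_present(path):
    """Return True if the file starts AND ends with the Parquet magic b'PAR1'.

    A file left behind by an interrupted ParquetWriter typically still has
    the header magic, but the footer (schema + row-group metadata + trailing
    magic) is only written on close, so the tail check fails.
    """
    if os.path.getsize(path) < 8:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to 4 bytes before EOF
        tail = f.read(4)
    return head == b"PAR1" and tail == b"PAR1"
```

In my case I would expect this to return False for the file from the error message above.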
I'm desperately hoping there is some way around this, for example a workaround that loses a few rows (or more) but salvages most of the data. I searched the web, but there doesn't seem to be any information on this, or what does exist is beyond my expertise (which is probably apparent from my use of tags, for which I apologize in advance).
Is there still hope?