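I have a dataset containing hierarchical information and KO numbers, and I want to format it in Python as a TSV (tab-separated values) file. The first column should contain the KO number, the second the description, and the third the hierarchy, based on the nearest "A" section in the input data. The hierarchy should include the "A", "B", and "C" elements, ending at the nearest "C" section. In addition, if the same KO number appears under different hierarchies, those hierarchies should be separated by |. The input data below is in file.keg format.
Input data: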
A09100 Metabolism
B
B 09101 Carbohydrate metabolism
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D K00844 HK; hexokinase [EC:2.7.1.1]
D K12407 GCK; glucokinase [EC:2.7.1.2]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B 09103 Lipid metabolism
C 00071 Fatty acid degradation [PATH:ko00071]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B 09121 Transcription
C 03020 RNA polymerase [PATH:ko03020]
D K03043 rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D K13797 rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]
Expected output:
KO	metadata_KEGG_Description	metadata_KEGG_Pathways
K00844	HK; hexokinase [EC:2.7.1.1]	Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K12407	GCK; glucokinase [EC:2.7.1.2]	Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis
K00001	E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]	Metabolism, Carbohydrate metabolism, Glycolysis / Gluconeogenesis|Metabolism, Lipid metabolism, Fatty acid degradation
K03043	rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]	Genetic Information Processing, Transcription, RNA polymerase
K13797	rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]	Genetic Information Processing, Transcription, RNA polymerase
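I would appreciate any help or guidance on how to process this data into the desired TSV file based on the hierarchical information provided. Thank you for your help!
Here is my code: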
data = """A09100 Metabolism
B
B 09101 Carbohydrate metabolism
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D K00844 HK; hexokinase [EC:2.7.1.1]
D K12407 GCK; glucokinase [EC:2.7.1.2]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B 09103 Lipid metabolism
C 00071 Fatty acid degradation [PATH:ko00071]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B 09121 Transcription
C 03020 RNA polymerase [PATH:ko03020]
D K03043 rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D K13797 rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]"""
lines = data.split('\n')
result = []
ko = None
description = None
hierarchy_names = []
for line in lines:
    parts = line.strip().split()
    if parts:
        if parts[0].startswith('A'):
            # Reset hierarchy for a new 'A' section
            hierarchy_names = [" ".join(parts[1:])]
        elif parts[0] == 'K':
            ko = parts[0]
            description = " ".join(parts[1:])
        elif parts[0] == 'D' and len(parts) >= 3:
            ko = parts[1]
            description = " ".join(parts[2:])
        else:
            hierarchy_names.append(" ".join(parts[1:]))
        if ko and description:
            hierarchy_str = ", ".join(hierarchy_names)
            result.append([ko, description, hierarchy_str])
# Add the header row
result.insert(0, ["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"])
# Specify the filename for the TSV file
tsv_filename = "output_data.tsv"
with open(tsv_filename, 'w') as tsv_file:
    for row in result:
        tsv_file.write("\t".join(row) + "\n")
print(f"Data saved to {tsv_filename}")
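I suggest you take a look at Orange Bioinformatics's DBGETEntryParser, which handles KEGG files. Otherwise, if you want to use pandas with some regex help, you can try the following:
Output (in table format):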
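As a minimal sketch of the pandas-plus-regex approach (not necessarily the code originally posted here), the hierarchy can be tracked per level and the repeated KO numbers collapsed with a groupby. The names KEG_TEXT, LINE_RE, PATH_TAG_RE, parse_keg and to_table are made up for this illustration, and the regexes assume the line layout shown in the question (a level letter, an ID, then a name), which may not cover every real .keg file.

```python
import re

import pandas as pd

# Sample input copied from the question.
KEG_TEXT = """A09100 Metabolism
B
B 09101 Carbohydrate metabolism
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D K00844 HK; hexokinase [EC:2.7.1.1]
D K12407 GCK; glucokinase [EC:2.7.1.2]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B 09103 Lipid metabolism
C 00071 Fatty acid degradation [PATH:ko00071]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B 09121 Transcription
C 03020 RNA polymerase [PATH:ko03020]
D K03043 rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D K13797 rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]"""

# Assumed layout: level letter, optional whitespace, an ID token, then the name.
LINE_RE = re.compile(r"^([ABCD])\s*(\S+)\s+(.*)$")
# Trailing annotation such as " [PATH:ko00010]" on C lines.
PATH_TAG_RE = re.compile(r"\s*\[PATH:[^\]]*\]\s*$")

COLUMNS = ["KO", "metadata_KEGG_Description", "metadata_KEGG_Pathways"]


def parse_keg(text):
    """Return one row per D line, tagged with the current A/B/C names."""
    levels = {}  # level letter -> name of the section currently open
    rows = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # e.g. the bare "B" separator lines
        level, ident, name = match.groups()
        if level == "D":
            pathway = ", ".join(levels[k] for k in "ABC" if k in levels)
            rows.append((ident, name, pathway))
        else:
            levels[level] = PATH_TAG_RE.sub("", name)
            # A new A or B section invalidates the deeper levels below it.
            for deeper in "ABC"["ABC".index(level) + 1:]:
                levels.pop(deeper, None)
    return pd.DataFrame(rows, columns=COLUMNS)


def to_table(df):
    """Collapse repeated KO numbers, joining distinct hierarchies with '|'."""
    return (
        df.groupby("KO", sort=False)
        .agg({
            "metadata_KEGG_Description": "first",
            "metadata_KEGG_Pathways": lambda s: "|".join(dict.fromkeys(s)),
        })
        .reset_index()
    )


table = to_table(parse_keg(KEG_TEXT))
print(table.to_string(index=False))

if __name__ == "__main__":
    # Write the tab-separated file requested in the question.
    table.to_csv("output_data.tsv", sep="\t", index=False)
```

Passing sort=False to groupby keeps the KO numbers in order of first appearance, and dict.fromkeys drops duplicate hierarchies while preserving their order.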
Graph visualization with networkx:
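One way this could look, as a rough sketch rather than the code originally posted: build a directed graph whose edges run from each section to its children (A to B, B to C, C to KO number), then draw it. The helper names build_graph and draw are hypothetical, the parsing makes the same layout assumptions as the question's sample, and drawing additionally assumes matplotlib is installed.

```python
import re

import networkx as nx

# Sample input copied from the question.
KEG_TEXT = """A09100 Metabolism
B
B 09101 Carbohydrate metabolism
C 00010 Glycolysis / Gluconeogenesis [PATH:ko00010]
D K00844 HK; hexokinase [EC:2.7.1.1]
D K12407 GCK; glucokinase [EC:2.7.1.2]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
B 09103 Lipid metabolism
C 00071 Fatty acid degradation [PATH:ko00071]
D K00001 E1.1.1.1, adh; alcohol dehydrogenase [EC:1.1.1.1]
A09120 Genetic Information Processing
B
B 09121 Transcription
C 03020 RNA polymerase [PATH:ko03020]
D K03043 rpoB; DNA-directed RNA polymerase subunit beta [EC:2.7.7.6]
D K13797 rpoBC; DNA-directed RNA polymerase subunit beta-beta' [EC:2.7.7.6]"""

LEVELS = "ABCD"
# Assumed layout: level letter, optional whitespace, an ID token, then the name.
LINE_RE = re.compile(r"^([ABCD])\s*(\S+)\s+(.*)$")
PATH_TAG_RE = re.compile(r"\s*\[PATH:[^\]]*\]\s*$")


def build_graph(text):
    """Build a DiGraph with parent -> child edges between hierarchy levels."""
    graph = nx.DiGraph()
    current = {}  # level letter -> node label of the section currently open
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # skip the bare "B" separator lines
        level, ident, name = match.groups()
        # KO entries are labelled by their number, sections by their name.
        label = ident if level == "D" else PATH_TAG_RE.sub("", name)
        graph.add_node(label, level=level)
        depth = LEVELS.index(level)
        if depth > 0:
            parent = current.get(LEVELS[depth - 1])
            if parent is not None:
                graph.add_edge(parent, label)
        current[level] = label
        # Entering a new section closes any deeper sections.
        for deeper in LEVELS[depth + 1:]:
            current.pop(deeper, None)
    return graph


def draw(graph, path="kegg_hierarchy.png"):
    """Save a simple spring-layout drawing (requires matplotlib)."""
    import matplotlib.pyplot as plt

    pos = nx.spring_layout(graph, seed=0)
    nx.draw(graph, pos, with_labels=True, node_size=300, font_size=7, arrows=True)
    plt.savefig(path, dpi=150, bbox_inches="tight")
    plt.close()


graph = build_graph(KEG_TEXT)
print(sorted(graph.predecessors("K00001")))
```

Calling draw(graph) writes the picture to kegg_hierarchy.png. A KO number that occurs in several pathways, such as K00001, shows up as a single node with more than one incoming edge.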
I got this code, thanks.