I have 2 input files. The first, input1, looks like this:
Equus caballus
Monodelphis domestica
Saccharomyces cerevisiae S288c
The second, input2, looks like this (sample lines shown):
>CM000377.2/60448635-60448529 Equus caballus chromosome 1, whole genome shotgun sequence. ATCGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTATTCTTATCAGTTTAAAACTAGTGGTGAAATGAGATGTAGACAGTAACATTTGAATTACAACATCA
>CM000377.2/105043590-105043453 Equus caballus chromosome 1, whole genome shotgun sequence. ATTGCTTCTTGGCCTTTTGGCTAAGATCAAGTATAGTATCTGTTCTCATCAATTTAAAAATGGCAATATAAATAGACCCATAGTAGATCCAGATAATGGTGTTATCAGAAAAGGACTTTAAGTAATTTAATATGTTCA
>CM000377.2/137942042-137941941 Equus caballus chromosome 1, whole genome shotgun sequence. ATCGCTTCTCAGACTTTTGGCTAAGATCAAGCGTAGTATCTGTTCTTATCAGTAATTAACTTCAGAAAAGTTAACTCATCTTCAGCAAGGCAGTAATCCCCT
>CM000377.2/97988860-97989002 Equus caballus chromosome 1, whole genome shotgun sequence. ATCGCTTCTTGGCCTTTTGGCTAAGATCAAGTGTAGGAATCAATGAATTTCTGGTTATGGAGGCTAAAATGATATCTAATCTTGACTTAATCTAGGTCTCTTCAGTATTTGTCACCCTTTACTACATTCTCTGCTGATGCACT
>CM000377.2/77415658-77415776 Equus caballus chromosome 1, whole genome shotgun sequence. ACTGCTTCTTCGCCTTTTGGCTAAAATCAAGTATAGTATCTGTTCTTACCAGTTTAAGTACTTTTTGTGCTTCTCATGGCTATAAGCCATAATTGCTGTTATAACGGTAAGGATTTTTC
>CM000377.2/172045138-172045024 Equus caballus chromosome 1, whole genome shotgun sequence. ATTGCTTCTTGGCCTTTCAGCTAAGATCAAGTGTTGTATCTGTTCGTATCAGTTTAAATCATTCTGCACCAAAGATATGTCTCTTCTTCTCCATTTATTAATTTGTTCACTTATT
>CM000378.2/50070490-50070688 Equus caballus chromosome 2, whole genome shotgun sequence. ATTGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTAATTGATTATCTCAAGTTAAGGAGAACTCACTACATCCCAAAGTCTCATTCTTTGTCTGAGTCTTGACACACATACTTCTTTCTGTGAGTATGTCCCTATTGCCTGCAATTGGCAATCTAAACATTCAGTGAAAATCTTCATTAGCTTTGAATGAACCATGT
>CM000378.2/21366877-21367061 Equus caballus chromosome 2, whole genome shotgun sequence. AAAGCGTCTCAGCCTTTTGGCTAAGATCAAGTGTAGTATCTGTAGCTAGTCTATAACCTGATTGATATGTCCATTTTACCCCAATATCATACCATTATGATTACTGTGGCTTTATATAGCAAATCTTGAACTCAGGTAGTATAAATCCTCTAACTCTGTTCTTTGTCAAAATGGTCTTGGCTATT
>CM000378.2/56987690-56987788 Equus caballus chromosome 2, whole genome shotgun sequence. ATCGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTGAACGTCGGCGCCCTCGTGAGGAGGCACAGCCTCTCGTTCCCTGCTCCTACACTCCTT
>CM000378.2/18244103-18244249 Equus caballus chromosome 2, whole genome shotgun sequence. ATCGCTTCTCGGCCTTTTGGCTGAGATCAAGTGTAGAGCTTTGAATAGTATAATAATATTATTTTGATAGTAATAACAATAAACAATCGCTAGCATTAATGAGAGCTTAGTGTATGCCAGTCACCATGCTAAGTGCTCTAGATGCTT
>CM000370.1/74459482-74459563 Monodelphis domestica chromosome 3, whole genome shotgun sequence. ATCACTTCTCTGCCTTTTGGCTAAGATCAAGTGTAGTATCAATAGATGCAGAAAGAGCTTTTGACAAAATACAACACCCATT
>CM000370.1/105243828-105243703 Monodelphis domestica chromosome 3, whole genome shotgun sequence. ATTGTTTCTTGGCCTTTTGGCTAAGATCAAGTGTAGAAATATTGTTAAATAATTACTTGTAAGATCTCGGAGAAACTAGAGAAGGTATTTATTGTACCTGGGAGTTTCCCATTCCTGGAACTCTCT
>CM000370.1/143474511-143474342 Monodelphis domestica chromosome 3, whole genome shotgun sequence. ATTGCTTCTCAACCTTTTGGCTAAGATCAAGTGTAGTATCTATATCCCAATGATGTTTGGGATACTTAGTATTTGGGCAGCTAGAACTCCTCTTCCTGAGTTAAAATCCAGCCAATCACTAGCTGTGTGGCCTTGGGTAAGTCACTTAACCCAGTTTGCCTCAGTTGTCT
>CM000371.1/104846407-104846597 Monodelphis domestica chromosome 4, whole genome shotgun sequence. ATCGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAATATCTGATACGTCCTCTATCCGAGGACAATATATTAAATGGATTTTTGAAGCAGGGAGTCGGAATAGGAGCTTGCTCCGTCCACTCCACGCATCGACCTGGTATTGCAGTACTTCCAGGAACGGTGCACCTCCC
>CM000371.1/104773987-104774177 Monodelphis domestica chromosome 4, whole genome shotgun sequence. ATCGCTTCTCGGCCTTTTGGCTAAGATCAAGTGTAGTATCTGTTCTTATCAGTTTAATATCTGATACGTCCTCTATCCGAGGACAATATATTAAATGGATTTTTGAAACAGGGAGTCGGAATAGGAGCTTGCTCCGTCCACTCCACGCATCGACCTGGTATTGCAGTACTTCCAGGAACGGTGCACTTCCC
>BK006936.2/681858-681747 TPA: Saccharomyces cerevisiae S288c chromosome II, complete sequence. ATCTCTTTGCCTTTTGGCTTAGATCAAGTGTAGTATCTGTTCTTTTCAGTGTAACAACTGAAATGACCTCAATGAGGCTCATTACCTTTTAATTTGTTACAATACACATTTT
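I want to find the lines in input2 that match the lines in input1 and count them, so that I get the total number of times each line of input1 occurs in input2.
Example output: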
Equus caballus 10
Monodelphis domestica 5
Saccharomyces cerevisiae S288c 1
and so on.
I use this to extract the lines of input2 that match a line of input1:
grep -Fwf input1 input2
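How can I count how many times each line of input1 occurs in input2?
You can do it this way:
This produces the output backwards from the way you want it, but if you need them in order, you can do: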
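One way to picture the count-then-reorder idea is the following minimal Python sketch. It is not the command from this answer; it assumes that "backwards" refers to the count being printed before the name (as a `sort | uniq -c` style tally would do), and the function name `count_names` and the shortened sample text are made up for illustration. It tallies every occurrence of any name and then reports the tallies in the order of input1, name first.

```python
import re
from collections import Counter

def count_names(names, text):
    """Count how often each name occurs in text, reported in the order of names."""
    # Try longer names first so a name that is a prefix of another cannot steal its matches.
    ordered = sorted(set(names), key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in ordered))
    tally = Counter(m.group(0) for m in pattern.finditer(text))
    # Report in input order; a Counter yields 0 for names that never occur.
    return [(name, tally[name]) for name in names]

# Hypothetical, shortened stand-ins for input1 and input2.
names = ["Equus caballus", "Monodelphis domestica", "Saccharomyces cerevisiae S288c"]
text = "\n".join([
    ">CM000377.2/60448635-60448529 Equus caballus chromosome 1, whole genome shotgun sequence.",
    ">CM000378.2/50070490-50070688 Equus caballus chromosome 2, whole genome shotgun sequence.",
    ">CM000370.1/74459482-74459563 Monodelphis domestica chromosome 3, whole genome shotgun sequence.",
])

for name, n in count_names(names, text):
    print(name, n)
```

Unlike a pure grep pipeline, this also lists names with zero occurrences, which may or may not be what you want.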
You can do this in Awk using arrays:
(It could be done with a single array, but IMHO it is cleaner to use separate arrays for the index set and for the counts.)
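For comparison, the single-structure variant mentioned in the parenthesis is easy to sketch in a language whose associative arrays keep insertion order. The example below is Python, not the Awk code from this answer; in Awk the traversal order of `for (k in arr)` is unspecified, which is one reason to keep a separate index array there. The function name `count_matching_lines` and the sample lines are illustrative assumptions.

```python
def count_matching_lines(names, lines):
    """Return {name: number of lines containing name}, keyed in the order of names."""
    # A Python dict keeps insertion order (3.7+), so one dict serves as
    # both the index set and the counter.
    counts = dict.fromkeys(names, 0)
    for line in lines:
        for name in counts:
            if name in line:  # plain substring test
                counts[name] += 1
    return counts

# Hypothetical, shortened stand-ins for input1 and input2.
names = ["Equus caballus", "Monodelphis domestica", "Saccharomyces cerevisiae S288c"]
lines = [
    ">CM000377.2/60448635-60448529 Equus caballus chromosome 1, whole genome shotgun sequence.",
    ">CM000370.1/74459482-74459563 Monodelphis domestica chromosome 3, whole genome shotgun sequence.",
    ">BK006936.2/681858-681747 TPA: Saccharomyces cerevisiae S288c chromosome II, complete sequence.",
]

for name, n in count_matching_lines(names, lines).items():
    print(name, n)
```

Note that this counts matching lines, not occurrences: a name appearing twice on one line is counted once.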
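Here is a Python script that does what you want:
You can run it like this:
On the sample data you gave, it produces the following output:
Here is a Bash script that does the same thing:
Output: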
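Explanation:
-n makes Perl read the input files line by line without printing anything unless asked to. -l makes RS = ORS = \n.
Data structures involved:
%h: a hash whose keys are the genes read from file1.
@h: an array holding the genes (without duplicates) in the order in which they were encountered while reading file1.
%s: a hash whose keys are the genes and whose values are the number of times each gene occurred in file2.
How it works:
@ARGV holds one element (the name of file2, which is still to be opened) while the first argument (file1) is being read, and is empty while the second argument (file2) is being read. The first line of the code therefore applies only to file1 and populates the hash %h and the array @h.
The rest applies to file2 and updates the hash %s with the number of times a particular gene is found on a given line. The index(str, substr) function returns the position of the substring within the string if it is found, and -1 on failure.
Finally, the contents of the hash %s are printed in the key order set by the array @h.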
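The roles of the three data structures can be illustrated with a rough Python analogue. This is a sketch of the logic described above, not a translation of the actual Perl one-liner; the names `count_genes`, `seen`, `order` and `counts` are made up here, and skipping blank lines and starting every count at 0 are assumptions of the sketch.

```python
def count_genes(file1_lines, file2_lines):
    """Mirror the %h / @h / %s bookkeeping described in the explanation."""
    seen = set()   # role of %h: genes already read from file1
    order = []     # role of @h: genes in first-seen order, no duplicates
    counts = {}    # role of %s: gene -> number of file2 lines containing it

    for raw in file1_lines:
        gene = raw.rstrip("\n")  # like -l, drop the trailing newline
        if gene and gene not in seen:  # assumption: blank lines are ignored
            seen.add(gene)
            order.append(gene)
            counts[gene] = 0

    for line in file2_lines:
        for gene in order:
            # str.find, like Perl's index(), returns -1 when the substring is absent.
            if line.find(gene) != -1:
                counts[gene] += 1

    # Print order is dictated by `order`, just as @h dictates it in the Perl version.
    return [(gene, counts[gene]) for gene in order]

# Hypothetical, shortened stand-ins for file1 and file2.
file1 = ["Equus caballus\n", "Monodelphis domestica\n", "Equus caballus\n"]
file2 = [
    ">CM000377.2/60448635-60448529 Equus caballus chromosome 1, whole genome shotgun sequence.\n",
    ">CM000378.2/50070490-50070688 Equus caballus chromosome 2, whole genome shotgun sequence.\n",
    ">CM000370.1/74459482-74459563 Monodelphis domestica chromosome 3, whole genome shotgun sequence.\n",
]

for gene, n in count_genes(file1, file2):
    print(gene, n)
```

The duplicate "Equus caballus" line in the sample file1 is there to show why the set and the ordered list are kept separately: the set makes the duplicate check cheap, while the list preserves the output order.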