当前位置: 首页>后端>正文

拆分soap比对过滤宿主污染后合并的无污染的fasta

宏基因组测序分析过程中经常要去除宿主污染,比较好用的是soap,它要求输入双端测序文件,然而它输出合并了的双端文件中未比对上宿主的reads。

soap -a A57RS1T1_1.clean.fq.gz -b A57RS1T1_2.clean.fq.gz -D /public/home/jkyin/biosoft/database/soap2_solanum/GCF_000188115.4_SL3.0_genomic.fna.index -o output -2 unpaired -u unmapped.fa
#其中unmapped.fa就是我们需要的去除污染后的文件

像这样

>A00183:529:HWT25DSXX:3:1101:13612:1579
GACTGGGCCGACGGCGACGACTCGCCGGCCGCCTTCATCGAGGAGATTCCGCCCGCCGGCCGCCTGTTTTTCCGCCTGCAAACCCGCCCCGTCGCCATGCCCGCGAGCGCGCCCGAACACGCCGGCTTTTTTCTCACCTACGCCGACTTC
>A00183:529:HWT25DSXX:3:1101:13612:1579
ATCCACGGCCACGGGGTCGTGGTGGGCGGCGAGCAGGGCGTGGAATTTTTGGTGCGCGCCGGCGAAGAGCGTCTCGTTGTGCCAAAGGGTGTCGTCGGCGTCGAAGCCGAGGGCGAGCGGTGCGTCGGTCATGGGCGAGGAAGTGCCCAG
>A00183:529:HWT25DSXX:3:1101:14660:1579
TGGATGGGGAACAACCTGCCAAAAGCTGACCAGTTGTTCGCGGGCGAGCGTTGGGTACTGTGCGAATCGGCCGGGAATGTCGCGGGTACGAGTTGCACTGGTGACACGATTGGTAAGGCACAGTTGATCGCTGATCTCAGCCAGCGCTAC
>A00183:529:HWT25DSXX:3:1101:14660:1579
TATTCTGCGATGCCAGCCAAAACAGACCGAGAATCGGCGACTGTGTTCCAGTCTGCAAGCTCAACTTCTTAGCAGCATCTTTTAGATCGGCGTAGCGCAACACCTTTGCGTTCTTGAAGTACGTTCGCCACTGCGCGATGAAATCGTTCA
>A00183:529:HWT25DSXX:3:1101:14733:1579
GGTGACGCCAGCTCCTTCATGGCGGCGGAGACCGCGCGCGGATCCCCGTCGCCCACGCGCGTCATCGCCAGCGCCATCAACCCCAGGTAGTCGGCGAAGGCGCTCAAGCGCGGCTCGGGCACCAACGACTCACCGAAACGAGCGTAGTAG
>A00183:529:HWT25DSXX:3:1101:14733:1579
TGATATTCCTGACCAGCGCCAACCTCTCCGCGCTGCTCGATACGGTCTTCTGGACGGAGCGCGCCGGCTACACCGGCCTGTATTATTTCGCCCACGGCGCGTTCCTCGCCAGCGTGAGCGCGTCGATGCCGGGCTCGCTCGTCCCGCAGC
>A00183:529:HWT25DSXX:3:1101:15076:1579
GGCGAACCCCTACCTCGCTTTCGCCGCGCTCCTGATGGCGGGCCTCGACGGCGTGCAGAACAAGATCCATCCCGGCGACCCGCTCGACAAGAACCTCTACGACCACCCGGCCGAGGAAGCCGCGAAGGTGCCGCATCCGTGCGCATCGCT
>A00183:529:HWT25DSXX:3:1101:15076:1579
AATACATGTCGAATTCGATCGGGTGCGTGGTCATGCGGAAGCGCTGCACCTCCTCGTACTTGAGCGCGATGTACGCATCGAGCATGTCGTTGGTGAAGACGCCGCCGCGCGTGAGGTACTCGCGATCGCGGTCGAGGTTCTCGAGCGCCT
>A00183:529:HWT25DSXX:3:1101:15112:1579
GGCGTCGTGGATCTGGGTGCCGACGCCGCCGATGATGTGGTCAGGCGCGGGGAGTTCGCCGGAGGCGATGAGGCTGCGGGTGTCGTCGAGGAGGCGGCCTGAGTTATAGACGAGGAGCGGGCGGGTGTCGGCGGGGAGGGCGGCCCAGGC
>A00183:529:HWT25DSXX:3:1101:15112:1579
GAGCGACCAAAAACCTTCCACCTCCGCGCCCACCCCCCACCCCCCCATCCGCCTCTTCAGCACCGATCTTGACGGCACCCACCTCGGCAACCACGAGGCCACCCAACCCATCGCCACCGCCAGGGCCGCCCAACCCCCCGACACCCCCCC

但是,后面你再组装基因组时,大部分软件的输入就需要双端,Google了,有人写拆分合并后的fastq文件的脚本,找了很久也没找到拆分上面这种格式的fasta的脚本。那我自己写吧。。。

  • 有觉得这段代码的有用的随便拿去,不用联系本人,点个赞也行吧O(∩_∩)O
# -*- coding: utf-8 -*-
'''
This is a script for spliting sequence from a fasta file
'''
import sys
def usage():
        print('Usage: python3 split_merged_fa.py [fasta_file] [pair1_file] [pair2_file]')
def main():
        pair1 = open(sys.argv[2],"w")
        pair2 = open(sys.argv[3],"w")
        dic1 = {}
        dic2 = {}
        read_num = 0
        fi = open(sys.argv[1],"r")
        fi_all = fi.readlines()
        fi.seek(0) #move index to top, or you'll get nothing
        for line in fi:
                if line.startswith(">"):
                        read_num += 1
                        line_new = line.strip()
                        seq_line = 2*read_num - 1
                        if read_num % 2 == 0:
                                dic2[line_new] = fi_all[seq_line].strip()
                        else:
                                dic1[line_new] = fi_all[seq_line].strip()
        for p1 in dic1:
                pair1.write(p1 + "\n")
                pair1.write(dic1[p1] + "\n")
        for p2 in dic2:
                pair2.write(p2 + "\n")
                pair2.write(dic2[p2] + "\n")
        pair1.close()
        pair2.close()
        fi.close()
try:
        main()
except  IndexError:
        usage()

测试一下

(base) [jkyin@mn02 test]$ time python split_merged_fa.py split_merged_fa_test.fa split_1.fa split_2.fa
  • split_merged_fa_test.fa长这样
>A00183:529:HWT25DSXX:3:1101:13612:1579
GACTGGGCCGACGGCGACGACTCGCCGGCCGCCTTCATCGAGGAGATTCCGCCCGCCGGCCGCCTGTTTTTCCGCCTGCAAACCCGCCCCGTCGCCATGCCCGCGAGCGCGCCCGAACACGCCGGCTTTTTTCTCACCTACGCCGACTTC
>A00183:529:HWT25DSXX:3:1101:13612:1579
ATCCACGGCCACGGGGTCGTGGTGGGCGGCGAGCAGGGCGTGGAATTTTTGGTGCGCGCCGGCGAAGAGCGTCTCGTTGTGCCAAAGGGTGTCGTCGGCGTCGAAGCCGAGGGCGAGCGGTGCGTCGGTCATGGGCGAGGAAGTGCCCAG
>A00183:529:HWT25DSXX:3:1101:14660:1579
TGGATGGGGAACAACCTGCCAAAAGCTGACCAGTTGTTCGCGGGCGAGCGTTGGGTACTGTGCGAATCGGCCGGGAATGTCGCGGGTACGAGTTGCACTGGTGACACGATTGGTAAGGCACAGTTGATCGCTGATCTCAGCCAGCGCTAC
>A00183:529:HWT25DSXX:3:1101:14660:1579
TATTCTGCGATGCCAGCCAAAACAGACCGAGAATCGGCGACTGTGTTCCAGTCTGCAAGCTCAACTTCTTAGCAGCATCTTTTAGATCGGCGTAGCGCAACACCTTTGCGTTCTTGAAGTACGTTCGCCACTGCGCGATGAAATCGTTCA
>A00183:529:HWT25DSXX:3:1101:14733:1579
GGTGACGCCAGCTCCTTCATGGCGGCGGAGACCGCGCGCGGATCCCCGTCGCCCACGCGCGTCATCGCCAGCGCCATCAACCCCAGGTAGTCGGCGAAGGCGCTCAAGCGCGGCTCGGGCACCAACGACTCACCGAAACGAGCGTAGTAG
>A00183:529:HWT25DSXX:3:1101:14733:1579
TGATATTCCTGACCAGCGCCAACCTCTCCGCGCTGCTCGATACGGTCTTCTGGACGGAGCGCGCCGGCTACACCGGCCTGTATTATTTCGCCCACGGCGCGTTCCTCGCCAGCGTGAGCGCGTCGATGCCGGGCTCGCTCGTCCCGCAGC
>A00183:529:HWT25DSXX:3:1101:15076:1579
GGCGAACCCCTACCTCGCTTTCGCCGCGCTCCTGATGGCGGGCCTCGACGGCGTGCAGAACAAGATCCATCCCGGCGACCCGCTCGACAAGAACCTCTACGACCACCCGGCCGAGGAAGCCGCGAAGGTGCCGCATCCGTGCGCATCGCT
>A00183:529:HWT25DSXX:3:1101:15076:1579
AATACATGTCGAATTCGATCGGGTGCGTGGTCATGCGGAAGCGCTGCACCTCCTCGTACTTGAGCGCGATGTACGCATCGAGCATGTCGTTGGTGAAGACGCCGCCGCGCGTGAGGTACTCGCGATCGCGGTCGAGGTTCTCGAGCGCCT
>A00183:529:HWT25DSXX:3:1101:15112:1579
GGCGTCGTGGATCTGGGTGCCGACGCCGCCGATGATGTGGTCAGGCGCGGGGAGTTCGCCGGAGGCGATGAGGCTGCGGGTGTCGTCGAGGAGGCGGCCTGAGTTATAGACGAGGAGCGGGCGGGTGTCGGCGGGGAGGGCGGCCCAGGC
>A00183:529:HWT25DSXX:3:1101:15112:1579
GAGCGACCAAAAACCTTCCACCTCCGCGCCCACCCCCCACCCCCCCATCCGCCTCTTCAGCACCGATCTTGACGGCACCCACCTCGGCAACCACGAGGCCACCCAACCCATCGCCACCGCCAGGGCCGCCCAACCCCCCGACACCCCCCC
  • split_1.fa长这样
>A00183:529:HWT25DSXX:3:1101:13612:1579
GACTGGGCCGACGGCGACGACTCGCCGGCCGCCTTCATCGAGGAGATTCCGCCCGCCGGCCGCCTGTTTTTCCGCCTGCAAACCCGCCCCGTCGCCATGCCCGCGAGCGCGCCCGAACACGCCGGCTTTTTTCTCACCTACGCCGACTTC
>A00183:529:HWT25DSXX:3:1101:14660:1579
TGGATGGGGAACAACCTGCCAAAAGCTGACCAGTTGTTCGCGGGCGAGCGTTGGGTACTGTGCGAATCGGCCGGGAATGTCGCGGGTACGAGTTGCACTGGTGACACGATTGGTAAGGCACAGTTGATCGCTGATCTCAGCCAGCGCTAC
>A00183:529:HWT25DSXX:3:1101:14733:1579
GGTGACGCCAGCTCCTTCATGGCGGCGGAGACCGCGCGCGGATCCCCGTCGCCCACGCGCGTCATCGCCAGCGCCATCAACCCCAGGTAGTCGGCGAAGGCGCTCAAGCGCGGCTCGGGCACCAACGACTCACCGAAACGAGCGTAGTAG
>A00183:529:HWT25DSXX:3:1101:15076:1579
GGCGAACCCCTACCTCGCTTTCGCCGCGCTCCTGATGGCGGGCCTCGACGGCGTGCAGAACAAGATCCATCCCGGCGACCCGCTCGACAAGAACCTCTACGACCACCCGGCCGAGGAAGCCGCGAAGGTGCCGCATCCGTGCGCATCGCT
>A00183:529:HWT25DSXX:3:1101:15112:1579
GGCGTCGTGGATCTGGGTGCCGACGCCGCCGATGATGTGGTCAGGCGCGGGGAGTTCGCCGGAGGCGATGAGGCTGCGGGTGTCGTCGAGGAGGCGGCCTGAGTTATAGACGAGGAGCGGGCGGGTGTCGGCGGGGAGGGCGGCCCAGGC
  • split_2.fa长这样
>A00183:529:HWT25DSXX:3:1101:13612:1579
ATCCACGGCCACGGGGTCGTGGTGGGCGGCGAGCAGGGCGTGGAATTTTTGGTGCGCGCCGGCGAAGAGCGTCTCGTTGTGCCAAAGGGTGTCGTCGGCGTCGAAGCCGAGGGCGAGCGGTGCGTCGGTCATGGGCGAGGAAGTGCCCAG
>A00183:529:HWT25DSXX:3:1101:14660:1579
TATTCTGCGATGCCAGCCAAAACAGACCGAGAATCGGCGACTGTGTTCCAGTCTGCAAGCTCAACTTCTTAGCAGCATCTTTTAGATCGGCGTAGCGCAACACCTTTGCGTTCTTGAAGTACGTTCGCCACTGCGCGATGAAATCGTTCA
>A00183:529:HWT25DSXX:3:1101:14733:1579
TGATATTCCTGACCAGCGCCAACCTCTCCGCGCTGCTCGATACGGTCTTCTGGACGGAGCGCGCCGGCTACACCGGCCTGTATTATTTCGCCCACGGCGCGTTCCTCGCCAGCGTGAGCGCGTCGATGCCGGGCTCGCTCGTCCCGCAGC
>A00183:529:HWT25DSXX:3:1101:15076:1579
AATACATGTCGAATTCGATCGGGTGCGTGGTCATGCGGAAGCGCTGCACCTCCTCGTACTTGAGCGCGATGTACGCATCGAGCATGTCGTTGGTGAAGACGCCGCCGCGCGTGAGGTACTCGCGATCGCGGTCGAGGTTCTCGAGCGCCT
>A00183:529:HWT25DSXX:3:1101:15112:1579
GAGCGACCAAAAACCTTCCACCTCCGCGCCCACCCCCCACCCCCCCATCCGCCTCTTCAGCACCGATCTTGACGGCACCCACCTCGGCAACCACGAGGCCACCCAACCCATCGCCACCGCCAGGGCCGCCCAACCCCCCGACACCCCCCC

再试一个大文件:合并后19Gb,跑起来比较占内存,但也还算快~8min CPU时。建议没有足够RAM的拆分后再跑,然后合并

拆分教程参考:
http://www.biostack.org/?p=504

python ./split_merged_fa.py HF12RS1T2C.fa HF12RS1T2C_R1.fa HF12RS1T2C_R2.fa

Resource usage summary:

    CPU time :                                   488.69 sec.
    Max Memory :                                 67597 MB
    Average Memory :                             45730.43 MB
    Total Requested Memory :                     10000.00 MB
    Delta Memory :                               -57597.00 MB
    Max Swap :                                   -
    Max Processes :                              4
    Max Threads :                                5
    Run time :                                   1202 sec.
    Turnaround time :                            1302 sec.

OK,吃饭。


https://www.xamrdz.com/backend/3jw1872532.html

相关文章: