宏基因组测序分析过程中经常要去除宿主污染,比较好用的是
soap
,它要求输入双端测序文件,然而它输出合并了的双端文件中未比对上宿主的reads。
soap -a A57RS1T1_1.clean.fq.gz -b A57RS1T1_2.clean.fq.gz -D /public/home/jkyin/biosoft/database/soap2_solanum/GCF_000188115.4_SL3.0_genomic.fna.index -o output -2 unpaired -u unmapped.fa
#其中unmapped.fa就是我们需要的去除污染后的文件
像这样
>A00183:529:HWT25DSXX:3:1101:13612:1579
GACTGGGCCGACGGCGACGACTCGCCGGCCGCCTTCATCGAGGAGATTCCGCCCGCCGGCCGCCTGTTTTTCCGCCTGCAAACCCGCCCCGTCGCCATGCCCGCGAGCGCGCCCGAACACGCCGGCTTTTTTCTCACCTACGCCGACTTC
>A00183:529:HWT25DSXX:3:1101:13612:1579
ATCCACGGCCACGGGGTCGTGGTGGGCGGCGAGCAGGGCGTGGAATTTTTGGTGCGCGCCGGCGAAGAGCGTCTCGTTGTGCCAAAGGGTGTCGTCGGCGTCGAAGCCGAGGGCGAGCGGTGCGTCGGTCATGGGCGAGGAAGTGCCCAG
>A00183:529:HWT25DSXX:3:1101:14660:1579
TGGATGGGGAACAACCTGCCAAAAGCTGACCAGTTGTTCGCGGGCGAGCGTTGGGTACTGTGCGAATCGGCCGGGAATGTCGCGGGTACGAGTTGCACTGGTGACACGATTGGTAAGGCACAGTTGATCGCTGATCTCAGCCAGCGCTAC
>A00183:529:HWT25DSXX:3:1101:14660:1579
TATTCTGCGATGCCAGCCAAAACAGACCGAGAATCGGCGACTGTGTTCCAGTCTGCAAGCTCAACTTCTTAGCAGCATCTTTTAGATCGGCGTAGCGCAACACCTTTGCGTTCTTGAAGTACGTTCGCCACTGCGCGATGAAATCGTTCA
>A00183:529:HWT25DSXX:3:1101:14733:1579
GGTGACGCCAGCTCCTTCATGGCGGCGGAGACCGCGCGCGGATCCCCGTCGCCCACGCGCGTCATCGCCAGCGCCATCAACCCCAGGTAGTCGGCGAAGGCGCTCAAGCGCGGCTCGGGCACCAACGACTCACCGAAACGAGCGTAGTAG
>A00183:529:HWT25DSXX:3:1101:14733:1579
TGATATTCCTGACCAGCGCCAACCTCTCCGCGCTGCTCGATACGGTCTTCTGGACGGAGCGCGCCGGCTACACCGGCCTGTATTATTTCGCCCACGGCGCGTTCCTCGCCAGCGTGAGCGCGTCGATGCCGGGCTCGCTCGTCCCGCAGC
>A00183:529:HWT25DSXX:3:1101:15076:1579
GGCGAACCCCTACCTCGCTTTCGCCGCGCTCCTGATGGCGGGCCTCGACGGCGTGCAGAACAAGATCCATCCCGGCGACCCGCTCGACAAGAACCTCTACGACCACCCGGCCGAGGAAGCCGCGAAGGTGCCGCATCCGTGCGCATCGCT
>A00183:529:HWT25DSXX:3:1101:15076:1579
AATACATGTCGAATTCGATCGGGTGCGTGGTCATGCGGAAGCGCTGCACCTCCTCGTACTTGAGCGCGATGTACGCATCGAGCATGTCGTTGGTGAAGACGCCGCCGCGCGTGAGGTACTCGCGATCGCGGTCGAGGTTCTCGAGCGCCT
>A00183:529:HWT25DSXX:3:1101:15112:1579
GGCGTCGTGGATCTGGGTGCCGACGCCGCCGATGATGTGGTCAGGCGCGGGGAGTTCGCCGGAGGCGATGAGGCTGCGGGTGTCGTCGAGGAGGCGGCCTGAGTTATAGACGAGGAGCGGGCGGGTGTCGGCGGGGAGGGCGGCCCAGGC
>A00183:529:HWT25DSXX:3:1101:15112:1579
GAGCGACCAAAAACCTTCCACCTCCGCGCCCACCCCCCACCCCCCCATCCGCCTCTTCAGCACCGATCTTGACGGCACCCACCTCGGCAACCACGAGGCCACCCAACCCATCGCCACCGCCAGGGCCGCCCAACCCCCCGACACCCCCCC
但是,后面你再组装基因组时,大部分软件的输入就需要双端,Google了,有人写拆分合并后的fastq文件的脚本,找了很久也没找到拆分上面这种格式的fasta的脚本。那我自己写吧。。。
- 有觉得这段代码的有用的随便拿去,不用联系本人,点个赞也行吧O(∩_∩)O
# -*- coding: utf-8 -*-
'''
This is a script for spliting sequence from a fasta file
'''
import sys
def usage():
print('Usage: python3 split_merged_fa.py [fasta_file] [pair1_file] [pair2_file]')
def main():
pair1 = open(sys.argv[2],"w")
pair2 = open(sys.argv[3],"w")
dic1 = {}
dic2 = {}
read_num = 0
fi = open(sys.argv[1],"r")
fi_all = fi.readlines()
fi.seek(0) #move index to top, or you'll get nothing
for line in fi:
if line.startswith(">"):
read_num += 1
line_new = line.strip()
seq_line = 2*read_num - 1
if read_num % 2 == 0:
dic2[line_new] = fi_all[seq_line].strip()
else:
dic1[line_new] = fi_all[seq_line].strip()
for p1 in dic1:
pair1.write(p1 + "\n")
pair1.write(dic1[p1] + "\n")
for p2 in dic2:
pair2.write(p2 + "\n")
pair2.write(dic2[p2] + "\n")
pair1.close()
pair2.close()
fi.close()
try:
main()
except IndexError:
usage()
测试一下
(base) [jkyin@mn02 test]$ time python split_merged_fa.py split_merged_fa_test.fa split_1.fa split_2.fa
- split_merged_fa_test.fa长这样
>A00183:529:HWT25DSXX:3:1101:13612:1579
GACTGGGCCGACGGCGACGACTCGCCGGCCGCCTTCATCGAGGAGATTCCGCCCGCCGGCCGCCTGTTTTTCCGCCTGCAAACCCGCCCCGTCGCCATGCCCGCGAGCGCGCCCGAACACGCCGGCTTTTTTCTCACCTACGCCGACTTC
>A00183:529:HWT25DSXX:3:1101:13612:1579
ATCCACGGCCACGGGGTCGTGGTGGGCGGCGAGCAGGGCGTGGAATTTTTGGTGCGCGCCGGCGAAGAGCGTCTCGTTGTGCCAAAGGGTGTCGTCGGCGTCGAAGCCGAGGGCGAGCGGTGCGTCGGTCATGGGCGAGGAAGTGCCCAG
>A00183:529:HWT25DSXX:3:1101:14660:1579
TGGATGGGGAACAACCTGCCAAAAGCTGACCAGTTGTTCGCGGGCGAGCGTTGGGTACTGTGCGAATCGGCCGGGAATGTCGCGGGTACGAGTTGCACTGGTGACACGATTGGTAAGGCACAGTTGATCGCTGATCTCAGCCAGCGCTAC
>A00183:529:HWT25DSXX:3:1101:14660:1579
TATTCTGCGATGCCAGCCAAAACAGACCGAGAATCGGCGACTGTGTTCCAGTCTGCAAGCTCAACTTCTTAGCAGCATCTTTTAGATCGGCGTAGCGCAACACCTTTGCGTTCTTGAAGTACGTTCGCCACTGCGCGATGAAATCGTTCA
>A00183:529:HWT25DSXX:3:1101:14733:1579
GGTGACGCCAGCTCCTTCATGGCGGCGGAGACCGCGCGCGGATCCCCGTCGCCCACGCGCGTCATCGCCAGCGCCATCAACCCCAGGTAGTCGGCGAAGGCGCTCAAGCGCGGCTCGGGCACCAACGACTCACCGAAACGAGCGTAGTAG
>A00183:529:HWT25DSXX:3:1101:14733:1579
TGATATTCCTGACCAGCGCCAACCTCTCCGCGCTGCTCGATACGGTCTTCTGGACGGAGCGCGCCGGCTACACCGGCCTGTATTATTTCGCCCACGGCGCGTTCCTCGCCAGCGTGAGCGCGTCGATGCCGGGCTCGCTCGTCCCGCAGC
>A00183:529:HWT25DSXX:3:1101:15076:1579
GGCGAACCCCTACCTCGCTTTCGCCGCGCTCCTGATGGCGGGCCTCGACGGCGTGCAGAACAAGATCCATCCCGGCGACCCGCTCGACAAGAACCTCTACGACCACCCGGCCGAGGAAGCCGCGAAGGTGCCGCATCCGTGCGCATCGCT
>A00183:529:HWT25DSXX:3:1101:15076:1579
AATACATGTCGAATTCGATCGGGTGCGTGGTCATGCGGAAGCGCTGCACCTCCTCGTACTTGAGCGCGATGTACGCATCGAGCATGTCGTTGGTGAAGACGCCGCCGCGCGTGAGGTACTCGCGATCGCGGTCGAGGTTCTCGAGCGCCT
>A00183:529:HWT25DSXX:3:1101:15112:1579
GGCGTCGTGGATCTGGGTGCCGACGCCGCCGATGATGTGGTCAGGCGCGGGGAGTTCGCCGGAGGCGATGAGGCTGCGGGTGTCGTCGAGGAGGCGGCCTGAGTTATAGACGAGGAGCGGGCGGGTGTCGGCGGGGAGGGCGGCCCAGGC
>A00183:529:HWT25DSXX:3:1101:15112:1579
GAGCGACCAAAAACCTTCCACCTCCGCGCCCACCCCCCACCCCCCCATCCGCCTCTTCAGCACCGATCTTGACGGCACCCACCTCGGCAACCACGAGGCCACCCAACCCATCGCCACCGCCAGGGCCGCCCAACCCCCCGACACCCCCCC
- split_1.fa长这样
>A00183:529:HWT25DSXX:3:1101:13612:1579
GACTGGGCCGACGGCGACGACTCGCCGGCCGCCTTCATCGAGGAGATTCCGCCCGCCGGCCGCCTGTTTTTCCGCCTGCAAACCCGCCCCGTCGCCATGCCCGCGAGCGCGCCCGAACACGCCGGCTTTTTTCTCACCTACGCCGACTTC
>A00183:529:HWT25DSXX:3:1101:14660:1579
TGGATGGGGAACAACCTGCCAAAAGCTGACCAGTTGTTCGCGGGCGAGCGTTGGGTACTGTGCGAATCGGCCGGGAATGTCGCGGGTACGAGTTGCACTGGTGACACGATTGGTAAGGCACAGTTGATCGCTGATCTCAGCCAGCGCTAC
>A00183:529:HWT25DSXX:3:1101:14733:1579
GGTGACGCCAGCTCCTTCATGGCGGCGGAGACCGCGCGCGGATCCCCGTCGCCCACGCGCGTCATCGCCAGCGCCATCAACCCCAGGTAGTCGGCGAAGGCGCTCAAGCGCGGCTCGGGCACCAACGACTCACCGAAACGAGCGTAGTAG
>A00183:529:HWT25DSXX:3:1101:15076:1579
GGCGAACCCCTACCTCGCTTTCGCCGCGCTCCTGATGGCGGGCCTCGACGGCGTGCAGAACAAGATCCATCCCGGCGACCCGCTCGACAAGAACCTCTACGACCACCCGGCCGAGGAAGCCGCGAAGGTGCCGCATCCGTGCGCATCGCT
>A00183:529:HWT25DSXX:3:1101:15112:1579
GGCGTCGTGGATCTGGGTGCCGACGCCGCCGATGATGTGGTCAGGCGCGGGGAGTTCGCCGGAGGCGATGAGGCTGCGGGTGTCGTCGAGGAGGCGGCCTGAGTTATAGACGAGGAGCGGGCGGGTGTCGGCGGGGAGGGCGGCCCAGGC
- split_2.fa长这样
>A00183:529:HWT25DSXX:3:1101:13612:1579
ATCCACGGCCACGGGGTCGTGGTGGGCGGCGAGCAGGGCGTGGAATTTTTGGTGCGCGCCGGCGAAGAGCGTCTCGTTGTGCCAAAGGGTGTCGTCGGCGTCGAAGCCGAGGGCGAGCGGTGCGTCGGTCATGGGCGAGGAAGTGCCCAG
>A00183:529:HWT25DSXX:3:1101:14660:1579
TATTCTGCGATGCCAGCCAAAACAGACCGAGAATCGGCGACTGTGTTCCAGTCTGCAAGCTCAACTTCTTAGCAGCATCTTTTAGATCGGCGTAGCGCAACACCTTTGCGTTCTTGAAGTACGTTCGCCACTGCGCGATGAAATCGTTCA
>A00183:529:HWT25DSXX:3:1101:14733:1579
TGATATTCCTGACCAGCGCCAACCTCTCCGCGCTGCTCGATACGGTCTTCTGGACGGAGCGCGCCGGCTACACCGGCCTGTATTATTTCGCCCACGGCGCGTTCCTCGCCAGCGTGAGCGCGTCGATGCCGGGCTCGCTCGTCCCGCAGC
>A00183:529:HWT25DSXX:3:1101:15076:1579
AATACATGTCGAATTCGATCGGGTGCGTGGTCATGCGGAAGCGCTGCACCTCCTCGTACTTGAGCGCGATGTACGCATCGAGCATGTCGTTGGTGAAGACGCCGCCGCGCGTGAGGTACTCGCGATCGCGGTCGAGGTTCTCGAGCGCCT
>A00183:529:HWT25DSXX:3:1101:15112:1579
GAGCGACCAAAAACCTTCCACCTCCGCGCCCACCCCCCACCCCCCCATCCGCCTCTTCAGCACCGATCTTGACGGCACCCACCTCGGCAACCACGAGGCCACCCAACCCATCGCCACCGCCAGGGCCGCCCAACCCCCCGACACCCCCCC
再试一个大文件:合并后19Gb,跑起来比较占内存,但也还算快~8min CPU时。建议没有足够RAM的拆分后再跑,然后合并
拆分教程参考:
http://www.biostack.org/?p=504
python ./split_merged_fa.py HF12RS1T2C.fa HF12RS1T2C_R1.fa HF12RS1T2C_R2.fa
Resource usage summary:
CPU time : 488.69 sec.
Max Memory : 67597 MB
Average Memory : 45730.43 MB
Total Requested Memory : 10000.00 MB
Delta Memory : -57597.00 MB
Max Swap : -
Max Processes : 4
Max Threads : 5
Run time : 1202 sec.
Turnaround time : 1302 sec.