本文介绍如何从细菌 / 古菌中预测前噬菌体。
软件选择
软件安装
安装主程序
推荐使用 conda
conda install phispy conda install perl-bioperl
下载模型
wget https://ftp.ncbi.nlm.nih.gov/pub/kristensen/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz
tar -zxvf AllvogHMMprofiles.tar.gz
cat AllvogHMMprofiles/* > pVOGs.hmm
输入文件
- 需要 GenBank 格式的文件,可通过 RAST server 或 PROKKA 获得。
运行软件
PhiSpy.py genbank_file -o output_directory --phage_genes 1 --color --threads 6 --phmms pVOGs.hmm --min_contig_size 5000 --output_choice 1
- genbank file: 输入文件
- output directory: 输出目录
- --phage_genes: 区域内含有的被鉴定为噬菌体基因的最小数量。默认采用严格模式,即在每个前噬菌体区域必须鉴定得到两个或更多个 phage 基因。调高该参数的值将会减少预测到的前噬菌体的数量,反之,减小参数值,将会得到更多的移动元件。如果该参数设置为 0,将会预测 plasmids, integrons, and pathogenicity islands. Somewhat unexpectedly, it will also identify the ribosomal RNA operons as likely being mobile since they are unlike the host's backbone!
- --color: 根据 CDs 的功能对其着色,使用 Artemis 查看
- --threads: 线程数
- --min_contig_size: 低于阈值的序列将被忽略,不对其进行预测
- --output_choice: 控制输出的文件类型,详见官网
报错处理
由于序列 ID 引发的错误
- 错误信息如下:
Traceback (most recent call last): File "/home/liu/miniconda3/envs/component/bin/PhiSpy.py", line 10, in <module> sys.exit(run()) File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 122, in run main(sys.argv) File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 44, in main args_parser.record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > args_parser.min_contig_size, SeqIO.parse(handle, "genbank"))) File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/seqio_filter.py", line 33, in __init__ for n, item in enumerate(content): File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 74, in __next__ return next(self.records) File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records record = self.parse(handle, do_features) File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 499, in parse if self.feed(handle, consumer, do_features): File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 465, in feed self._feed_first_line(consumer, self.line) File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 1571, in _feed_first_line raise ValueError("Did not recognise the LOCUS line layout:\n" + line) ValueError: Did not recognise the LOCUS line layout: LOCUS NODE_52_length_15591_cov_14.37480715591 bp
- 解决方案
- 将 sequence ID 缩短
批处理并整合结果
撰写脚本 “run_PhiSpy.pl”。
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
# Author: Liu hualin
# Date: Oct 12, 2021
my @gbk = glob("*.gbk");
foreach (@gbk) {
$_=~/(.+).gbk/;
my $out = $1 . "_prophage";
system("PhiSpy.py $_ -o $out --phage_genes 1 --min_contig_size 5000 --output_choice 1 --color --phmms pVOGs.hmm --threads 8");
}
open OUT, ">All.prophages.txt" || die;# print prophages informations
print OUT "Strain\tPhage ID\tContig ID\tStart location of the prophage\tStop location of the prophage\tStart of attL\tEnd of attL\tStart of attR\tEnd of attR\tsequence of attL\tSequence of attR\tWhy this att site was chosen for this prophage\n";
# attachment (att) sites
open SEQ, ">All.prophages.seq" || die;# print prophages sequences
print SEQ "Strain\tPhage ID\tContig ID\tGene Start\tGene End\tStrand\tAnnotation\tProtein sequences\n";
my @result = glob("*_prophage");
foreach (@result) {
$_=~/(.+)_prophage/;
my $str = $1;
my $prophage = $_ . "/prophage_coordinates.tsv";
if (! -z $prophage) {
open IN, "$prophage" || die;
while (<IN>) {
chomp;
my @lines = split /\t/;
my $contig = $lines[1];
my $gbk = $str . ".gbk";
my $seqin = Bio::SeqIO->new( -format => 'genbank', -file => "$gbk");#需要在gbk文件所在的目录中运行命令!
while( (my $seq = $seqin->next_seq) ) {
foreach my $sf ( $seq->get_SeqFeatures ) {
if ($seq->display_name eq $contig) {
if( $sf->primary_tag eq 'CDS' ) {
#print SEQ $str, "\t", $lines[0], "\t", $seq->display_name, "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'), $seq->seq,"\n";# Also print contig sequences
print SEQ $str, "\t", $lines[0], "\t", $seq->display_name, "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'),"\n";# Only print Protein sequences
}
}
}
}
print OUT $str . "\t" . $_ . "\n";
}
close IN;
}
}
close OUT;
close SEQ;
将 “run_PhiSpy.pl” 和后缀名为 “.gbk” 的 GenBank 格式的文件,以及 “pVOGs.hmm” 放在同一目录下,运行如下命令:
perl run_PhiSpy.pl
得到两个汇总文件:
- All.prophages.txt: 包含 prophage 信息的文件
- All.prophages.seq: 包含 prophage 序列的文件
脚本获取
关注公众号 “生信之巅”,聊天窗口回复 “53f4” 获取下载链接。
![]() | ![]() |
敬告:使用文中脚本请引用本文网址,请尊重本人的劳动成果,谢谢!