加载中...
预测前噬菌体prophages
发表于:2021-10-08 |
字数统计: 1.2k | 阅读时长: 6分钟 | 阅读量:

本文介绍如何从细菌 / 古菌中预测前噬菌体。

软件选择

软件安装

安装主程序

  • 推荐使用 conda

    conda install phispy
    conda install perl-bioperl

下载模型

wget https://ftp.ncbi.nlm.nih.gov/pub/kristensen/pVOGs/downloads/All/AllvogHMMprofiles.tar.gz

tar -zxvf AllvogHMMprofiles.tar.gz

cat AllvogHMMprofiles/* > pVOGs.hmm

输入文件

运行软件

PhiSpy.py genbank_file -o output_directory --phage_genes 1 --color --threads 6 --phmms pVOGs.hmm --min_contig_size 5000 --output_choice 1
  • genbank file: 输入文件
  • output directory: 输出目录
  • --phage_genes: 区域内含有的被鉴定为噬菌体基因的最小数量。默认采用严格模式,即在每个前噬菌体区域必须鉴定得到两个或更多个 phage 基因。调高该参数的值将会减少预测到的前噬菌体的数量,反之,减小参数值,将会得到更多的移动元件。如果该参数设置为 0,将会预测 plasmids, integrons, and pathogenicity islands. Somewhat unexpectedly, it will also identify the ribosomal RNA operons as likely being mobile since they are unlike the host's backbone!
  • --color: 根据 CDs 的功能对其着色,使用 Artemis 查看
  • --threads: 线程数
  • --min_contig_size: 低于阈值的序列将被忽略,不对其进行预测
  • --output_choice: 控制输出的文件类型,详见官网

报错处理

  • 由于序列 ID 引发的错误

    • 错误信息如下:
    Traceback (most recent call last):
    File "/home/liu/miniconda3/envs/component/bin/PhiSpy.py", line 10, in <module>
      sys.exit(run())
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 122, in run
      main(sys.argv)
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/main.py", line 44, in main
      args_parser.record = PhiSpyModules.SeqioFilter(filter(lambda x: len(x.seq) > args_parser.min_contig_size, SeqIO.parse(handle, "genbank")))
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/PhiSpyModules/seqio_filter.py", line 33, in __init__
      for n, item in enumerate(content):
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/SeqIO/Interfaces.py", line 74, in __next__
      return next(self.records)
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 516, in parse_records
      record = self.parse(handle, do_features)
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 499, in parse
      if self.feed(handle, consumer, do_features):
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 465, in feed
      self._feed_first_line(consumer, self.line)
    File "/home/liu/miniconda3/envs/component/lib/python3.7/site-packages/Bio/GenBank/Scanner.py", line 1571, in _feed_first_line
      raise ValueError("Did not recognise the LOCUS line layout:\n" + line)
    ValueError: Did not recognise the LOCUS line layout:
    LOCUS       NODE_52_length_15591_cov_14.37480715591 bp
    • 解决方案
      • 将 sequence ID 缩短

批处理并整合结果

撰写脚本 “run_PhiSpy.pl”。

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
# Author: Liu hualin
# Date: Oct 12, 2021

my @gbk = glob("*.gbk");
foreach  (@gbk) {
	$_=~/(.+).gbk/;
	my $out = $1 . "_prophage";
	system("PhiSpy.py $_ -o $out --phage_genes 1 --min_contig_size 5000 --output_choice 1 --color --phmms pVOGs.hmm --threads 8");
}

open OUT, ">All.prophages.txt" || die;# print prophages informations
print OUT "Strain\tPhage ID\tContig ID\tStart location of the prophage\tStop location of the prophage\tStart of attL\tEnd of attL\tStart of attR\tEnd of attR\tsequence of attL\tSequence of attR\tWhy this att site was chosen for this prophage\n";
# attachment (att) sites
open SEQ, ">All.prophages.seq" || die;# print prophages sequences
print SEQ "Strain\tPhage ID\tContig ID\tGene Start\tGene End\tStrand\tAnnotation\tProtein sequences\n";
my @result = glob("*_prophage");
foreach  (@result) {
	$_=~/(.+)_prophage/;
	my $str = $1;
	my $prophage = $_ . "/prophage_coordinates.tsv";
	if (! -z $prophage) {
		open IN, "$prophage" || die;
		while (<IN>) {
			chomp;
			my @lines = split /\t/;
			my $contig = $lines[1];
			my $gbk = $str . ".gbk";
			my $seqin = Bio::SeqIO->new( -format => 'genbank', -file => "$gbk");#需要在gbk文件所在的目录中运行命令!
			while( (my $seq = $seqin->next_seq) ) {
				foreach my $sf ( $seq->get_SeqFeatures ) {
					if ($seq->display_name eq $contig) {
						if( $sf->primary_tag eq 'CDS' ) {
							#print SEQ $str, "\t", $lines[0], "\t", $seq->display_name, "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'), $seq->seq,"\n";# Also print contig sequences
							print SEQ $str, "\t", $lines[0], "\t", $seq->display_name, "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'),"\n";# Only print Protein sequences
						}
					}
				}
			}
			print OUT $str . "\t" . $_ . "\n";
		}
		close IN;
	}
}
close OUT;
close SEQ;

将 “run_PhiSpy.pl” 和后缀名为 “.gbk” 的 GenBank 格式的文件,以及 “pVOGs.hmm” 放在同一目录下,运行如下命令:

perl run_PhiSpy.pl

得到两个汇总文件:

  • All.prophages.txt: 包含 prophage 信息的文件
  • All.prophages.seq: 包含 prophage 序列的文件

All.prophages.txt

All.prophages.seq

脚本获取

关注公众号 “生信之巅”,聊天窗口回复 “53f4” 获取下载链接。

生信之巅微信公众号生信之巅小程序码

敬告:使用文中脚本请引用本文网址,请尊重本人的劳动成果,谢谢!Notice: When you use the scripts in this article, please cite the link of this webpage. Thank you!

参考

上一篇:
原核生物基因岛预测
下一篇:
Hexo渲染异常
本文目录
本文目录