原核生物基因岛预测
发表于:2021-10-08 | 分类: 生物信息
字数统计: 814 | 阅读时长: 4分钟 | 阅读量:

软件 (Software needed)

安装 (Installation)

conda install islandpath
conda install perl-bioperl

输入文件 (Input Files)

  • GenBank (.gbk) or an embl (.embl) file
  • 注意: 输入文件中只允许包含一条序列,否则会报错!(Please make sure you are running islandpath on a genbank file with only one contig)。如果一个文件中含有多个序列,那么就要将其分割成为多个文件,然后逐个作为输入文件进行预测。切割方法见我的另一篇文章按照 Contig 切割 GenBank 文件

运行软件 (Run)

常规运行

# gbk file
islandpath example/NC_003210.gbk NC_003210_GIs.txt

# embl file
islandpath example/NC_000913.embl NC_000913_GIs.txt

输出结果如下图所示:

示例输出结果展示

批处理

在得到大量基因组的时候,手动提交并不像打游戏那样让人渴望敲击键盘和鼠标,为了避免烦躁,我们来写脚本 “run_islandpath.pl”,然后让机器做剩下的事情。该脚本可以实现 GenBank 文件的切割,基因岛预测,以及结果的整合,实现了 IslandPath-DIMOB 所无法完成的分析。

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
# Author: Liu hualin
# Date: Oct 8, 2021

# Split GenBank files
my @gbk = glob("*.gbk");# 批处理所有后缀为.gbk的文件
foreach  (@gbk) {
	$_=~/(.+).gbk/;
	my $str = $1;
	open IN, "$_" || die;
	local $/ = "LOCUS";
	<IN>;
	while (<IN>) {
		chomp;
		$_=~/\s+(\S+)/;
		my $scaf = $1;
		my $out = $str . "_" . $scaf . ".gb";
		my $assession = $str . "_" . $scaf;
		$_=~s/ACCESSION.+/ACCESSION   $assession/g;# 添加ACCESSION number
		open OUT, ">$out" || die;
		print OUT "LOCUS$_";
		close OUT;
	}
	close IN;
}


# predict Gene Islands
$/ = "\n";
my @gb = glob("*.gb");
foreach  (@gb) {
	$_=~/(.+).gb/;
	my $out = $1 . ".island";
	system("islandpath $_ $out");
}

# Get features from GenBank files
foreach my $gb (@gb) {
	$gb=~/(.+).gb/;
	my $str = $1;
	my $out = $str . ".list";
	my $seqin = Bio::SeqIO->new( -format => 'genbank', -file => "$gb");
	open OUT, ">$out" || die;
	while( (my $seq = $seqin->next_seq) ) {
		foreach my $sf ( $seq->get_SeqFeatures ) {
			if( $sf->primary_tag eq 'CDS' ) {
				my @tags = $sf ->get_all_tags();
				#print join("\t", @tags) . "\n";
				print OUT $sf->get_tag_values('locus_tag'), "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'),"\n";
			}
		}
	}
}

# Parser the results
my (%hash, %gi, %list, %gif);
open OUT, ">All_island.txt" || die;
print OUT "Sequence IDs	Predictor	Category	GI Start	GI End	.	Strand	.	Island IDs	Gene IDs	Gene Start	Gene End	Strand	Products	Protein Sequences\n";
open LIST, ">All_island.list" || die;
print LIST "Island IDs\tGI Start\tGI End\tGI Length\tGene Number\tGene IDs\n";
my @GI = glob("*.island");
foreach  (@GI) {
	$_=~/(.+).island/;
	my $list = $1 . ".list";
	open IN, "$_" || die;
	<IN>;
	while (<IN>) {
		chomp;
		$_=~s/[\r\n]+//g;
		my @lines = split /\t/;
		my $start = $lines[3];
		my $end = $lines[4];
		my $id = $lines[-1];
		my $gilen = $end - $start + 1;
		$hash{$id} = $start . "-" . $end . "-" . $id;
		$gi{$id} = $_;
		$gif{$id} = $start . "\t" . $end . "\t" . $gilen;
	}
	close IN;
	open GB, "$list" || die;
	while (<GB>) {
		chomp;
		my @line = split /\t/;
		foreach my $ids (sort keys %hash) {
			my ($start2, $end2, $gi) = split /\-/, $hash{$ids};
			if (($line[1] >= $start2) && ($line[2] <= $end2)) {
				print OUT $gi{$gi} . "\t" . $_ . "\n";
				push @{$list{$gi}}, $line[0];
			}
		}
	}
	close GB;
}
close OUT;

foreach  (sort keys %list) {
	print LIST $_ . "\t" . $gif{$_} . "\t" . @{$list{$_}} . "\t" . join (" ", @{$list{$_}}) . "\n";
}
close LIST;

将 “run_islandpath.pl” 与 GenBank 文件放在同一目录下,在终端里运行如下命令:

perl run_islandpath.pl

结果汇总于 All_island.txt All_island.list 中,内容如下面二图所示。

All_island.txt

All_island.list

脚本获取

关注公众号 “生信之巅”,聊天窗口回复 “5324” 获取下载链接。

生信之巅微信公众号生信之巅小程序码

敬告:使用文中脚本请引用本文网址,请尊重本人的劳动成果,谢谢!Notice: When you use the scripts in this article, please cite the link of this webpage. Thank you!

上一篇:
按照Contig切割GenBank文件
下一篇:
预测前噬菌体prophages
本文目录
本文目录