软件 (Software needed)

安装 (Installation)

1 2	conda install islandpath conda install perl-bioperl

输入文件 (Input Files)

GenBank (.gbk) or an embl (.embl) file
注意: 输入文件中只允许包含一条序列，否则会报错！(Please make sure you are running islandpath on a genbank file with only one contig)。如果一个文件中含有多个序列，那么就要将其分割成为多个文件，然后逐个作为输入文件进行预测。切割方法见我的另一篇文章按照Contig切割GenBank文件。

运行软件 (Run)

常规运行

# gbk file
islandpath example/NC_003210.gbk NC_003210_GIs.txt

# embl file
islandpath example/NC_000913.embl NC_000913_GIs.txt

输出结果如下图所示：

批处理

在得到大量基因组的时候，手动提交并不像打游戏那样让人渴望敲击键盘和鼠标，为了避免烦躁，我们来写脚本“run_islandpath.pl”，然后让机器做剩下的事情。该脚本可以实现GenBank文件的切割，基因岛预测，以及结果的整合，实现了IslandPath-DIMOB所无法完成的分析。

#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
# Author: Liu hualin
# Date: Oct 8, 2021

# Split GenBank files
my @gbk = glob("*.gbk");# 批处理所有后缀为.gbk的文件
foreach  (@gbk) {
	$_=~/(.+).gbk/;
	my $str = $1;
	open IN, "$_" || die;
	local $/ = "LOCUS";
	<IN>;
	while (<IN>) {
		chomp;
		$_=~/\s+(\S+)/;
		my $scaf = $1;
		my $out = $str . "_" . $scaf . ".gb";
		my $assession = $str . "_" . $scaf;
		$_=~s/ACCESSION.+/ACCESSION   $assession/g;# 添加ACCESSION number
		open OUT, ">$out" || die;
		print OUT "LOCUS$_";
		close OUT;
	}
	close IN;
}


# predict Gene Islands
$/ = "\n";
my @gb = glob("*.gb");
foreach  (@gb) {
	$_=~/(.+).gb/;
	my $out = $1 . ".island";
	system("islandpath $_ $out");
}

# Get features from GenBank files
foreach my $gb (@gb) {
	$gb=~/(.+).gb/;
	my $str = $1;
	my $out = $str . ".list";
	my $seqin = Bio::SeqIO->new( -format => 'genbank', -file => "$gb");
	open OUT, ">$out" || die;
	while( (my $seq = $seqin->next_seq) ) {
		foreach my $sf ( $seq->get_SeqFeatures ) {
			if( $sf->primary_tag eq 'CDS' ) {
				my @tags = $sf ->get_all_tags();
				#print join("\t", @tags) . "\n";
				print OUT $sf->get_tag_values('locus_tag'), "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'),"\n";
			}
		}
	}
}

# Parser the results
my (%hash, %gi, %list, %gif);
open OUT, ">All_island.txt" || die;
print OUT "Sequence IDs	Predictor	Category	GI Start	GI End	.	Strand	.	Island IDs	Gene IDs	Gene Start	Gene End	Strand	Products	Protein Sequences\n";
open LIST, ">All_island.list" || die;
print LIST "Island IDs\tGI Start\tGI End\tGI Length\tGene Number\tGene IDs\n";
my @GI = glob("*.island");
foreach  (@GI) {
	$_=~/(.+).island/;
	my $list = $1 . ".list";
	open IN, "$_" || die;
	<IN>;
	while (<IN>) {
		chomp;
		$_=~s/[\r\n]+//g;
		my @lines = split /\t/;
		my $start = $lines[3];
		my $end = $lines[4];
		my $id = $lines[-1];
		my $gilen = $end - $start + 1;
		$hash{$id} = $start . "-" . $end . "-" . $id;
		$gi{$id} = $_;
		$gif{$id} = $start . "\t" . $end . "\t" . $gilen;
	}
	close IN;
	open GB, "$list" || die;
	while (<GB>) {
		chomp;
		my @line = split /\t/;
		foreach my $ids (sort keys %hash) {
			my ($start2, $end2, $gi) = split /\-/, $hash{$ids};
			if (($line[1] >= $start2) && ($line[2] <= $end2)) {
				print OUT $gi{$gi} . "\t" . $_ . "\n";
				push @{$list{$gi}}, $line[0];
			}
		}
	}
	close GB;
}
close OUT;

foreach  (sort keys %list) {
	print LIST $_ . "\t" . $gif{$_} . "\t" . @{$list{$_}} . "\t" . join (" ", @{$list{$_}}) . "\n";
}
close LIST;

将“run_islandpath.pl”与GenBank文件放在同一目录下，在终端里运行如下命令：

1	perl run_islandpath.pl

结果汇总于All_island.txt和All_island.list中，内容如下面二图所示。

脚本获取

关注公众号“生信之巅”，聊天窗口回复“5324”获取下载链接。

敬告：使用文中脚本请引用本文网址，请尊重本人的劳动成果，谢谢！Notice: When you use the scripts in this article, please cite the link of this webpage. Thank you!

了尘兰若的小坑

分享微生物生物信息学分析方法，欢迎加入QQ群交流945751012，不接受群内广告！