软件 (Software needed)
安装 (Installation)
conda install islandpath
conda install perl-bioperl
输入文件 (Input Files)
- GenBank (.gbk) or an embl (.embl) file
- 注意: 输入文件中只允许包含一条序列,否则会报错!(Please make sure you are running islandpath on a genbank file with only one contig)。如果一个文件中含有多个序列,那么就要将其分割成为多个文件,然后逐个作为输入文件进行预测。切割方法见我的另一篇文章按照 Contig 切割 GenBank 文件。
运行软件 (Run)
常规运行
# gbk file
islandpath example/NC_003210.gbk NC_003210_GIs.txt
# embl file
islandpath example/NC_000913.embl NC_000913_GIs.txt
输出结果如下图所示:
批处理
在得到大量基因组的时候,手动提交并不像打游戏那样让人渴望敲击键盘和鼠标,为了避免烦躁,我们来写脚本 “run_islandpath.pl”,然后让机器做剩下的事情。该脚本可以实现 GenBank 文件的切割,基因岛预测,以及结果的整合,实现了 IslandPath-DIMOB 所无法完成的分析。
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
# Author: Liu hualin
# Date: Oct 8, 2021
# Split GenBank files
my @gbk = glob("*.gbk");# 批处理所有后缀为.gbk的文件
foreach (@gbk) {
$_=~/(.+).gbk/;
my $str = $1;
open IN, "$_" || die;
local $/ = "LOCUS";
<IN>;
while (<IN>) {
chomp;
$_=~/\s+(\S+)/;
my $scaf = $1;
my $out = $str . "_" . $scaf . ".gb";
my $assession = $str . "_" . $scaf;
$_=~s/ACCESSION.+/ACCESSION $assession/g;# 添加ACCESSION number
open OUT, ">$out" || die;
print OUT "LOCUS$_";
close OUT;
}
close IN;
}
# predict Gene Islands
$/ = "\n";
my @gb = glob("*.gb");
foreach (@gb) {
$_=~/(.+).gb/;
my $out = $1 . ".island";
system("islandpath $_ $out");
}
# Get features from GenBank files
foreach my $gb (@gb) {
$gb=~/(.+).gb/;
my $str = $1;
my $out = $str . ".list";
my $seqin = Bio::SeqIO->new( -format => 'genbank', -file => "$gb");
open OUT, ">$out" || die;
while( (my $seq = $seqin->next_seq) ) {
foreach my $sf ( $seq->get_SeqFeatures ) {
if( $sf->primary_tag eq 'CDS' ) {
my @tags = $sf ->get_all_tags();
#print join("\t", @tags) . "\n";
print OUT $sf->get_tag_values('locus_tag'), "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'),"\n";
}
}
}
}
# Parser the results
my (%hash, %gi, %list, %gif);
open OUT, ">All_island.txt" || die;
print OUT "Sequence IDs Predictor Category GI Start GI End . Strand . Island IDs Gene IDs Gene Start Gene End Strand Products Protein Sequences\n";
open LIST, ">All_island.list" || die;
print LIST "Island IDs\tGI Start\tGI End\tGI Length\tGene Number\tGene IDs\n";
my @GI = glob("*.island");
foreach (@GI) {
$_=~/(.+).island/;
my $list = $1 . ".list";
open IN, "$_" || die;
<IN>;
while (<IN>) {
chomp;
$_=~s/[\r\n]+//g;
my @lines = split /\t/;
my $start = $lines[3];
my $end = $lines[4];
my $id = $lines[-1];
my $gilen = $end - $start + 1;
$hash{$id} = $start . "-" . $end . "-" . $id;
$gi{$id} = $_;
$gif{$id} = $start . "\t" . $end . "\t" . $gilen;
}
close IN;
open GB, "$list" || die;
while (<GB>) {
chomp;
my @line = split /\t/;
foreach my $ids (sort keys %hash) {
my ($start2, $end2, $gi) = split /\-/, $hash{$ids};
if (($line[1] >= $start2) && ($line[2] <= $end2)) {
print OUT $gi{$gi} . "\t" . $_ . "\n";
push @{$list{$gi}}, $line[0];
}
}
}
close GB;
}
close OUT;
foreach (sort keys %list) {
print LIST $_ . "\t" . $gif{$_} . "\t" . @{$list{$_}} . "\t" . join (" ", @{$list{$_}}) . "\n";
}
close LIST;
将 “run_islandpath.pl” 与 GenBank 文件放在同一目录下,在终端里运行如下命令:
perl run_islandpath.pl
结果汇总于 All_island.txt 和 All_island.list 中,内容如下面二图所示。
脚本获取
关注公众号 “生信之巅”,聊天窗口回复 “5324” 获取下载链接。
![]() | ![]() |
敬告:使用文中脚本请引用本文网址,请尊重本人的劳动成果,谢谢!