原核生物基因岛预测
发表于:2021-10-08 | 分类: 生物信息
字数统计: 923 | 阅读时长: 4分钟 | 阅读量:

软件 (Software needed)

安装 (Installation)

1
2
conda install islandpath
conda install perl-bioperl

输入文件 (Input Files)

  • GenBank (.gbk) or an embl (.embl) file
  • 注意: 输入文件中只允许包含一条序列,否则会报错!(Please make sure you are running islandpath on a genbank file with only one contig)。如果一个文件中含有多个序列,那么就要将其分割成为多个文件,然后逐个作为输入文件进行预测。切割方法见我的另一篇文章按照Contig切割GenBank文件

运行软件 (Run)

常规运行

1
2
3
4
5
# gbk file
islandpath example/NC_003210.gbk NC_003210_GIs.txt

# embl file
islandpath example/NC_000913.embl NC_000913_GIs.txt

输出结果如下图所示:

示例输出结果展示

批处理

在得到大量基因组的时候,手动提交并不像打游戏那样让人渴望敲击键盘和鼠标,为了避免烦躁,我们来写脚本“run_islandpath.pl”,然后让机器做剩下的事情。该脚本可以实现GenBank文件的切割,基因岛预测,以及结果的整合,实现了IslandPath-DIMOB所无法完成的分析。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
# Author: Liu hualin
# Date: Oct 8, 2021

# Split GenBank files
my @gbk = glob("*.gbk");# 批处理所有后缀为.gbk的文件
foreach (@gbk) {
$_=~/(.+).gbk/;
my $str = $1;
open IN, "$_" || die;
local $/ = "LOCUS";
<IN>;
while (<IN>) {
chomp;
$_=~/\s+(\S+)/;
my $scaf = $1;
my $out = $str . "_" . $scaf . ".gb";
my $assession = $str . "_" . $scaf;
$_=~s/ACCESSION.+/ACCESSION $assession/g;# 添加ACCESSION number
open OUT, ">$out" || die;
print OUT "LOCUS$_";
close OUT;
}
close IN;
}


# predict Gene Islands
$/ = "\n";
my @gb = glob("*.gb");
foreach (@gb) {
$_=~/(.+).gb/;
my $out = $1 . ".island";
system("islandpath $_ $out");
}

# Get features from GenBank files
foreach my $gb (@gb) {
$gb=~/(.+).gb/;
my $str = $1;
my $out = $str . ".list";
my $seqin = Bio::SeqIO->new( -format => 'genbank', -file => "$gb");
open OUT, ">$out" || die;
while( (my $seq = $seqin->next_seq) ) {
foreach my $sf ( $seq->get_SeqFeatures ) {
if( $sf->primary_tag eq 'CDS' ) {
my @tags = $sf ->get_all_tags();
#print join("\t", @tags) . "\n";
print OUT $sf->get_tag_values('locus_tag'), "\t", $sf->start, "\t", $sf->end, "\t", $sf->strand, "\t", $sf->get_tag_values('product'), "\t", $sf->get_tag_values('translation'),"\n";
}
}
}
}

# Parser the results
my (%hash, %gi, %list, %gif);
open OUT, ">All_island.txt" || die;
print OUT "Sequence IDs Predictor Category GI Start GI End . Strand . Island IDs Gene IDs Gene Start Gene End Strand Products Protein Sequences\n";
open LIST, ">All_island.list" || die;
print LIST "Island IDs\tGI Start\tGI End\tGI Length\tGene Number\tGene IDs\n";
my @GI = glob("*.island");
foreach (@GI) {
$_=~/(.+).island/;
my $list = $1 . ".list";
open IN, "$_" || die;
<IN>;
while (<IN>) {
chomp;
$_=~s/[\r\n]+//g;
my @lines = split /\t/;
my $start = $lines[3];
my $end = $lines[4];
my $id = $lines[-1];
my $gilen = $end - $start + 1;
$hash{$id} = $start . "-" . $end . "-" . $id;
$gi{$id} = $_;
$gif{$id} = $start . "\t" . $end . "\t" . $gilen;
}
close IN;
open GB, "$list" || die;
while (<GB>) {
chomp;
my @line = split /\t/;
foreach my $ids (sort keys %hash) {
my ($start2, $end2, $gi) = split /\-/, $hash{$ids};
if (($line[1] >= $start2) && ($line[2] <= $end2)) {
print OUT $gi{$gi} . "\t" . $_ . "\n";
push @{$list{$gi}}, $line[0];
}
}
}
close GB;
}
close OUT;

foreach (sort keys %list) {
print LIST $_ . "\t" . $gif{$_} . "\t" . @{$list{$_}} . "\t" . join (" ", @{$list{$_}}) . "\n";
}
close LIST;

将“run_islandpath.pl”与GenBank文件放在同一目录下,在终端里运行如下命令:

1
perl run_islandpath.pl

结果汇总于All_island.txtAll_island.list中,内容如下面二图所示。

All_island.txt

All_island.list

脚本获取

关注公众号“生信之巅”,聊天窗口回复“5324”获取下载链接。

生信之巅微信公众号 生信之巅小程序码

敬告:使用文中脚本请引用本文网址,请尊重本人的劳动成果,谢谢!Notice: When you use the scripts in this article, please cite the link of this webpage. Thank you!

上一篇:
按照Contig切割GenBank文件
下一篇:
预测前噬菌体prophages