计算蛋白质等电点并绘制全局pI图
发表于:2022-01-03 | 分类: 生物信息
字数统计: 1.2k | 阅读时长: 6分钟 | 阅读量:

蛋白质组的全局 pIs

细胞全局蛋白质组 pI 图的变化取决于氨基酸的总电荷,并对蛋白质的结构和特性具有重要意义。

普遍认为原核基因组具有两个最大的双峰形状,一个在酸性pH值下主要对应于溶解的蛋白质(细胞质蛋白或分泌蛋白),另一种在膜蛋白的碱性pH值下,具有细胞内碱性(带正电荷)结构域以促进质子动力的产生。在这两个峰之间,有一个最小的中性值,对应于细胞内的pH值(如下图)。

蛋白质氨基酸组成和 pI 水平的显着变化提供了一种工具来预测培养物或宏基因组组装基因组(MAG)的首选栖息地。

蛋白质组的全局 pIs
Pedro J. et al., 2019, Microbiome

安装EMBOSS

  • 下载
1
wget  ftp://emboss.open-bio.org/pub/EMBOSS/emboss-latest.tar.gz
  • 安装
1
2
3
4
5
6
7
8
9
10
11
# 解压
tar zxvf emboss-latest.tar.gz

# 防止报错 (error while loading shared libraries: libnucleus.so.6)
sudo /sbin/ldconfig

# 配置、编译与安装
cd emboss-latest
./configure
make
sudo make install

输入文件

输入文件为含有一条或多条氨基酸序列的FASTA格式文件。

计算氨基酸序列的各特征数据Calculate statistics of protein properties

逐个文件计算

1
pepstats -sequence F01_bin.1.faa -outfile F01_bin.1.out

:::primary 参数解析

  • Standard (Mandatory) qualifiers:

    • [-sequence] seqall Protein sequence(s) filename and optional
      format, or reference (input USA)
    • [-outfile] outfile [*.pepstats] Pepstats program output file
  • Advanced (Unprompted) qualifiers:

    • -aadata datafile [Eamino.dat] Amino acid properties
    • -mwdata datafile [Emolwt.dat] Molecular weight data for amino
      acids
    • -pkdata datafile [Epk.dat] Values of pKa for amino acids
    • -[no]termini boolean [Y] Include charge at N and C terminus
    • -mono boolean [N] Use monoisotopic weights

:::

批量计算与pI提取并输出为相对丰度

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
#!/usr/bin/perl
use strict;
use warnings;
# name: print_pI.pl
# Author: Liu Hualin
# Date: 2022.01.03

my @genome = glob("*.faa");
foreach (@genome) {
$_=~/(\S+).faa/;
my $out = $1 . ".pepstats";
my $pi = $1 . ".pI";
system("pepstats -sequence $_ -outfile $out");

my %hash;
my $seqnum = 0;
open IN, "$out" || die;
while (<IN>) {
chomp;
if (/^(Isoelectric Point = )(\S+)/) {
my $pi = sprintf "%.1f", $2;
$hash{$pi}++;
$seqnum++;
}
}
close IN;

#my @records = values %hash;
#my $seqnum = @records;
#print $seqnum . "\n";
open OUT, ">$pi" || die;
print OUT "Isoelectric point\tRelative frequency\n";
foreach (keys %hash) {
my $frequency = sprintf "%.2f", $hash{$_}/$seqnum;#@records;
print OUT "$_\t$frequency\n";
}
close OUT;
}

选择性忽略 (这是我自己用的)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
#!/usr/bin/perl
use strict;
use warnings;
# name: co_sample_pI.pl
# Author: Liu Hualin
# Date: 2022.01.03

open OUT, ">F06.pi" || die;
print OUT "MAGs\tIsoelectric point\tRelative frequency\n";
my @pI = glob("F06*.pI");
foreach (@pI) {
$_=~/(\S+).pI/;
my $mag = $1;
open IN, "$_" || die;
<IN>;
while (<IN>) {
chomp;
print OUT "$mag\t$_\n";
}
close IN;
}
close OUT;

可视化

绘制蛋白质组的全局 pIs图

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
# Step 1 读入数据
setwd("E:/Researches/Xiaqian/NGS/CleanData/宏基因组数据/Result/Results/Annotations/AAs")

F1 <-read.table("F01_bin.1.pI", sep = "\t", header = T)

# Step 2 绘图,以下5选1
## 单组加点,运行完后跳到Step 3
p1 <- ggplot(F1, aes(Isoelectric.point, Relative.frequency))+geom_point(size=2,shape=21)+geom_smooth(method= "gam")+theme_classic()

## 单组无点,运行完后跳到Step 3
p1 <- ggplot(F1, aes(Isoelectric.point, Relative.frequency))+geom_smooth(method= "gam")+theme_classic()


## 多组加点
p1 <- ggplot(F1, aes(Isoelectric.point, Relative.frequency, color=MAGs))+geom_point(size=3,shape=21)+geom_smooth(method= "gam")+theme_classic()


## 多组无点显示置信区间
p1 <- ggplot(F1, aes(Isoelectric.point, Relative.frequency, color=MAGs))+geom_smooth(method= "gam")+theme_classic()

## 多组无点去除置信区间
p1 <- ggplot(F1, aes(Isoelectric.point, Relative.frequency, color=MAGs))+geom_smooth(method= "gam",se = FALSE)+theme_classic()


# Step 3 美化,看下图
p1 + scale_x_continuous(limits=c(2, 14), breaks = seq(2, 14, 1))+ scale_y_continuous(limits=c(-0.01, max(F1$Relative.frequency)), breaks = seq(0,max(F1$Relative.frequency), 0.01))+ labs(x= "Isoelectric point",y = "Relative frequency")

## ==============以下代码为自用========================
## F04自定义颜色
p1 + scale_x_continuous(limits=c(2, 14), breaks = seq(2, 14, 1))+ scale_y_continuous(limits=c(-0.01, max(F1$Relative.frequency)), breaks = seq(0,max(F1$Relative.frequency), 0.01))+ labs(x= "Isoelectric point",y = "Relative frequency")+ scale_color_manual(name="MAGs", values=c("#1f77b4","#ff7f0e","#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#bcbd22"))

## F05自定义颜色
p1 + scale_x_continuous(limits=c(2, 14), breaks = seq(2, 14, 1))+ scale_y_continuous(limits=c(-0.01, max(F1$Relative.frequency)), breaks = seq(0,max(F1$Relative.frequency), 0.01))+ labs(x= "Isoelectric point",y = "Relative frequency")+ scale_color_manual(name="MAGs", values=c("#1f77b4","#ff7f0e","#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#bcbd22", "#17becf"))

## F06自定义颜色
p1 + scale_x_continuous(limits=c(2, 14), breaks = seq(2, 14, 1))+ scale_y_continuous(limits=c(-0.01, max(F1$Relative.frequency)), breaks = seq(0,max(F1$Relative.frequency), 0.01))+ labs(x= "Isoelectric point",y = "Relative frequency")+ scale_color_manual(name="MAGs", values=c("#1f77b4","#ff7f0e","#2ca02c", "#d62728", "#9467bd", "#8c564b", "#e377c2", "#bcbd22", "#17becf", "#f7b6d2", "#5254a3", "#000000"))

蛋白质组的全局 pIs

参考

代码获取

关注公众号“生信之巅”,聊天窗口回复“85d7”获取下载链接。

生信之巅微信公众号 生信之巅小程序码

敬告:使用文中脚本请引用本文网址,请尊重本人的劳动成果,谢谢!Notice: When you use the scripts in this article, please cite the link of this webpage. Thank you!

上一篇:
使用DeepARG预测抗生素抗性基因ARGs
下一篇:
利用GTDB-TK对细菌和古菌基因组进行物种分类