利用NCycDB数据库从宏基因组中预测氮循环基因
发表于:2021-11-25 | 分类: 生物信息
字数统计: 1.6k | 阅读时长: 7分钟 | 阅读量:

氮(N)循环是地球生态系统中重要的生物地球化学途径的集合,在生态学和环境研究中得到了广泛的关注。目前,鸟枪法宏基因组测序Shotgun metagenome sequencing已被广泛应用于探索负责 N 循环过程的基因家族。NCycDB是一个手动管理的综合数据库,用于从鸟枪法宏基因组测序数据中快速准确地分析 N 循环基因(亚)家族。 NCycDB 总共包含 68 个基因(亚)家族,涵盖 8 个 N 循环过程,分别具有 95% 和 100% 一致性阈值的 84 759 和 219 146 个代表性序列。数据库中还包含了1958 个同源直系同源组Homologous orthology groups的序列,以避免由于“小数据库”问题导致的假阳性分配。

数据库及脚本

下载

  • 通过Git

    1
    git clone https://github.com/liaochenlanruo/NCyc.git
  • 通过wget

    1
    wget https://github.com/liaochenlanruo/NCyc/archive/refs/heads/master.zip

配置

  • 通过Git下载的不需要解压,通过Wget下载的需要先解压。
  • 修改NCycProfilter.PL文件的第8-13行中4个依赖软件的安装路径。
  • data目录下的NCyc_100_2019Jul.7z解压,将解压得到的NCyc_100_2019Jul重命名为NCyc_100.faa并移动至data目录下。

依赖

  • Blast
  • diamond
  • usearch
  • Perl

输入文件

  • 序列文件
    宏基因组测序得到的Reads文件、组装后的序列文件以及通过基因预测后得到的氨基酸序列文件均可。序列文件可以是压缩的,也可以是解压的。

  • 基因组-序列数对应文件
    提供一份文本文档,共包含两列,第一列为样本名称(即序列文件的名字,不带文件后缀名),第二列为样本包含的序列数量

预测

1
perl NCycProfiler.PL -d ./ -m diamond -f faa -s prot -si SI.txt -o Ncycle.out.txt

:::info
参数解析:
:::

-d 指定工作目录,即序列文件所在目录。
-m 指定用哪个软件进行序列比对,可选diamondblastusearch
-f 指定序列文件的后缀名,不需要带.
-s 指定序列类型,氨基酸为prot,核苷酸为nucl
-si 基因组-序列数对应文件
-rs 随机取样大小,如果不指定,将取包含序列最少的样本的序列数
-o 指定输出的文件名称

结果解析

得到的结果文件是一个表格,第一行为随机取样大小。第一列为参与N循环的基因名,其他列为各样本含有的对应基因的数量。

可视化

可参考本站另一篇文章R语言绘制气泡图Bubb_Plot进行数据可视化。

  • 不带分组
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
setwd("E:/Researches/Xiaqian/NGS/CleanData/宏基因组数据/Result/NCyc")
data <- read.table("Ncycle.out.txt",header = TRUE, sep = "\t")

library(ggplot2)
library(reshape)
data_melt <- melt(data)
names(data_melt) = c("Genes", "Samples", "Abundances")
data_melt <-as.data.frame(data_melt)

# 做主图
bubble <- ggplot(data_melt[which(data_melt$Abundances>0),], aes(x = Samples, y = Genes, size = Abundances, color = Samples)) + geom_point()

# 修改细节 — 图注,点大小,点shape
bubble_style <- bubble + theme_classic()+
labs(
x = "Sediment layers",
y = "N cycling genes",
color="Samples", # 颜色图注名
size="Abundances")+ # 大小图注名
scale_size(range = c(0.1, 10), breaks = seq(0.1, 0.6, 0.2)) + #等比修改圆圈大小
theme(plot.title=element_text(family="Arial",size=8,
color="red",face="italic",
hjust=0.5,lineheight=0.5),
plot.subtitle = element_text(hjust = 0.5)) +
theme(axis.text.x = element_text(angle = 0, hjust = 0.5))

bubble_style

依据下表获得基因的PathwaysAnnotation,随后依据Pathways进行分组并绘图。

Pathways Gene (sub) families Annotation
Nitrification amoA_A Ammonia monooxygenase subunit A (archaea)
^^ amoB_A Ammonia monooxygenase subunit B (archaea)
^^ amoC_A Ammonia monooxygenase subunit C (archaea)
^^ amoA_B Ammonia monooxygenase subunit A (bacteria)
^^ amoB_B Ammonia monooxygenase subunit B (bacteria)
^^ amoC_B Ammonia monooxygenase subunit C (bacteria)
^^ hao Hydroxylamine dehydrogenase
^^ nxrA Nitrite oxidoreductase, alpha subunit
^^ nxrB Nitrite oxidoreductase, beta subunit
Denitrification napA Periplasmic nitrate reductase NapA
^^ napB Cytochrome c-type protein NapB
^^ napC Cytochrome c-type protein NapC
^^ narG Nitrate reductase
^^ narH Nitrate reductase
^^ narJ Nitrate reductase molybdenum cofactor assembly chaperone
^^ narI Nitrate reductase gamma subunit
^^ nirK Nitrite reductase (NO-forming)
^^ nirS Nitrite reductase (NO-forming)
^^ norB Nitric oxide reductase subunit B
^^ norC Nitric oxide reductase subunit C
^^ nosZ Nitrous-oxide reductase
^^ narZ Nitrate reductase 2, alpha subunit
^^ narY Nitrate reductase 2, beta subunit
^^ narV Nitrate reductase 2, gamma subunit
^^ narW Nitrate reductase 2, delta subunit
Assimilatory nitrate reduction nasA Assimilatory nitrate reductase catalytic subunit
^^ nasB Assimilatory nitrate reductase electron transfer subunit
^^ nirA Ferredoxin-nitrite reductase
^^ NR Nitrate reductase (NAD(P)H)
^^ narB Assimilatory nitrate reductase
^^ narC Cytochrome b-561
Dissimilatory nitrate reduction napA Periplasmic nitrate reductase NapA
^^ napB Cytochrome c-type protein NapB
^^ napC Cytochrome c-type protein NapC
^^ narG Nitrate reductase
^^ narH Nitrate reductase
^^ narJ Nitrate reductase molybdenum cofactor assembly chaperone
^^ narI Nitrate reductase gamma subunit
^^ narZ Nitrate reductase 2, alpha subunit
^^ narY Nitrate reductase 2, beta subunit
^^ narV Nitrate reductase 2, gamma subunit
^^ narW Nitrate reductase 2, delta subunit
^^ nirB Nitrite reductase (NADH) large subunit
^^ nirD Nitrite reductase (NADH) small subunit
^^ nrfA Nitrite reductase (cytochrome c-552)
^^ nrfB Cytochrome c-type protein NrfB
^^ nrfC Protein NrfC
^^ nrfD Protein NrfD
Nitrogen fixation anfG Nitrogenase delta subunit
^^ nifD Nitrogenase molybdenum-iron protein alpha chain
^^ nifH Nitrogenase iron protein NifH
^^ nifK Nitrogenase molybdenum-iron protein beta chain
^^ nifW Nitrogenase-stabilizing/protective protein
Anammox hzo Hydrazine oxidoreductase
^^ hzsA Hydrazine synthase subunit A
^^ hzsB Hydrazine synthase subunit B
^^ hzsC Hydrazine synthase subunit C
^^ hdh Hydrazine dehydrogenase
Organic degradation and synthesis ureA Urease subunit gamma
^^ ureB Urease subunit beta
^^ ureC Urease subunit alpha
^^ nao Nitroalkane oxidase
^^ nmo Nitronate monooxygenase
^^ gdh_K00260 Glutamate dehydrogenase
^^ gdh_K00261 Glutamate dehydrogenase (NAD(P)+)
^^ gdh_K00262 Glutamate dehydrogenase (NADP+)
^^ gdh_K15371 Glutamate dehydrogenase
^^ gs_K00264 Glutamate synthase (NADPH/NADH)
^^ gs_K00265 Glutamate synthase (NADPH/NADH) large chain
^^ gs_K00266 Glutamate synthase (NADPH/NADH) small chain
^^ gs_K00284 Glutamate synthase (ferredoxin)
^^ glsA Glutaminase
^^ glnA Glutamine synthetase
^^ asnB Asparagine synthase (glutamine-hydrolysing)
^^ ansB Glutamin-(asparagin-)ase
Others hcp Hydroxylamine reductase
^^ pmoA Particulate methane monooxygenase subunit A
^^ pmoB Particulate methane monooxygenase subunit B
^^ pmoC Particulate methane monooxygenase subunit C
  • 带分组
1
2
3
4
5
6
7
8
9
10
11
12
13
14
setwd("E:/Researches/Xiaqian/NGS/CleanData/宏基因组数据/Result/NCyc")

data <- read.table("Ncycle.out.txt",header = TRUE, sep = "\t")

library(ggplot2)
library(reshape)
data_melt <- melt(data)
names(data_melt) = c("Genes", "Annotation", "Pathways", "Samples", "Abundances")
data_melt <-as.data.frame(data_melt)
bubble <- ggplot(data_melt[which(data_melt$Abundances>0),], aes(x = Samples, y = Genes, size = Abundances, color = Samples)) + theme_bw()+ labs(x = "Sediment layers", y = "N cycling genes")+ theme(axis.text.x = element_text(angle = 0, colour = "black", vjust = 1, hjust = 1, size = 10), axis.text.y = element_text(size = 10)) +
theme(panel.grid = element_blank(), panel.border = element_blank()) +
theme(panel.spacing = unit(.1, "lines")) +
theme(plot.margin=unit(c(1, 0, 0, 1), "cm"))+ geom_point()+ facet_grid(Pathways ~ ., drop=TRUE, scale="free",space="free", switch = "y") +
theme(strip.background = element_rect(fill = "grey95", colour = "white"), strip.text.y.left = element_text(angle=360), strip.text=element_text(size=10))

参考

脚本获取

关注公众号“生信之巅”,聊天窗口回复“18ea”获取下载链接。

生信之巅微信公众号 生信之巅小程序码

敬告:使用文中脚本请引用本文网址,请尊重本人的劳动成果,谢谢!Notice: When you use the scripts in this article, please cite the link of this webpage. Thank you!

上一篇:
R语言绘制分组条形图
下一篇:
NCBI上传基因簇之tbl2asn的使用