代谢组学常用仪器特点
仪器 | 特点 |
---|---|
GC-MS | 易挥发,低极性,热稳定的小分子化合物;需衍生化 |
LC-MS | 具有一定极性的有机化合物;无需衍生化 |
NMR | 无偏性,无损检测;•无需繁琐前处理,便于活体、原位的动态检测 |
CE-MS | 高极性化合物,如核酸,蛋白等 |
ICP-MS | 无机化合物 |
LC-QTOF 原理
Q-TOF 全称为四极飞行时间质谱仪(Quadrupole Time-of-Flight Mass Spectrometer)。其基本原理是将样品离子通过四极杆进行质量筛选,然后进入飞行时间质谱器(Time-of-Flight Mass Analyzer,ToF),测定其离子的飞行时间,从而得到样品中离子的质量信息。Q-TOF 有正离子模式(positive ion mode, POS)和负离子模式(negative ion mode, NEG)两种电离方式,在检测代谢组时结合使用两种方式可以使代谢物覆盖率更高,检测效果也更好。
Q-TOF 的工作过程包含以下步骤:
离子化:样品通过电喷雾或者 MALDI 等方法被电离成为离子。
生成离子束:离子经过引出阱、加速器、光栅偏转镜等装置,形成一个带电的离子束。
四极杆质量筛选:离子束进入四极杆,通过高频交变电压(RF)和直流电压(DC)的控制,将不同质量的离子分离出来。
飞行时间质谱分析:通过激光或者其他方法对分散的离子束施加助推、聚焦和分析,并且测定其到达检测器所需的时间。不同质量的离子抵达检测器所需要的时间不同,从而可以得到离子的质量信息。
FAQ:POS 和 NEG 数据的利用。
负离子和正离子是数据采集时的时候的两种模式生成两套数据,一级分析(MS)结果中是提供 POS 和 NEG 两个检测结果列表的,但是在二级分析(MSMS)结果中,我们将鉴定正负离子模式下鉴定到的物质进行了合并,所以二级分析是针对代谢物来做的,不区分正负离子模式。
FAQ: 当正负离子模式下都检测到了某种物质,如何对该物质的结果进行呈现?
会根据匹配参数进行取舍,选用匹配参数更好的模式下的数据在二级结果中进行分析。
数据分析
下机数据格式转换
以下两个软件二选一,Windows 下建议使用 ProteoWizard,其用法可参考用 metid 构建代谢组数据库。Linux 下建议使用 massconverter。
ProteoWizard
将 raw data 转换为 mzXML
格式。
massconverter
将 raw data 转换为 mzXML
格式。
安装
if(!require(remotes)){
install.packages("remotes")
}
remotes::install_github("tidymass/massconverter")
# 获取pwiz
library(massconverter)
docker_pull_pwiz()
Debug:如果报错 Error in docker_pull_pwiz() : Please install docker first (https://www.docker.com/get-started)
,则首先确认系统是否已安装 docker,如果未安装,请先安装 docker;如已经安装,则是因为普通用户无运行 docker 的权限,将其添加到 docker 用户组中即可。
# root模式下运行如下shell命令,将用户lhl添加到docker组中
gpasswd -a lhl docker
# 重启docker生效
service docker restart
设置转换参数
此处调用 MSConvert 进行格式转换,期间进行了过滤,主要采用了 Peak Picking
, Subset
,和 Zero Samples
方法。
parameter =
massconverter::create_msconvert_parameter(
output_format = "mzXML",# "mzXML", "mzML", "mz5", "mgf", "text", "ms1", "cms1", "ms2", "cms2"
binary_encoding_precision = "64",
zlib = TRUE,
write_index = TRUE,
peak_picking_algorithm = "cwt",# "vendor" or "cwt"
vendor_mslevels = c(1, NA),# 当前调用的是cwt,此参数可能无效
cwt_mslevels = c(1, NA),# 如此设置即可
cwt_min_snr = 0.1,# minimum signal-to-noise ratio,设为0.1即可
cwt_min_peak_spacing = 0.1,# minimum peak spacing,设为0.1即可
subset_polarity = "positive",# "any", "positive" or "negative",根据模式选择
subset_scan_number = c(NA, NA),# A two numeric vector. Can be c(NA, NA) if don't use this
subset_scan_time = c(60, 300),# A two numeric vector. Can be c(NA, NA) if don't use this
subset_mslevels = c(1, 2),# A two numeric vector. Second can be set as NA
zero_samples_mode = "removeExtra",# "no", "removeExtra", or "addMissing"
zero_samples_mslevels = c(1, NA),# A two numeric vector. Second can be set as NA
zero_samples_add_missing_flanking_zero_count = 5# 默认5即可
)
开始转换
convert_raw_data(input_path = "demo_data/raw_data",
output_path = "demo_data/mzxml",
msconvert_parameter = parameter,
docker_parameters = c(),
process_all = FALSE)
安装 tidyMass
mamba install gcc_linux-64 gxx_linux-64 gfortran_linux-64
mamba install bioconductor-msnbase bioconductor-rdisop r-openxlsx bioconductor-mzr bioconductor-xcms r-rvest r-tidyr r-stringi r-tidyversedyve r-hexbin
if(!require(remotes)){
install.packages("remotes")
}
remotes::install_github("tidymass/masstools")
remotes::install_github("tidymass/tidymass")
原始数据处理
调用 massprocesser 包进行原始数据处理,进行峰检测与对齐,输出 peak table 包括:
原始数据导入(read spectra from file)、峰检测(detect mass traces & detect chromatographic peaks)、校正保留时间(Correcting rentention time)、对已鉴定的色谱峰进行保留时间校正、Grouping peaks across samples、输出 peak table。
数据目录结构
- mzxml/mzxml_ms1_data/
- POS/
- Case/*.mzXML
- Control/*.mzXML
- QC/*.mzXML
- NEG/
- Case/*.mzXML
- Control/*.mzXML
- QC/*.mzXML
- POS/
- mzxml/mgf_ms2_data/mgf_ms2_data/
- POS/*.mgf
- NEG/*.mgf
- mzxml/sample_info/
- sample_info_pos.csv
- sample_info_neg.csv
正离子模式
注意:path 指定的路径名不可以数字开头!
library(tidymass)
process_data(
path = "mzxml_ms1_data/POS",# 路径根据实际情况定
polarity = "positive",
ppm = 10,
peakwidth = c(10, 60),
threads = 30,
output_tic = TRUE,
output_bpc = TRUE,
output_rt_correction_plot = TRUE,
min_fraction = 0.5,
group_for_figure = "QC",
snthresh = 10,
prefilter = c(3, 500),
fitgauss = FALSE,
integrate = 2,
mzdiff = 0.01,
noise = 500,
binSize = 0.025,
bw = 5,
fill_peaks = FALSE
)
- ppm:numeric defining the maximal tolerated m/z deviation in consecutive scans in parts per million (ppm) for the initial regions-of-interest (ROI) definition。此处的 ppm 来源于 xcms 包,但在 xcms 中不同的函数中,含义貌似不一样,需进一步确认。
- peakwith:峰宽,取决于柱子(column)和 LC 系统。numeric (2) with the expected approximate peak width in chromatographic space. Given as a range (min, max) in seconds.
- min_fraction:如果一个峰出现在至少一组样本中
min_fraction
样本以上(按比例),则将保留该峰值。 - snthresh :信噪比阈值。
- prefilter:c (k, I) 在第一次分析步骤(ROI 检测)中指定预筛选步骤。仅保留包含至少强度 >=I 的 k 个峰的的质量轨迹(Mass traces)。
- fitgauss :是否应对每个峰拟合高斯分布。主要影响峰的保留时间位置。
- integrate:积分方法(1 或 2)。对于 integrate=1,Peak limits 通过在 mexican hat 过滤后的数据上下降(descent)来确定;而对于 integrate=2,则在原始数据上进行下降处理。后一种方法更准确但容易受到噪声的影响,而前一种方法更稳健但不太精确。
- mzdiff :representing the minimum difference in m/z dimension required for peaks with overlapping retention times; can be negative to allow overlap. During peak post-processing, peaks defined to be overlapping are reduced to the one peak with the largest signal.
- noise:设定第一步分析中需要的质心(centroids )最小强度(intensity < noise 的质心将被省略在感兴趣区域检测之外)。
- binSize :将 m/z 轴按照 binSize 划分为多个区间,再对各区间内的离子信号进行聚合。较小的 binSize 使得峰检测更加准确,检测到更多 features,需要更多计算资源。
- bw:带宽,通常在 5-50 间,这是用于峰值检测的核密度估计 (KDE) 算法中使用的参数。bw 控制估计的 KDE 曲线的平滑度,并确定峰的解析程度。较小的 bw 值将产生更详细的基础峰形表示,在紧密间隔或重叠的峰之间具有更好的分辨率,但也可能产生更高水平的噪声。相反,较大的 bw 值将导致对峰形的估计更平滑、更广义,这有助于降低噪声和杂散检测,但在某些情况下也可能导致峰分辨率降低。bw 值的选择将取决于所分析数据的具体特征以及分析的目标。通常,有必要试验一系列 bw 值,以找到给定数据集的最佳设置。
- fill_peaks:Fill peaks NA or not。
结果:
POS/Result/
object:用于下一步分析
Peak_table.csv:用于下一步分析的峰表
Peak_table_for_cleaning.csv:删除了
Peak_table.csv
中无用的列,可直接用于后续的数据清洗BPC.pdf:Base peak chromatogram,仅展示
group_for_figure
参数指定组内的样本TIC.pdf:Total ion peak chromatogram,经色谱分离流出的组分不断进入质谱,质谱连续扫描采集数据,每一次扫描得到一张质谱图,将每一张质谱图中所有离子强度相加得到总离子流强度。然后以总离子强度为纵坐标,时间为横坐标绘图。
RT correction plot.pdf
intermediate_data:中间文件目录,如果需要重新运行 data processing,则需先删除该目录。
# 加载对象 load("mzxml_ms1_data/POS/Result/object") # 查看metabolic features数量 object # 获取互动图,在Rstudio中才能显示 load("mzxml_ms1_data/POS/Result/intermediate_data/xdata2") plot = massprocesser::plot_adjusted_rt(object = xdata2, group_for_figure = "QC", interactive = TRUE) plot
负离子模式
massprocesser::process_data(
path = "mzxml_ms1_data/NEG",
polarity = "negative",
ppm = 10,
peakwidth = c(10, 60),
threads = 30,
output_tic = TRUE,
output_bpc = TRUE,
output_rt_correction_plot = TRUE,
min_fraction = 0.5,
group_for_figure = "QC",
snthresh = 10,
prefilter = c(3, 500),
fitgauss = FALSE,
integrate = 2,
mzdiff = 0.01,
noise = 500,
binSize = 0.025,
bw = 5,
fill_peaks = FALSE
)
# 加载对象
load("mzxml_ms1_data/NEG/Result/object")
# 查看metabolic features数量
object
Explore data
将 peak table 和样本信息(元数据)整合,得到 mass_dataset
class 对象。获取数据的总结信息。
library(tidyverse)
正离子模式
# 加载对象
load("mzxml_ms1_data/POS/Result/object")
object_pos <- object
object_pos
# 读入样本信息
sample_info_pos <- readr::read_csv("sample_info/sample_info_pos.csv")
# 查看object_pos中的元数据
object_pos %>% extract_sample_info() %>% head()
#> sample_id group class injection.order
#> 1 sample_06 Case Subject 1
#> 2 sample_103 Case Subject 2
#> 3 sample_11 Case Subject 3
#> 4 sample_112 Case Subject 4
#> 5 sample_117 Case Subject 5
#> 6 sample_12 Case Subject 6
#?? 为何还没整合的情况下object_pos中就存在了部分元数据列?
# 移除object_pos中的"group", "class", "injection.order"
object_pos <- object_pos %>% activate_mass_dataset(what = "sample_info") %>% dplyr::select(-c("group", "class", "injection.order"))
# 将sample_info_pos 中的所有列整合到object_pos 中
object_pos = object_pos %>% activate_mass_dataset(what = "sample_info") %>% left_join(sample_info_pos, by = "sample_id")
# 查看元数据信息
object_pos %>% extract_sample_info() %>% head()
#> sample_id injection.order class batch subject_id group
#> 1 sample_06 4 Subject 1 subject_414 Case
#> 2 sample_103 71 Subject 1 subject_330 Case
#> 3 sample_11 6 Subject 1 subject_125 Case
#> 4 sample_112 78 Subject 1 subject_295 Case
#> 5 sample_117 80 Subject 1 subject_793 Case
#> 6 sample_12 8 Subject 1 subject_387 Case
# 保存数据
dir.create("data_cleaning/POS", showWarnings = FALSE, recursive = TRUE)
save(object_pos, file = "data_cleaning/POS/object_pos")
# 统计样本数和variables数
object_pos
# 根据class统计样本数量,可将class换成group或batch等
object_pos %>% activate_mass_dataset(what = "sample_info") %>% dplyr::count(class)
# 获取peak分布图
pdf(file="data_cleaning/POS/peak_distributation_plot_positive.pdf")
p<- object_pos %>% `+`(1) %>% log(10) %>% show_mz_rt_plot(hex = FALSE)
p+ scale_size_continuous(range = c(0.01, 2))
dev.off()
# 查看总缺失值数量
get_mv_number(object = object_pos)
#[1] 786005
# 查看各样本内的缺失值
get_mv_number(object = object_pos, by = "sample") %>% head()
# sample_06 sample_103 sample_11 sample_112 sample_117 sample_12
# 4017 2713 4064 2981 2920 3846
# 查看各variable的缺失值
get_mv_number(object = object_pos, by = "variable") %>% head()
#M70T73_POS M70T53_POS M70T183_POS M70T527_POS M71T695_POS M71T183_POS
# 129 16 155 54 133 169
# 绘图展示缺失值信息,可在下一节生成
pdf(file="data_cleaning/POS/total_MVs.pdf")
show_missing_values(object = object_pos, show_column_names = TRUE, show_row_names = TRUE, percentage = TRUE)
dev.off()
# 绘图展示各样本缺失值信息,可在下一节生成
pdf(file="data_cleaning/POS/Samples_MVs.pdf")
show_sample_missing_values(object = object_pos, percentage = TRUE, color_by = "class")
dev.off()
# 绘图展示各variables缺失值信息,可在下一节生成
pdf(file="data_cleaning/POS/Variables_MVs.pdf")
p<- show_variable_missing_values(
object = object_pos,
percentage = TRUE,
show_x_text = FALSE,
show_x_ticks = FALSE,
color_by = "mz"
)
p+ scale_size_continuous(range = c(0.01, 1))
dev.off()
黑色代表缺失值。
负离子模式
# 加载对象
load("mzxml_ms1_data/NEG/Result/object")
object_neg <- object
object_neg
# 读入样本信息
sample_info_neg <- readr::read_csv("sample_info/sample_info_neg.csv")
object_neg %>% extract_sample_info() %>% head()
# sample_id group class injection.order
#1 sample_06 Case Subject 1
#2 sample_103 Case Subject 2
#3 sample_11 Case Subject 3
#4 sample_112 Case Subject 4
#5 sample_117 Case Subject 5
#6 sample_12 Case Subject 6
object_neg <- object_neg %>% activate_mass_dataset(what = "sample_info") %>% dplyr::select(-c("group", "class", "injection.order"))
# 将sample_info_neg添加至object_neg
object_neg = object_neg %>% activate_mass_dataset(what = "sample_info") %>% left_join(sample_info_neg, by = "sample_id")
object_neg %>% extract_sample_info() %>% head()
# sample_id injection.order class batch subject_id group
#1 sample_06 4 Subject 1 subject_414 Case
#2 sample_103 71 Subject 1 subject_330 Case
#3 sample_11 6 Subject 1 subject_125 Case
#4 sample_112 78 Subject 1 subject_295 Case
#5 sample_117 80 Subject 1 subject_793 Case
#6 sample_12 8 Subject 1 subject_387 Case
# 保存数据
dir.create("data_cleaning/NEG", showWarnings = FALSE, recursive = TRUE)
save(object_neg, file = "data_cleaning/NEG/object_neg")
# 统计样本数和variables数
object_neg
# 根据class统计样本数量,可将class换成group或batch等
object_neg %>% activate_mass_dataset(what = "sample_info") %>% dplyr::count(class)
# 获取peak分布图
pdf(file="data_cleaning/NEG/peak_distributation_plot_negtive.pdf")
object_neg %>% `+`(1) %>% log(10) %>% show_mz_rt_plot() + scale_size_continuous(range = c(0.01, 2))
dev.off()
# 查看总缺失值数量
get_mv_number(object = object_neg)
#[1] 748237
# 查看各样本内的缺失值
get_mv_number(object = object_neg, by = "sample") %>% head()
# sample_06 sample_103 sample_11 sample_112 sample_117 sample_12
# 3029 2766 3298 3100 2912 3053
# 查看各variable的缺失值
get_mv_number(object = object_neg, by = "variable") %>% head()
#M70T712_NEG M70T527_NEG M70T587_NEG M70T47_NEG M71T587_NEG M71T641_NEG
# 16 137 2 146 41 19
# 绘图展示缺失值信息,可在下一节生成
pdf(file="data_cleaning/NEG/total_MVs.pdf")
show_missing_values(object = object_neg, show_column_names = FALSE, percentage = TRUE)
dev.off()
# 绘图展示各样本缺失值信息,可在下一节生成
pdf(file="data_cleaning/NEG/Samples_MVs.pdf")
show_sample_missing_values(object = object_neg, percentage = TRUE)
dev.off()
# 绘图展示各variables缺失值信息,可在下一节生成
pdf(file="data_cleaning/NEG/Variables_MVs.pdf")
p<- show_variable_missing_values(
object = object_neg,
percentage = TRUE,
show_x_text = FALSE,
show_x_ticks = FALSE
)
p+ scale_size_continuous(range = c(0.01, 1))
dev.off()
数据清洗(Data cleaning)
查看质量
# 加载数据
load("data_cleaning/POS/object_pos")
load("data_cleaning/NEG/object_neg")
# 将批次号改为字符串
object_pos <- object_pos %>% activate_mass_dataset(what = "sample_info") %>% dplyr::mutate(batch = as.character(batch))
object_neg <- object_neg %>% activate_mass_dataset(what = "sample_info") %>% dplyr::mutate(batch = as.character(batch))
# 先评估数据质量
massqc::massqc_report(object = object_pos, path = "data_cleaning/POS/data_quality_before_data_cleaning")
massqc::massqc_report(object = object_neg, path = "data_cleaning/NEG/data_quality_before_data_cleaning")
从上两图可看出正离子模式下存在严重的批次效应,负离子模式下不存在明显的批次效应。
开始清洗
移除噪音代谢 features—— 缺失值过滤。
移除超过 20% QC 样本中含有 MVs 以及至少在 50% 实验组或对照组样本中含有 MVs 的 variables。
正离子模式
# 查看各分组样本量 object_pos %>% activate_mass_dataset(what = "sample_info") %>% dplyr::count(group) #> group n #> 1 Case 110 #> 2 Control 110 #> 3 QC 39 # QC样本中的MV占比 pdf(file="data_cleaning/POS/MVpercentQC.pdf") p<- show_variable_missing_values(object = object_pos %>% activate_mass_dataset(what = "sample_info") %>% filter(class == "QC"), percentage = TRUE) p+ scale_size_continuous(range = c(0.01, 2)) dev.off() # 统计QC中的MV占比 qc_id = object_pos %>% activate_mass_dataset(what = "sample_info") %>% filter(class == "QC") %>% pull(sample_id) # 统计对照组中的MV占比 control_id = object_pos %>% activate_mass_dataset(what = "sample_info") %>% filter(group == "Control") %>% pull(sample_id) # 统计实验组中的MV占比 case_id = object_pos %>% activate_mass_dataset(what = "sample_info") %>% filter(group == "Case") %>% pull(sample_id) # 整合以上统计信息 object_pos = object_pos %>% mutate_variable_na_freq(according_to_samples = qc_id) %>% mutate_variable_na_freq(according_to_samples = control_id) %>% mutate_variable_na_freq(according_to_samples = case_id) head(extract_variable_info(object_pos)) #variable_id mz rt na_freq na_freq.1 na_freq.2 #1 M70T73_POS 70.04074 73.31705 0.28205128 0.6000000 0.4727273 #2 M70T53_POS 70.06596 52.78542 0.00000000 0.1454545 0.0000000 #3 M70T183_POS 70.19977 182.87981 0.00000000 0.6636364 0.7454545 #4 M70T527_POS 70.36113 526.76657 0.02564103 0.1818182 0.3000000 #5 M71T695_POS 70.56911 694.84592 0.02564103 0.6454545 0.5545455 #6 M71T183_POS 70.75034 182.77790 0.05128205 0.7272727 0.7909091 # 移除variables object_pos <- object_pos %>% activate_mass_dataset(what = "variable_info") %>% filter(na_freq < 0.2 & (na_freq.1 < 0.5 | na_freq.2 < 0.5)) object_pos
注:na_freq、na_freq.1 和 na_freq.2 在此处分别代表某 variables 在 QC 样本、对照组样本和实验组样本中缺失值 MV 所占的百分比。
负离子模式
# 查看各分组样本量 object_neg %>% activate_mass_dataset(what = "sample_info") %>% dplyr::count(group) #> group n #> 1 Case 110 #> 2 Control 110 #> 3 QC 39 # QC样本中的MV占比 pdf(file="data_cleaning/NEG/MVpercentQC.pdf") p<- show_variable_missing_values(object = object_neg %>% activate_mass_dataset(what = "sample_info") %>% filter(class == "QC"), percentage = TRUE) p+ scale_size_continuous(range = c(0.01, 2)) dev.off() # 统计QC中的MV占比 qc_id = object_neg %>% activate_mass_dataset(what = "sample_info") %>% filter(class == "QC") %>% pull(sample_id) # 统计对照组中的MV占比 control_id = object_neg %>% activate_mass_dataset(what = "sample_info") %>% filter(group == "Control") %>% pull(sample_id) # 统计实验组中的MV占比 case_id = object_neg %>% activate_mass_dataset(what = "sample_info") %>% filter(group == "Case") %>% pull(sample_id) # 整合以上统计信息 object_neg = object_neg %>% mutate_variable_na_freq(according_to_samples = qc_id) %>% mutate_variable_na_freq(according_to_samples = control_id) %>% mutate_variable_na_freq(according_to_samples = case_id) head(extract_variable_info(object_neg)) #variable_id mz rt na_freq na_freq.1 na_freq.2 #1 M70T712_NEG 70.05911 711.97894 0.05128205 0.109090909 0.018181818 #2 M70T527_NEG 70.13902 526.85416 0.33333333 0.509090909 0.618181818 #3 M70T587_NEG 70.31217 587.25330 0.00000000 0.009090909 0.009090909 #4 M70T47_NEG 70.33835 46.57537 0.84615385 0.936363636 0.090909091 #5 M71T587_NEG 70.56315 587.02570 0.17948718 0.145454545 0.163636364 #6 M71T641_NEG 70.70657 641.16672 0.10256410 0.063636364 0.072727273 # 移除variables object_neg <- object_neg %>% activate_mass_dataset(what = "variable_info") %>% filter(na_freq < 0.2 & (na_freq.1 < 0.5 | na_freq.2 < 0.5)) object_neg
过滤离群样本(Filter outlier samples)
正离子模式
# 总览 pdf(file="data_cleaning/POS/MVpercentALL.pdf") massdataset::show_sample_missing_values(object = object_pos, color_by = "group", order_by = "injection.order", percentage = TRUE) + theme(axis.text.x = element_text(size = 2)) + scale_size_continuous(range = c(0.1, 2)) + ggsci::scale_color_aaas() dev.off() # 检测离群样本 outlier_samples = object_pos %>% `+`(1) %>% log(2) %>% scale() %>% detect_outlier() outlier_samples #-------------------- #masscleaner #-------------------- #1 according_to_na : 0 outlier samples. #2 according_to_pc_sd : 0 outlier samples. #3 according_to_pc_mad : 0 outlier samples. #4 accordint_to_distance : 0 outlier samples. #4 accordint_to_distance : 0 outlier samples. #extract all outlier table using extract_outlier_table() outlier_table <- extract_outlier_table(outlier_samples) outlier_table %>% head() # according_to_na pc_sd pc_mad accordint_to_distance #sample_06 FALSE FALSE FALSE FALSE #sample_103 FALSE FALSE FALSE FALSE #sample_11 FALSE FALSE FALSE FALSE #sample_112 FALSE FALSE FALSE FALSE #sample_117 FALSE FALSE FALSE FALSE #sample_12 FALSE FALSE FALSE FALSE outlier_table %>% apply(1, function(x){ sum(x) }) %>% `>`(0) %>% which() # #named integer(0) ##无离群样本
# 删除离群样本,如果有的话 need_id_pos <- names(outlier_table %>% apply(1, function(x){ sum(x) }) %>% `==`(0) %>% which()) object_pos <- object_pos %>% activate_mass_dataset(what = "sample_info") %>% filter(sample_id %in% need_id_pos)
负离子模式
# 总览 pdf(file="data_cleaning/NEG/MVpercentALL.pdf") p<- massdataset::show_sample_missing_values(object = object_neg, color_by = "group", order_by = "injection.order", percentage = TRUE) p+ theme(axis.text.x = element_text(size = 2)) + scale_size_continuous(range = c(0.1, 2)) + ggsci::scale_color_aaas() dev.off() # 检测离群样本 outlier_samples = object_neg %>% `+`(1) %>% log(2) %>% scale() %>% detect_outlier() outlier_samples #-------------------- #masscleaner #-------------------- #1 according_to_na : 0 outlier samples. #2 according_to_pc_sd : 0 outlier samples. #3 according_to_pc_mad : 0 outlier samples. #4 accordint_to_distance : 0 outlier samples. #4 accordint_to_distance : 0 outlier samples. #extract all outlier table using extract_outlier_table() outlier_table <- extract_outlier_table(outlier_samples) outlier_table %>% head() # according_to_na pc_sd pc_mad accordint_to_distance #sample_06 FALSE FALSE FALSE FALSE #sample_103 FALSE FALSE FALSE FALSE #sample_11 FALSE FALSE FALSE FALSE #sample_112 FALSE FALSE FALSE FALSE #sample_117 FALSE FALSE FALSE FALSE #sample_12 FALSE FALSE FALSE FALSE outlier_table %>% apply(1, function(x){ sum(x) }) %>% `>`(0) %>% which() # #named integer(0) ##无离群样本
缺失值填充(Missing value imputation)
对原始数据中的缺失值进行模拟(missing value recoding)。方法包括:"knn", "rf" (missForest), "mean", "median", "zero", "minimum", "bpca", "svdImpute", "ppca"。
# 获取正离子模式下的MV数量
get_mv_number(object_pos)
#> [1] 148907
# 填充正离子模式缺失值
object_pos <- impute_mv(object = object_pos, method = "knn")
# 获取正离子模式下填充后的MV数量
get_mv_number(object_pos)
#> [1] 0
# 获取负离子模式下的MV数量
get_mv_number(object_neg)
#> [1] 146409
# 填充正离子模式缺失值
object_neg <- impute_mv(object = object_neg, method = "knn")
# 获取正离子模式下填充后的MV数量
get_mv_number(object_neg)
#> [1] 0
数据标准化与整合(Data normalization and integration)
数据标准化处理, 利用内标(internal standard, IS)进行归一化待确认是否通过 IS 实现。
正离子模式
标准化方法包括:"svr", "total", "median", "mean", "pqn", "loess";整合方法包括:"qc_mean", "qc_median", "subject_mean", "subject_median"。
object_pos <- normalize_data(object_pos, method = "median") object_pos2 <- integrate_data(object_pos, method = "subject_median") # 按批次分组绘制PCA图 pdf(file="data_cleaning/POS/PC_batch_intergrated.pdf") object_pos2 %>% `+`(1) %>% log(2) %>% massqc::massqc_pca(color_by = "batch", line = FALSE) dev.off()
负离子模式
object_neg <- normalize_data(object_neg, method = "median") object_neg2 <- integrate_data(object_neg, method = "subject_median") # 按批次分组绘制PCA图 pdf(file="data_cleaning/NEG/PC_batch_intergrated.pdf") object_neg2 %>% `+`(1) %>% log(2) %>% massqc::massqc_pca(color_by = "batch", line = FALSE) dev.off()
保存数据
save(object_pos2, file = "data_cleaning/POS/object_pos2")
save(object_neg2, file = "data_cleaning/NEG/object_neg2")
代谢物注释
加载数据
load("data_cleaning/POS/object_pos2")
load("data_cleaning/NEG/object_neg2")
将 MS2 数据添加到 mass_dataset
如果没有 MS2 数据,此步不执行应该也可以!
??只有 QC 样的 MS2 数据,MS2 数据是怎么来的?
正离子模式
object_pos2 <- mutate_ms2( object = object_pos2, column = "rp", # rp or hilic,对应RPLC(反相色谱)和HILIC(亲水相互作用色谱) polarity = "positive", ms1.ms2.match.mz.tol = 15,# ppm ms1.ms2.match.rt.tol = 30,# seconds path = "mgf_ms2_data/mgf_ms2_data/POS" ) #1043 out of 5101 variable have MS2 spectra. #Selecting the most intense MS2 spectrum for each peak... # summary extract_ms2_data(object_pos2)
负离子模式
object_neg2 <- mutate_ms2( object = object_neg2, column = "rp", polarity = "negative", ms1.ms2.match.mz.tol = 15, ms1.ms2.match.rt.tol = 30, path = "mgf_ms2_data/mgf_ms2_data/NEG" ) #1092 out of 4104 variable have MS2 spectra. #Selecting the most intense MS2 spectrum for each peak... # summary extract_ms2_data(object_neg2)
代谢物注释
需要考虑数据库是 RPLC 还是 HILIC。
正离子模式
# Annotate features using snyder_database_rplc0.0.3 load("metabolite_annotation/snyder_database_rplc0.0.3.rda") ## 查看数据库信息 snyder_database_rplc0.0.3 ## 注释 object_pos2 <- annotate_metabolites_mass_dataset( object = object_pos2, ms1.match.ppm = 15, rt.match.tol = 30, polarity = "positive", database = snyder_database_rplc0.0.3, threads =30) # Annotate features using orbitrap_database0.0.3 load("metabolite_annotation/orbitrap_database0.0.3.rda") ## 查看数据库信息 orbitrap_database0.0.3 ## 注释 object_pos2 <- annotate_metabolites_mass_dataset( object = object_pos2, ms1.match.ppm = 15, polarity = "positive", database = orbitrap_database0.0.3, threads =30) # Annotate features using mona_database0.0.3 load("metabolite_annotation/mona_database0.0.3.rda") ## 查看数据库信息 mona_database0.0.3 ## 注释 object_pos2 <- annotate_metabolites_mass_dataset( object = object_pos2, ms1.match.ppm = 15, polarity = "positive", database = mona_database0.0.3, threads =30)
annotate_metabolites_mass_dataset 参数解析:
- ms1.match.ppm:Precursor match ppm tolerance [25].
- ms2.match.ppm:Fragment ion match ppm tolerance [30].
- mz.ppm.thr:Accurate mass tolerance for m/z error calculation [400].
- ms2.match.tol:MS2 match (MS2 similarity) tolerance [0.5].
- fraction.weight:The weight for matched fragments [0.3].
- dp.forward.weight:Forward dot product weight [0.6].
- dp.reverse.weight:Reverse dot product weight [0.1].
- remove_fragment_intensity_cutoff:remove_fragment_intensity_cutoff [0].
- rt.match.tol:RT match tolerance [30].
- polarity:The polarity of data, "positive"or "negative".
- ce:Collision energy. Please confirm the CE values in your database. [all].
- column:"hilic" (HILIC column) or "rp" (reverse phase).
- ms1.match.weight:The weight of MS1 match for total score calculation [0.25].
- rt.match.weight:The weight of RT match for total score calculation [0.25].
- ms2.match.weight:The weight of MS2 match for total score calculation [0.5]
- total.score.tol:Total score tolerance. The total score are referring to MS-DIAL [0.5]
- candidate.num:The number of candidate [3]
负离子模式
# Annotate features using snyder_database_rplc0.0.3 object_neg2 <- annotate_metabolites_mass_dataset( object = object_neg2, ms1.match.ppm = 15, rt.match.tol = 30, polarity = "negative", database = snyder_database_rplc0.0.3) # Annotate features using orbitrap_database0.0.3 object_neg2 <- annotate_metabolites_mass_dataset( object = object_neg2, ms1.match.ppm = 15, polarity = "negative", database = orbitrap_database0.0.3) # Annotate features using mona_database0.0.3 object_neg2 <- annotate_metabolites_mass_dataset( object = object_neg2, ms1.match.ppm = 15, polarity = "negative", database = mona_database0.0.3)
查看注释结果
head(extract_annotation_table(object = object_pos2))
variable_info_pos <- extract_variable_info(object = object_pos2)
head(variable_info_pos)
table(variable_info_pos$Level)
table(variable_info_pos$Database)
保存数据
save(object_pos2, file = "metabolite_annotation/object_pos2")
save(object_neg2, file = "metabolite_annotation/object_neg2")
统计分析
加载数据
library(tidymass)
library(tidyverse)
load("metabolite_annotation/object_pos2")
load("metabolite_annotation/object_neg2")
移除未被注释的 features
正离子模式
object_pos2 <- object_pos2 %>% activate_mass_dataset(what = "annotation_table") %>% filter(!is.na(Level)) %>% filter(Level == 1 | Level == 2) object_pos2
负离子模式
object_neg2 <- object_neg2 %>% activate_mass_dataset(what = "annotation_table") %>% filter(!is.na(Level)) %>% filter(Level == 1 | Level == 2) object_neg2
合并 POS 和 NEG 数据
# inner merge for samples and full merge for variables
object <-
merge_mass_dataset(
x = object_pos2,
y = object_neg2,
sample_direction = "inner",# left, right, inner or full,此处用inner较合理
variable_direction = "full",# left, right, inner or full,此处用full合理
sample_by = "sample_id", # merge samples by what columns from sample_info
variable_by = c("variable_id", "mz", "rt")# merge variables by what columns from variable_info
)
object
名词解释:
- 左连接 (Left join):将左表中的所有记录都保留下来,而右表中与左表中记录没有匹配的部分则丢弃。
- 右连接 (Right join):将右表中的所有记录都保留下来,而左表中与右表中记录没有匹配的部分则丢弃。
- 内连接 (Inner join):只有两个表中都存在的记录才保留下来,否则丢弃。
- 全连接 (Full join):将左表和右表中的所有记录都保留下来,如果某个表中的记录没有匹配到另一个表中的记录,则用 NULL 填充。
Trace processing information of object
dir.create(path = "statistical_analysis", showWarnings = FALSE)
report_parameters(object = object, path = "statistical_analysis/")
基于 level 和 score 移除冗余注释代谢物
object <- object %>%
activate_mass_dataset(what = "annotation_table") %>%
group_by(Compound.name) %>%
filter(Level == min(Level)) %>%
filter(SS == max(SS)) %>%
slice_head(n = 1)
object <- object %>%
activate_mass_dataset(what = "annotation_table") %>%
group_by(variable_id) %>%
filter(Level == min(Level)) %>%
filter(SS == max(SS)) %>%
slice_head(n = 1)
为何要选最小的 level?SS 是什么 score?
样本聚类
load("statistical_analysis/object_final")
# 排除QC样本
temp_object <- object_final %>%
activate_mass_dataset(what = "sample_info") %>%
filter(group != "QC") %>%
`+`(1) %>%
log(2) %>%
scale()
library(ComplexHeatmap)
h1 <- HeatmapAnnotation(class = extract_sample_info(temp_object)$group)
pdf(file="statistical_analysis/Samples_heatmap.pdf")
massstat::Heatmap(matrix = temp_object,
name = "z-score",
row_names_gp = gpar(cex = 0.2),
column_names_gp = gpar(cex = 0.2),
top_annotation = h1)
dev.off()
差异表达代谢物
如果有多组数据,需要适当增加相应下述代码。“mutate_fc” 的逻辑是每运行一次,则在 object 中增加一列 “fc”,当有多个实验组时,可以将先前生成的 fc 重命名,因此,object 中可以包含多个组间比较的 fc 结果。
计算变化倍数
# 获取对照组样本名列表 control_sample_id = object %>% activate_mass_dataset(what = "sample_info") %>% filter(group == "Control") %>% pull(sample_id) # 获取实验组样本名列表 case_sample_id = object %>% activate_mass_dataset(what = "sample_info") %>% filter(group == "Case") %>% pull(sample_id) #! 如果有多个实验组,参照以上格式在此列出,假设有实验组Case2 case2_sample_id = object %>% activate_mass_dataset(what = "sample_info") %>% filter(group == "Case2") %>% pull(sample_id) # Calculate fold change,每次只能计算两个分组,如果有多个实验组,则依次将其与对照组比较 object <- mutate_fc(object = object, control_sample_id = control_sample_id, case_sample_id = case_sample_id, mean_median = "mean")#可选"mean", "median" #> 110 control samples. #> 110 case samples. #! 如果有多个实验组,则需要在此更改fc列的默认名,假设将默认的“fc”改为“fc_case1_vs_control” object <- object %>% activate_mass_dataset(what = "variable_info") %>% dplyr::rename(fc_case1_vs_control = fc) #! Calculate fold change,每次只能计算两个分组,如果有多个实验组,则依次将其与对照组比较,此处计算Case2与Control object <- mutate_fc(object = object, control_sample_id = control_sample_id, case_sample_id = case2_sample_id, mean_median = "mean") #> 110 control samples. #> 110 case samples. #! 如果有多个实验组,则需要在此更改fc列的默认名,假设将默认的“fc”改为“fc_case2_vs_control” object <- object %>% activate_mass_dataset(what = "variable_info") %>% dplyr::rename(fc_case2_vs_control = fc)
计算 p 值
同理,当有多个实验组时,也可以分别计算 p 值,并将默认的列名 “p_value” 和 “p_value_adjust” 重命名,以在 object 中容纳多组相互比较的 p 值。object <- mutate_p_value( object = object, control_sample_id = control_sample_id, case_sample_id = case_sample_id, method = "t.test",# "t.test", "wilcox.test" p_adjust_methods = "BH"# "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none" ) #> 110 control samples. #> 110 case samples. #! 如果涉及多组间的比较,则需重命名默认表头 object <- object %>% activate_mass_dataset(what = "variable_info") %>% dplyr::rename(p_value_case1_vs_control = p_value, p_value_adjust_case1_vs_control = p_value_adjust) #! 假设存在Case2组 object <- mutate_p_value( object = object, control_sample_id = control_sample_id, case_sample_id = case2_sample_id, method = "t.test", p_adjust_methods = "BH" ) #> 110 control samples. #> 110 case samples. #! 如果涉及多组间的比较,则需重命名默认表头 object <- object %>% activate_mass_dataset(what = "variable_info") %>% dplyr::rename(p_value_case2_vs_control = p_value, p_value_adjust_case2_vs_control = p_value_adjust)
绘制火山图
pdf(file="statistical_analysis/Volcano.pdf") p<- volcano_plot( object = object, fc_column_name = "fc", p_value_column_name = "p_value_adjust", fc_up_cutoff = 1, fc_down_cutoff = 1, p_value_cutoff = 0.05, add_text = TRUE, text_from = "Compound.name", point_size_scale = "p_value_adjust" ) p + scale_size_continuous(range = c(0.5, 5)) dev.off()
假设存在 Case2,绘制火山图。
pdf(file="statistical_analysis/Volcano2.pdf") p<- volcano_plot(object = object, fc_column_name = "fc_case2_vs_control",# 重命名后 p_value_column_name = "p_value_adjust_case2_vs_control",# 重命名后 fc_up_cutoff = 1, fc_down_cutoff = 1, p_value_cutoff = 0.05, add_text = TRUE, text_from = "Compound.name", point_size_scale = "p_value_adjust_case2_vs_control") # 重命名后 p + scale_size_continuous(range = c(0.5, 5)) dev.off()
保存结果
保存差异表达代谢物结果。如果有多个实验组,则将下述代码中的 “fc” 和 “p_value_adjust” 更改为重命名后的名称,并分别保存到不同的文件中。
differential_metabolites <- extract_variable_info(object = object) %>% filter(fc > 2 | fc < 0.5) %>% filter(p_value_adjust < 0.05) readr::write_csv(differential_metabolites, file = "statistical_analysis/differential_metabolites.csv")
保存结果对象
object_final <- object save(object_final, file = "statistical_analysis/object_final")
绘制热图
temp_object <- object_final %>%
activate_mass_dataset(what = "sample_info") %>%
filter(sample_id %in% c(control_sample_id, case_sample_id)) %>%
activate_mass_dataset(what = "variable_info") %>%
filter(variable_id %in% differential_metabolites$variable_id) %>%
`+`(1) %>%
log(2) %>%
scale()
library(ComplexHeatmap)
h1 <- HeatmapAnnotation(class = extract_sample_info(temp_object)$group)
pdf(file="statistical_analysis/diff_heatmap.pdf")
massstat::Heatmap(matrix = temp_object,
name = "z-score",
row_names_gp = gpar(cex = 0.4),
column_names_gp = gpar(cex = 0.2),
top_annotation = h1,
row_labels = extract_variable_info(temp_object)$Compound.name,
border = TRUE)
dev.off()
代谢通路富集
导入数据
library(tidymass)
library(tidyverse)
load("statistical_analysis/object_final")
富集
dir.create(path = "pathway_enrichment", showWarnings = FALSE)
diff_metabolites <- object_final %>%
activate_mass_dataset(what = "variable_info") %>%
filter(p_value_adjust < 0.05) %>%
extract_variable_info()
head(diff_metabolites)
加载 KEGG human pathway 数据库
可选数据库:kegg_hsa_pathway,keggMS1database,query_id_kegg,hmdb_pathway,hmdbMS1Database,query_id_hmdb。
??这些数据库的区别是什么?
data("kegg_hsa_pathway", package = "metpath")
kegg_hsa_pathway
get_pathway_class(kegg_hsa_pathway)
# 移除疾病相关通路!根据课题选择是否移除?
pathway_class = metpath::pathway_class(kegg_hsa_pathway)
head(pathway_class)
remain_idx = pathway_class %>% unlist() %>% stringr::str_detect("Disease") %>% `!`() %>% which()
pathway_database = kegg_hsa_pathway[remain_idx]
pathway_database
kegg_id <- diff_metabolites$KEGG.ID
kegg_id <- kegg_id[!is.na(kegg_id)]
result <- enrich_kegg(
query_id = kegg_id,
query_type = "compound",
id_type = "KEGG",
pathway_database = pathway_database,
p_cutoff = 0.05,
p_adjust_method = "BH",
threads = 10)
result
save(result, file = "pathway_enrichment/result")
绘制结果
# bar plot
pdf(file="pathway_enrichment/barplot.pdf")
enrich_bar_plot(object = result, x_axis = "p_value", cutoff = 0.05)
dev.off()
# scatter plot
pdf(file="pathway_enrichment/scatter_plot.pdf")
enrich_scatter_plot(object = result, y_axis = "p_value", y_axis_cutoff = 0.05)
dev.off()
# network
tiff(file="pathway_enrichment/network.tiff")
enrich_network(object = result, point_size = "p_value", p_cutoff = 0.05, only_significant_pathway = TRUE)
dev.off()
Correlation network analysis
temp_object <- object_final %>%
activate_mass_dataset(what = "sample_info") %>%
filter(sample_id %in% c(control_sample_id)) %>%
activate_mass_dataset(what = "variable_info") %>%
filter(variable_id %in% diff_metabolites$variable_id) %>%
`+`(1) %>%
log(2) %>%
scale()
library(ggraph)
library(tidygraph)
graph_data <- convert_mass_dataset2graph(
object = temp_object,
margin = "variable",
cor_method = "spearman",
p_adjust_cutoff = 0.05,
p_value_cutoff = 0.05,
pos_cor_cutoff = 0.7,
neg_cor_cutoff = -0.7
) %>%
mutate(Degree = centrality_degree(mode = 'all'))
library(extrafont)
loadfonts()
library(showtext)
showtext_auto()
plot <-
ggraph(graph = graph_data, layout = "kk") +
geom_edge_fan(aes(color = correlation,
#width = -log((p_adjust+0.1), 10)),
show.legend = TRUE)) +
geom_node_point(aes(size = Degree)) +
shadowtext::geom_shadowtext(aes(x = x, y = y,
label = Compound.name),
bg.colour = "white",
colour = "black")+
theme_graph() +
scale_edge_color_gradient2(low = "darkblue", mid = "white", high = "red")
ggsave(plot, filename = "pathway_enrichment/cor_network.pdf", width = 12, height = 7)
术语
- HILIC(亲水相互作用色谱)& RPLC(反相色谱):RPLC 色谱柱主要使用非极性固定相(C18、C8 等),而 HILIC 色谱柱则使用极性固定相(二氧化硅、酰胺等)。两种技术采用的流动相通常由乙腈和水组成,这使得两种液相色谱模式之间可以实现轻松切换。HILIC 和 RPLC 流动相的主要区别在于溶剂洗脱强度。对于 RPLC,乙腈是强洗脱溶剂。但对于 HILIC,水是强洗脱溶剂。对于 RPLC,得到的色谱图通常是极性分析物到非极性分析物,而 HILIC 则相反。这种相反的洗脱顺序使 HILIC 成为更常用的 RPLC 的一个很好的补充技术。对于极性分析物和离子化分析物尤其如此,它们在 HILIC 模式下的保留时间更长。
参考
加关注
关注公众号 “生信之巅”。
![]() | ![]() |
敬告:使用文中脚本请引用本文网址,请尊重本人的劳动成果,谢谢!