PGCGAP Documentation (Chinese Edition)
Posted: 2019-04-28 | Category: Bioinformatics

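This document was written to help Chinese-speaking users learn PGCGAP. It is updated less often than the English documentation, so reading the English documentation is strongly recommended.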


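Introduction

PGCGAP is a pipeline for prokaryotic genomics and comparative genomics analysis. It currently contains 12 modules and accepts Illumina paired-end reads, Oxford Nanopore reads, or PacBio reads as input. It performs genome assembly, gene prediction, and annotation, and supports the following comparative genomics analyses: phylogenetic trees of single-copy core proteins and of SNPs in single-copy core genes, pan-genome analysis with tree construction, whole-genome average nucleotide identity (ANI) calculation, orthologous protein family clustering with tree construction, COG annotation, SNP and indel calling, prediction of antibiotic resistance genes and virulence factors, tree construction from Multi-FASTA files, and filtering of short sequences from assembled genomes with summary statistics (genome size, GC content, and so on).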

Installation

PGCGAP can be installed on the Windows Subsystem for Linux (WSL), Linux x64, and macOS.

Step 1: Install PGCGAP via Bioconda

$conda create -n pgcgap python=3.7

$conda activate pgcgap

$conda install pgcgap

$conda deactivate

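Note: what if conda hangs at “Solving environment”? As more packages are added to conda, its index grows, and installing a new package requires checking its compatibility with the other packages one by one, which can take a long time. In the worst case the installation never finishes. There are two workarounds:

  • Method 1: use mamba (much faster) in place of conda. This assumes the pgcgap environment has already been created with conda:
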
    $conda activate pgcgap
    $conda install mamba -c conda-forge
    $mamba install pgcgap
  • Method 2: create the environment and install PGCGAP from the environment file I provide:

    # download pgcgap_latest_env.yml (use the raw file URL; the /blob/ page URL returns an HTML page)
    $wget https://github.com/liaochenlanruo/pgcgap/raw/master/conda/pgcgap_latest_env.yml

    # create a conda environment named pgcgap and install the latest version of PGCGAP
    $conda env create -f pgcgap_latest_env.yml
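
Before running `conda env create`, it can be worth confirming that the downloaded file really is an environment file, because a download from a GitHub page URL can save an HTML page under the same name. The sketch below is only an illustration: `looks_like_conda_env` is a hypothetical helper (not part of PGCGAP or conda), and it only checks that the file does not start with HTML and has a top-level `dependencies:` key.

```shell
# Hypothetical helper, not part of PGCGAP or conda.
# Succeeds when the file looks like a conda environment file:
# it must be non-empty, must not start with HTML, and must have
# a top-level "dependencies:" key.
looks_like_conda_env() {
    file=$1
    [ -s "$file" ] || return 1
    # A saved web page starts with an HTML doctype or <html> tag.
    if head -n 5 "$file" | grep -qiE '<!doctype html|<html'; then
        return 1
    fi
    grep -q '^dependencies:' "$file"
}

# Example call:
# looks_like_conda_env pgcgap_latest_env.yml && conda env create -f pgcgap_latest_env.yml
```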

Step 2: Set up the COG database (required after the first installation of PGCGAP)

$conda activate pgcgap

$pgcgap --setup-COGdb

$conda deactivate

Step 3: Upgrade PGCGAP (run when a new version is available)

$conda activate pgcgap
$conda update pgcgap
# from v1.0.28 onward, the following command can be used to upgrade
$pgcgap --check-update

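PGCGAP can also be installed as a Docker container:

$docker pull quay.io/biocontainers/pgcgap:<tag>

Note: Docker must already be installed on the computer; Docker works across platforms. Available tags are listed on the image's repository page, and installing the latest version is recommended.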
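
Once the image is pulled, PGCGAP has to be run inside the container with the working directory mounted. The following is a minimal sketch under stated assumptions: `pgcgap_docker_cmd` is a hypothetical helper that only prints a `docker run` command, the mount point `/data` is an arbitrary choice, and the tag must be replaced with a real one from the tag list.

```shell
# Hypothetical helper: prints (does not run) a docker command that mounts
# the current directory at /data inside the container and runs pgcgap there.
# $1 is the image tag; the remaining arguments are passed on to pgcgap.
pgcgap_docker_cmd() {
    tag=$1
    shift
    printf 'docker run --rm -v "%s":/data -w /data quay.io/biocontainers/pgcgap:%s pgcgap' "$PWD" "$tag"
    for arg in "$@"; do
        printf ' %s' "$arg"
    done
    printf '\n'
}

# Example call (TAG stands for a tag chosen from the tag list):
# pgcgap_docker_cmd TAG --help
```

Printing the command first makes it easy to inspect the mount and tag before running it.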

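Dependencies

PGCGAP usage

  • Show help:

    $pgcgap --help
  • Pipeline usage:

    $pgcgap [modules] [options]
  • Show the options of each module:

    $pgcgap [Assemble|Annotate|ANI|AntiRes|CoreTree|MASH|OrthoF|Pan|pCOG|VAR|STREE|ACC]
  • Show example commands for each module (the one I use most):

    $pgcgap Examples
  • Set up the COG database (required after the first installation of PGCGAP):

    $pgcgap --setup-COGdb
  • Modules:

    • [--All] Run the Assemble, Annotate, CoreTree, Pan, OrthoF, ANI, MASH and pCOG modules

    • [--Assemble] Genome assembly

    • [--Annotate] Gene prediction and annotation

    • [--CoreTree] Build the single-copy core protein tree and the core SNPs tree

    • [--Pan] Pan-genome analysis and single-copy core protein tree construction

    • [--OrthoF] Orthologous protein family clustering and single-copy ortholog protein tree construction

    • [--ANI] Compute average nucleotide identity (ANI)

    • [--MASH] Estimate genome/metagenome similarity with MinHash

    • [--pCOG] COG annotation

    • [--VAR] Variant calling and core-genome tree construction

    • [--AntiRes] Predict antibiotic resistance genes or virulence genes from genomes (contigs/scaffolds)

    • [--STREE] Build a phylogenetic tree from Multi-FASTA sequences (all sequences in one file)

    • [--ACC] Additional utilities (so far only “Assess”, which filters short sequences from a genome and assesses the genome before and after filtering)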

  • Global options (these have changed; please refer to the English version, as the Chinese version has not yet been updated):

    • [--strain_num (INT)] [Required by “--All”, “--CoreTree”, “--Pan”, “--VAR” and “--pCOG”] Number of strains to analyze, excluding the reference genome

    • [--ReadsPath (PATH)] [Required by “--All”, “--Assemble” and “--VAR”] Directory containing the sequencing reads of all strains (Default ./Reads/Illumina)

    • [--scafPath (PATH)] [Required by “--All”, “--Assess”, “--Annotate” and “--MASH”] Directory containing the contigs/scaffolds (Default “Results/Assembles/Scaf/Illumina”)

    • [--AAsPath (PATH)] [Required by “--All”, “--CoreTree”, “--OrthoF” and “--pCOG”] Directory containing the amino acid sequence files of all strains (Default “./Results/Annotations/AAs”)

    • [--reads1 (STRING)] [Required by “--All”, “--Assemble” and “--VAR”] Suffix of reads 1 (for example, if reads 1 is named “YBT-1520_L1_I050.R1.clean.fastq.gz” and “YBT-1520” is the strain name, the suffix is “.R1.clean.fastq.gz”)

    • [--reads2 (STRING)] [Required by “--All”, “--Assemble” and “--VAR”] Suffix of reads 2

    • [--Scaf_suffix (STRING)] [Required by “--All”, “--Assess”, “--Annotate”, “--MASH” and “--ANI”] Suffix of the contigs/scaffolds files (Default -8.fa)

    • [--filter_length (INT)] [Required by “--All”, “--Assemble” and “--Assess”] Sequences shorter than the ‘filter_length’ will be deleted from the assembled genomes. ( Default 200 )

    • [--codon (INT)] [Required by “--All”, “--Annotate”, “--CoreTree” and “--Pan”] Translation table (Default 11)

      • 1 Universal code
      • 2 Vertebrate mitochondrial code
      • 3 Yeast mitochondrial code
      • 4 Mold, Protozoan, and Coelenterate Mitochondrial code and Mycoplasma/Spiroplasma code
      • 5 Invertebrate mitochondrial
      • 6 Ciliate, Dasycladacean and Hexamita nuclear code
      • 9 Echinoderm and Flatworm mitochondrial code
      • 10 Euplotid nuclear code
      • 11 Bacterial, archaeal and plant plastid code ( Default )
      • 12 Alternative yeast nuclear code
      • 13 Ascidian mitochondrial code
      • 14 Alternative flatworm mitochondrial code
      • 15 Blepharisma nuclear code
      • 16 Chlorophycean mitochondrial code
      • 21 Trematode mitochondrial code
      • 22 Scenedesmus obliquus mitochondrial code
      • 23 Thraustochytrium mitochondrial code
    • [--suffix_len (INT)] [Required by “--All”, “--Assemble” and “--VAR”] (Strongly recommended) Length of the reads suffix. For example, the --suffix_len of “YBT-1520_L1_I050.R1.clean.fastq.gz” is 26 (“YBT-1520” is the strain name) (Default 0)

    • [--logs (STRING)] Name of the log file (Default Logs.txt)

    • [--threads (INT)] Number of threads to use (Default 4)


  • Module-specific options:

    • --Assemble

      • [--platform (STRING)] [Required] Sequencing platform: “illumina”, “pacbio” or “oxford” (Default illumina)

      • [--assembler (STRING)] [Required] Assembler for Illumina data: “abyss”, “spades” or “auto” ( Default abyss )

      • [--kmmer (INT)] [Required] k-mer size for Illumina assembly (Default 81)

      • [--genomeSize (FLOAT)] [Required] Estimated genome size, e.g. 3.7m or 2.8g; must be set when assembling PacBio or Oxford data (Default Unset)

      • [--short1 (STRING)] [Required] FASTQ file of first short reads in each pair. Needed by hybrid assembly ( Default Unset )

      • [--short2 (STRING)] [Required] FASTQ file of second short reads in each pair. Needed by hybrid assembly ( Default Unset )

      • [--long (STRING)] [Required] FASTQ or FASTA file of long reads. Needed by hybrid assembly ( Default Unset )

      • [--hout (STRING)] [Required] Output directory for hybrid assembly ( Default ../../Results/Assembles/Hybrid )

    • --Annotate

      • [--genus (STRING)] Genus name of the strain ( Default “NA” )

      • [--species (STRING)] Species name of the strain ( Default “NA” )


    • --CoreTree

      • [--CDsPath (PATH)] [Required] Directory containing the nucleotide sequence files of all strains; if set to “NO”, the core SNPs tree is not built ( Default “./Results/Annotations/CDs” )

      • [-c (FLOAT)] Sequence identity threshold ( Default 0.5)

      • [-n (INT)] Word_length, -n 2 for thresholds 0.4-0.5, -n 3 for thresholds 0.5-0.6, -n 4 for thresholds 0.6-0.7, -n 5 for thresholds 0.7-1.0 ( Default 2 )

      • [-G (INT)] Use global (set to 1) or local (set to 0) sequence identity, ( Default 0 )

      • [-t (INT)] Tolerance for redundance ( Default 0 )

      • [-aL (FLOAT)] Alignment coverage for the longer sequence. If set to 0.9, the alignment must cover 90% of the sequence ( Default 0.5 )

      • [-aS (FLOAT)] Alignment coverage for the shorter sequence. If set to 0.9, the alignment must cover 90% of the sequence ( Default 0.7 )

      • [-g (INT)] If set to 0, a sequence is clustered to the first cluster that meet the threshold (fast cluster). If set to 1, the program will cluster it into the most similar cluster that meet the threshold (accurate but slow mode, Default 1)

      • [-d (INT)] length of description in .clstr file. if set to 0, it takes the fasta defline and stops at first space ( Default 0 )


    • --Pan

      • [--GffPath (PATH)] [Required] Directory containing the GFF3 files of all strains ( Default “./Results/Annotations/GFF” )

      • [--identi (INT)] Minimum percentage identity for blastp ( Default 95 )


    • --OrthoF

      • [--Sprogram (STRING)] Sequence search program, options: blast, mmseqs, blast_gz, diamond ( Default blast)


    • --ANI

      • [--queryL (FILE)] [Required] The file containing paths to query genomes, one per line ( Default scaf.list )

      • [--refL (FILE)] [Required] The file containing paths to reference genomes, one per line. ( Default scaf.list )

      • [--ANIO (FILE)] The name of output file ( Default “Results/ANI/ANIs” )


    • --VAR

      • [--refgbk (FILE)] [Required] The full path and name of reference genome in GENBANK format ( recommended ), fasta format is also OK. For example: “/mnt/g/test/ref.gbk”

      • [--qualtype (STRING)] [Required] Type of quality values (solexa (CASAVA < 1.3), illumina (CASAVA 1.3 to 1.7), sanger (which is CASAVA >= 1.8)). ( Default sanger )

      • [--qual (INT)] Threshold for trimming based on average quality in a window. ( Default 20 )

      • [--length (INT)] Threshold to keep a read based on length after trimming. ( Default 20 )

      • [--mincov (INT)] The minimum number of reads covering a site to be considered ( Default 10 )

      • [--minfrac (FLOAT)] The minimum proportion of those reads which must differ from the reference ( Default 0.9 )

      • [--minqual (INT)] The minimum VCF variant call “quality” ( Default 100 )

      • [--ram (INT)] Try and keep RAM under this many GB ( Default 8 )

      • [--tree_builder (STRING)] Application to use for tree building [raxml|fasttree|hybrid] ( Default fasttree)

      • [--iterations (INT)] Maximum No. of iterations for gubbins ( Default 5 )


  • --AntiRes

    • [--db (STRING)] [Required] Database to use, options: argannot, card, ecoh, ecoli_vf, ncbi, plasmidfinder, resfinder and vfdb. ( Default ncbi )

    • [--identity (INT)] [Required] Minimum %identity to keep the result, should be a number between 1 to 100. ( Default 75 )

    • [--coverage (INT)] [Required] Minimum %coverage to keep the result, should be a number between 0 to 100. ( Default 50 )

  • --STREE

    • [--seqfile (STRING)] [Required] Path of the sequence file for analysis.

    • [--seqtype (STRING)] [Required] Type of sequence (p, d, c for Protein, DNA, Codons, respectively). ( Default p )

    • [--bsnum (INT)] [Required] Times for bootstrap. ( Default 1000 )

  • --ACC

    • [--Assess (STRING)] Filter short sequences in the genome and assess the status of the genome (run “pgcgap ACC” for the detailed options)
  • Paths of external programs

    Not needed if the programs are already on the PATH environment variable. Users can check for the essential programs with the “--check-external-programs” option.


  • [--abricate-bin (PATH)] Path to abricate binary file. Default tries if abricate is in PATH;

  • [--abyss-bin (PATH)] Path to abyss binary file. Default tries if abyss is in PATH;

  • [--canu-bin (PATH)] Path to canu binary file. Default tries if canu is in PATH;

  • [--cd-hit-bin (PATH)] Path to cd-hit binary file. Default tries if cd-hit is in PATH;

  • [--fastANI-bin (PATH)] Path to the fastANI binary file. Default tries if fastANI is in PATH;

  • [--Gblocks-bin (PATH)] Path to the Gblocks binary file. Default tries if Gblocks is in PATH;

  • [--gubbins-bin (PATH)] Path to the run_gubbins.py binary file. Default tries if run_gubbins.py is in PATH;

  • [--iqtree-bin (PATH)] Path to the iqtree binary file. Default tries if iqtree is in PATH;

  • [--mafft-bin (PATH)] Path to mafft binary file. Default tries if mafft is in PATH;

  • [--mash-bin (PATH)] Path to the mash binary file. Default tries if mash is in PATH.

  • [--modeltest-ng-bin (PATH)] Path to the modeltest-ng binary file. Default tries if modeltest-ng is in PATH.

  • [--muscle-bin (PATH)] Path to the muscle binary file. Default tries if muscle is in PATH.

  • [--orthofinder-bin (PATH)] Path to the orthofinder binary file. Default tries if orthofinder is in PATH;

  • [--pal2nal-bin (PATH)] Path to the pal2nal.pl binary file. Default tries if pal2nal.pl is in PATH;

  • [--prodigal-bin (PATH)] Path to prodigal binary file. Default tries if prodigal is in PATH;

  • [--prokka-bin (PATH)] Path to prokka binary file. Default tries if prokka is in PATH;

  • [--raxml-ng-bin (PATH)] Path to the raxml-ng binary file. Default tries if raxml-ng is in PATH;

  • [--roary-bin (PATH)] Path to the roary binary file. Default tries if roary is in PATH;

  • [--sickle-bin (PATH)] Path to the sickle-trim binary file. Default tries if sickle is in PATH.

  • [--snippy-bin (PATH)] Path to the snippy binary file. Default tries if snippy is in PATH;

  • [--snp-sites-bin (PATH)] Path to the snp-sites binary file. Default tries if snp-sites is in PATH;

  • [--unicycler-bin (PATH)] Path to the unicycler binary file. Default tries if unicycler is in PATH;


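  • Set up the COG database

    • [--setup-COGdb] Required after the first installation of PGCGAP

  • Check that the external programs are installed (strongly recommended right after installing PGCGAP):
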
    $pgcgap --check-external-programs
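
The "Default tries if ... is in PATH" behaviour described for the `--*-bin` options amounts to a PATH lookup. As a rough illustration only (this is not how `--check-external-programs` is implemented, and `missing_tools` is a hypothetical helper), the same kind of lookup can be done by hand with `command -v`:

```shell
# Hypothetical helper: print each named program that is NOT found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Example call with a few binary names taken from the option list above:
# missing_tools prokka roary mafft iqtree mash snippy
```

An empty result means every listed program was found; any name printed would need either installing or an explicit `--*-bin` path.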

Examples

  • Example 1: Run all modules on a dataset of Illumina paired-end reads from 6 Escherichia coli strains.

    Note: for flexibility, the “VAR” module has to be added explicitly.

    $pgcgap --All --platform illumina --filter_length 200 --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --kmmer 81 --genus Escherichia --species "Escherichia coli" --codon 11 --strain_num 6 --threads 4 --VAR --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --qualtype sanger
  • Example 2: Genome assembly.

    • Assembling Illumina paired-end reads

      In this dataset the reads are named “strain_1.fastq.gz” and “strain_2.fastq.gz”. The suffix “_1.fastq.gz” is 11 characters long, so “--suffix_len” is set to 11.

    $pgcgap --Assemble --platform illumina --assembler abyss --filter_length 200 --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --kmmer 81 --threads 4 --suffix_len 11

    $pgcgap --Assemble --platform illumina --assembler spades --filter_length 200 --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --threads 4 --suffix_len 11

    $pgcgap --Assemble --platform illumina --assembler auto --filter_length 200 --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --kmmer 81 --threads 4 --suffix_len 11
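    • Assembling Oxford reads

      Oxford Nanopore sequencing produces a single reads file, so only “--reads1” needs to be set, here to “.fasta”. “--genomeSize” is the estimated genome size; the genome sizes of the same species in the NCBI database are a useful reference. Here it is set to “4.8m”. The reads file suffix “.fasta” is 6 characters long, so “--suffix_len” is set to 6.
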
    $pgcgap --Assemble --platform oxford --filter_length 200 --ReadsPath Reads/Oxford --reads1 .fasta --genomeSize 4.8m --threads 4 --suffix_len 6
    • Assembling PacBio reads

      PacBio likewise produces a single file, “pacbio.fastq”, and the options are set as for Oxford. The file suffix “.fastq” is 6 characters long, so “--suffix_len” is set to 6.

    $pgcgap --Assemble --platform pacbio --filter_length 200 --ReadsPath Reads/PacBio --reads1 .fastq --genomeSize 4.8m --threads 4 --suffix_len 6
  • Example 3: Gene prediction and annotation.

    $pgcgap --Annotate --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --genus Escherichia --species "Escherichia coli" --codon 11 --threads 4
  • Example 4: Build the single-copy core protein tree and the core SNPs tree.

    # Construct phylogenetic tree with FastTree (Quick without best fit model testing)
    $pgcgap --CoreTree --CDsPath Results/Annotations/CDs --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --fasttree

    # Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap)
    $pgcgap --CoreTree --CDsPath Results/Annotations/CDs --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --bsnum 500

    # Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
    $pgcgap --CoreTree --CDsPath Results/Annotations/CDs --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --fastboot 1000
  • Example 5: Build the single-copy core protein tree only.

    # Construct phylogenetic tree with FastTree (Quick without best fit model testing)
    $pgcgap --CoreTree --CDsPath NO --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --fasttree

    # Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap)
    $pgcgap --CoreTree --CDsPath NO --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --bsnum 500

    # Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
    $pgcgap --CoreTree --CDsPath NO --AAsPath Results/Annotations/AAs --codon 11 --strain_num 6 --threads 4 --fastboot 1000
  • Example 6: Pan-genome analysis and single-copy core protein tree construction.

    # Construct phylogenetic tree with FastTree (Quick without best fit model testing)
    $pgcgap --Pan --codon 11 --identi 95 --strain_num 6 --threads 4 --GffPath Results/Annotations/GFF --PanTree --fasttree

    # Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap)
    $pgcgap --Pan --codon 11 --identi 95 --strain_num 6 --threads 4 --GffPath Results/Annotations/GFF --PanTree --bsnum 500

    # Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
    $pgcgap --Pan --codon 11 --identi 95 --strain_num 6 --threads 4 --GffPath Results/Annotations/GFF --PanTree --fastboot 1000
  • Example 7: Orthologous protein family clustering and tree construction.

    # Construct phylogenetic tree with FastTree (Quick without best fit model testing)
    $pgcgap --OrthoF --threads 4 --AAsPath Results/Annotations/AAs --fasttree

    # Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap)
    $pgcgap --OrthoF --threads 4 --AAsPath Results/Annotations/AAs --bsnum 500

    # Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
    $pgcgap --OrthoF --threads 4 --AAsPath Results/Annotations/AAs --fastboot 1000
  • Example 8: Compute the average nucleotide identity (ANI) between every pair of genomes.

    $pgcgap --ANI --threads 4 --queryL scaf.list --refL scaf.list --ANIO Results/ANI/ANIs --Scaf_suffix .fa
  • Example 9: Estimate genome and metagenome similarity with MinHash.

    $pgcgap --MASH --scafPath <PATH> --Scaf_suffix <STRING>
  • Example 10: COG annotation of all genomes.

    $pgcgap --pCOG --threads 4 --strain_num 6 --AAsPath Results/Annotations/AAs
  • Example 11: Call and annotate variants (SNPs, indels) and build a reference-based SNP tree.

    # Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap)
    $pgcgap --VAR --threads 4 --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --strain_num 6 --qualtype sanger --bsnum 500

    # Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
    $pgcgap --VAR --threads 4 --refgbk /mnt/h/PGCGAP_Examples/Reads/MG1655.gbff --ReadsPath Reads/Illumina --reads1 _1.fastq.gz --reads2 _2.fastq.gz --suffix_len 11 --strain_num 6 --qualtype sanger --fastboot 1000
  • Example 12: Screen genomes for antibiotic resistance genes or virulence genes.

    $pgcgap --AntiRes --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --threads 6 --db ncbi --identity 75 --coverage 50
  • Example 13: Filter short sequences in the genome and assess the status of the genome

    $pgcgap --ACC --Assess --scafPath Results/Assembles/Scaf/Illumina --Scaf_suffix -8.fa --filter_length 200
  • Example 14: Construct a phylogenetic tree based on multiple sequences in one file

    # Construct phylogenetic tree with IQ-TREE (Very slow with best fit model testing, traditional bootstrap)
    $pgcgap --STREE --seqfile proteins.fas --seqtype p --bsnum 500 --threads 4

    # Construct phylogenetic tree with IQ-TREE (Slow with best fit model testing, ultrafast bootstrap)
    $pgcgap --STREE --seqfile proteins.fas --seqtype p --fastboot 1000 --threads 4

Preparing input files

Working directory

  • The directory in which PGCGAP is run.

Assemble

  • Put all paired-end reads, PacBio reads, or Oxford Nanopore reads in one directory (Default: ./Reads/Illumina/).
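
The reads in this directory are tied to strain names through the suffix options, and the value of --suffix_len is easy to miscount by hand. Going by the option description (for “YBT-1520_L1_I050.R1.clean.fastq.gz” with strain name “YBT-1520” the value is 26), it is the file-name length minus the strain-name length. The sketch below assumes exactly that reading; `suffix_len` and `strain_from_len` are hypothetical helpers, not PGCGAP commands.

```shell
# Hypothetical helpers, not part of PGCGAP.

# Print the file-name length minus the strain-name length.
# $1 = reads file (a path is allowed), $2 = strain name.
suffix_len() {
    file=$(basename "$1")
    strain=$2
    printf '%s\n' "$(( ${#file} - ${#strain} ))"
}

# Print what remains of the file name after dropping the last N characters.
# $1 = reads file, $2 = N (the suffix length).
strain_from_len() {
    file=$(basename "$1")
    keep=$(( ${#file} - $2 ))
    printf '%s\n' "$file" | cut -c "1-$keep"
}

# Example calls:
# suffix_len Reads/Illumina/strain_1.fastq.gz strain
# strain_from_len Reads/Illumina/strain_1.fastq.gz 11
```

Running `strain_from_len` over every file in the reads directory is a quick way to check that a chosen value yields the intended strain names.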

Annotate

  • Put the genome files (complete or draft) in one directory (Default: Results/Assembles/Scaf/Illumina).

ANI

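  • QUERY_LIST and REFERENCE_LIST files, each listing the absolute paths of the genomes to compare, one genome per line (default: “scaf.list” in the working directory). If the “--Assemble” module was run first, this file is generated automatically.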
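
When the genomes did not come from the Assemble module, the list file has to be written by hand. A minimal sketch, assuming only what is stated above (absolute paths, one genome per line); `make_genome_list` is a hypothetical helper and the directory and suffix in the example call are placeholders:

```shell
# Hypothetical helper: write the absolute paths of all files in a directory
# that end with the given suffix to a list file, one per line, sorted.
# $1 = genome directory, $2 = file suffix, $3 = output list file.
make_genome_list() {
    dir=$1
    suffix=$2
    out=$3
    # Resolve the directory to an absolute path first.
    abs_dir=$(cd "$dir" && pwd) || return 1
    for f in "$abs_dir"/*"$suffix"; do
        [ -f "$f" ] && printf '%s\n' "$f"
    done | sort > "$out"
}

# Example call:
# make_genome_list Results/Assembles/Scaf/Illumina .fa scaf.list
```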

CoreTree

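  • Put the amino acid files (suffix must be “.faa”) and nucleotide files (suffix must be “.ffn”) of all strains in two separate directories (default: “./Results/Annotations/AAs/” and “./Results/Annotations/CDs/”). The “.faa” and “.ffn” files of a strain must share the same prefix, and protein IDs and gene IDs must start with the strain name. “Prokka” is the recommended way to produce these files; if the “--Annotate” module has already been run, they are generated automatically. If “--CDsPath” is set to “NO”, the nucleotide files are not needed, but no core SNPs tree is produced.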
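
The shared-prefix requirement can be checked before starting a long run. The sketch below is a hypothetical helper (`unmatched_faa` is not a PGCGAP command) that lists every “.faa” file without a same-named “.ffn” file; it does not check the sequence IDs inside the files.

```shell
# Hypothetical helper: print the prefix of every .faa file that has no
# matching .ffn file. $1 = amino acid directory, $2 = nucleotide directory.
unmatched_faa() {
    aa_dir=$1
    cds_dir=$2
    for faa in "$aa_dir"/*.faa; do
        [ -f "$faa" ] || continue
        prefix=$(basename "$faa" .faa)
        [ -f "$cds_dir/$prefix.ffn" ] || printf '%s\n' "$prefix"
    done
}

# Example call:
# unmatched_faa Results/Annotations/AAs Results/Annotations/CDs
```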

MASH

  • Put the genome files (complete or draft) in one directory (Default: Results/Assembles/Scaf/Illumina).

OrthoF

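  • Put the FASTA amino acid files of all strains (one file per strain) in one directory (default: “./Results/Annotations/AAs/”). If the “--Annotate” module was run first, these files are generated automatically.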

Pan

  • Directory containing the GFF3 files (with “.gff” as the suffix) of all strains (default: ./Results/Annotations/GFF/);
  • If the “--Annotate” module was run first, these files are generated automatically.

pCOG

  • Directory containing the FASTA amino acid sequence files (with “.faa” as the suffix) of all strains (default: ./Results/Annotations/AAs/). If the “--Annotate” module was run first, these files are generated automatically.

VAR

  • Directory containing the paired-end reads of all strains (default: ./Reads/Over/ under the working directory).
  • Absolute path of the reference genome in FASTA or GenBank format (required).

AntiRes

  • Directory containing the genomes (complete or draft) (Default: Results/Assembles/Scaf/Illumina under the working directory).

STREE

  • Multiple-FASTA sequences in a file, can be Protein, DNA and Codons.

Interpreting the output files

Assemble

  • Results/Assembles/Illumina/

    Directories contain Illumina assembly files and information of each strain.

  • Results/Assembles/PacBio/

    Directories contain PacBio assembly files and information of each strain.

  • Results/Assembles/Oxford/

    Directories contain ONT assembly files and information of each strain.

  • Results/Assembles/Hybrid/

    Directory contains hybrid assembly files of the short reads and long reads of the same strain.

  • Results/Assembles/Scaf/Illumina

    Directory contains Illumina contigs/scaffolds of all strains. “*.filtered.fas” is the genome after excluding short sequences. “*.prefilter.stats” describes the stats of the genome before filtering, and “*.filtered.stats” describes the stats of the genome after filtering.

  • Results/Assembles/Scaf/Oxford

    Directory contains Oxford nanopore contigs/scaffolds of all strains.

  • Results/Assembles/Scaf/PacBio

    Directory contains PacBio contigs/scaffolds of all strains.
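
To make the filtering and the assembly statistics concrete, here is a minimal sketch of both operations on a FASTA file. It is not PGCGAP's own code and does not reproduce the “*.stats” file format; `fasta_lengths`, `n50`, and `filter_short` are hypothetical helpers written with awk and sort.

```shell
# Hypothetical helpers, not PGCGAP's own code.

# Print one sequence length per line for a (possibly multi-line) FASTA file.
fasta_lengths() {
    awk '/^>/ { if (seen) print len; len = 0; seen = 1; next }
         { len += length($0) }
         END { if (seen) print len }' "$1"
}

# Print the N50: the length L of the sequence at which the running total,
# taken from the longest sequence downwards, first reaches half of the
# total assembly length.
n50() {
    fasta_lengths "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            run = 0
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run * 2 >= total) { print lens[i]; exit }
            }
        }'
}

# Write to stdout only the sequences that are at least min_len long.
# $1 = FASTA file, $2 = min_len. Each kept sequence is printed on one line.
filter_short() {
    awk -v min="$2" '
        function emit() { if (header != "" && length(seq) >= min) { print header; print seq } }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }' "$1"
}

# Example calls (genome.fa is a placeholder file name):
# filter_short genome.fa 200 > genome.filtered.fa
# n50 genome.filtered.fa
```

Comparing `n50` before and after `filter_short` gives a rough picture of what the prefilter and filtered statistics describe.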

Annotate

  • Results/Annotations/*_annotation

    Directories contain the annotation files of each strain.

  • Results/Annotations/AAs

    Directory contains the amino acid sequences of all strains.

  • Results/Annotations/CDs

    Directory contains the nucleotide sequences of all strains.

  • Results/Annotations/GFF

    Directory contains the master annotation of all strains in GFF3 format.

ANI

  • Results/ANI/ANIs

    The file contains the comparison results for each genome pair. It has five columns: query genome, reference genome, ANI value, count of bidirectional fragment mappings, and total query fragments.

  • Results/ANI/ANIs.matrix

    File with identity values arranged in a phylip-formatted lower triangular matrix.

  • Results/ANI/ANIs.heatmap

    An ANI matrix of all strains.

  • Results/ANI/ANI_matrix.pdf

    The heatmap plot of “ANIs.heatmap”.
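
The five-column “ANIs” table is easy to query from the command line. The sketch below assumes the columns are tab-separated and in the order listed above; `ani_pairs_above` is a hypothetical helper, and the default cutoff of 95 is an assumption chosen for illustration, not a value taken from PGCGAP.

```shell
# Hypothetical helper: from the five-column ANIs table, print query,
# reference and ANI for pairs of different genomes at or above a cutoff.
# $1 = ANIs file, $2 = cutoff (assumed default: 95).
ani_pairs_above() {
    awk -F '\t' -v cutoff="${2:-95}" '
        $1 != $2 && $3 + 0 >= cutoff + 0 { print $1, $2, $3 }' OFS='\t' "$1"
}

# Example call:
# ani_pairs_above Results/ANI/ANIs 95
```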

MASH

  • Results/MASH/MASH

    The pairwise distances between genome pairs. The columns are Reference-ID, Query-ID, Mash-distance, P-value, and Matching-hashes.

  • Results/MASH/MASH2

    The pairwise similarities between genome pairs. The columns are Reference-ID, Query-ID, similarity, P-value, and Matching-hashes.

  • Results/MASH/MASH.heatmap

    A similarity matrix of all genomes.

  • Results/MASH/MASH_matrix.pdf

    A heat map plot of “MASH.heatmap”.
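
As a small illustration of reading the “MASH” distance table, the sketch below reports, for each query genome, the closest other genome. It assumes tab-separated columns in the order given above; `closest_by_mash` is a hypothetical helper, not part of PGCGAP.

```shell
# Hypothetical helper: for each query genome in the five-column MASH table,
# print the query, its closest other reference, and the Mash distance.
closest_by_mash() {
    awk -F '\t' '
        $1 != $2 && (!($2 in best) || $3 + 0 < best[$2] + 0) {
            best[$2] = $3   # smallest distance seen so far for this query
            ref[$2] = $1    # reference genome that gave that distance
        }
        END { for (q in best) print q, ref[q], best[q] }' OFS='\t' "$1" | sort
}

# Example call:
# closest_by_mash Results/MASH/MASH
```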

CoreTree

  • Results/CoreTrees/ALL.core.protein.fasta

    Concatenated and aligned sequences file of single-copy core proteins.

  • Results/CoreTrees/ALL.core.protein.nwk

    The phylogenetic tree file of single-copy core proteins for all strains constructed by FastTree.

  • Results/CoreTrees/ALL.core.protein.fasta.gb.treefile

    The phylogenetic tree file of single-copy core proteins for all strains constructed by IQ-TREE.

  • Results/CoreTrees/faa2ffn/ALL.core.nucl.fasta

    Concatenated and aligned sequences file of single-copy core genes.

  • Results/CoreTrees/ALL.core.snp.fasta

    Core SNPs of single-copy core genes in fasta format.

  • Results/CoreTrees/ALL.core.snp.nwk

    The phylogenetic tree file of SNPs of single-copy core genes for all strains constructed by FastTree.

  • Results/CoreTrees/ALL.core.snp.fasta.gb.treefile

    The phylogenetic tree file of SNPs of single-copy core genes for all strains constructed by IQ-TREE

  • Results/CoreTrees/“Other_files”

    Intermediate directories and files.

OrthoF

  • Results/OrthoFinder/Results_orthoF

    Same as OrthoFinder outputs.

  • Results/OrthoFinder/Results_orthoF/Single_Copy_Orthologue_Tree/

    Directory contains Phylogenetic tree files based on Single Copy Orthologue sequences.

  • Results/OrthoFinder/Results_orthoF/Single_Copy_Orthologue_Tree/Single.Copy.Orthologue.nwk

    Phylogenetic tree constructed by FastTree.

  • Results/OrthoFinder/Results_orthoF/Single_Copy_Orthologue_Tree/Single.Copy.Orthologue.fasta.gb.treefile

    Phylogenetic tree constructed by IQ-TREE.

Pan

  • Results/PanGenome/Pangenome_Pie.pdf

    A 3D pie chart and a fan chart of the breakdown of genes and the number of isolates they are present in.

  • Results/PanGenome/pangenome_frequency.pdf

    A graph with the frequency of genes versus the number of genomes.

  • Results/PanGenome/Pangenome_matrix.pdf

    A figure showing the tree compared to a matrix with the presence and absence of core and accessory genes.

  • Results/PanGenome/Core/Roary.core.protein.fasta

    Alignments of single-copy core proteins called by roary software.

  • Results/PanGenome/Core/Roary.core.protein.nwk

    A phylogenetic tree of Roary.core.protein.fasta constructed by FastTree.

  • Results/PanGenome/Core/Roary.core.protein.fasta.gb.treefile

    A phylogenetic tree of Roary.core.protein.fasta constructed by IQ-TREE.

  • Results/PanGenome/Other_files

    see roary outputs.

pCOG

  • *.COG.xml, *.2gi.table, *.2id.table, *.2Sid.table

    Intermediate files.

  • *.2Scog.table

    The super COG table of each strain.

  • *.2Scog.table.pdf

    A plot of super COG table in pdf format.

  • All_flags_relative_abundances.table
    A table containing the relative abundance of each flag for all strains.

VAR

  • Results/Variants/directory-named-in-strains

    Directories containing substitutions (snps) and insertions/deletions (indels) of each strain. See Snippy outputs for detail.

  • Results/Variants/Core

    The directory containing SNP phylogeny files.

    • core.aln : A core SNP alignment includes only SNP sites.
    • core.full.aln : A whole genome SNP alignment (includes invariant sites).
    • core.*.treefile : Phylogenetic tree of the core SNP alignment based on the best-fit model of evolution selected using IQ-TREE (ignoring possible recombination).
    • gubbins.core.full.node_labelled.final_tree.tre : Phylogenetic tree of the whole genome SNP alignment constructed with gubbins (get rid of recombination).

AntiRes

  • Results/AntiRes/*.tab : Screening results of each strain.
  • Results/AntiRes/summary.txt : A matrix of gene presence/absence for all strains.

STREE

  • Results/STREE/*.aln : Aligned sequences.
  • Results/STREE/*.aln.gb : Conserved blocks of the sequences.
  • Results/STREE/*.aln.gb.treefile : The final phylogenetic tree.

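License

PGCGAP may not be used commercially; for all other uses it is free (licensed under GPLv3).

Feedback and questions

If you have questions, open an issue on the issues page or email liaochenlanruo@webmail.hzau.edu.cn

Citation

FAQ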

Q1 The VAR function failed to produce annotated VCFs and Core results

Check the log file named “strain_name.log” under the Results/Variants// directory. If it contains a sentence like “WARNING: All frames are zero! This seems rather odd, please check that ‘frame’ information in your ‘genes’ file is accurate.”, this is a snpEff error. Installing JDK 8 solves the problem.

$conda install java-jdk=8.0.112

Click here for more solutions.

Q2 Could not determine version of minced please install version 2 or higher

When running prokka in the Assemble function, this error can occur. The error message is as follows:

Error: A JNI error has occurred, please check your installation and try again
Exception in thread "main" java.lang.UnsupportedClassVersionError: minced has been compiled by a more recent version of the Java Runtime (class file version 55.0), this version of the Java Runtime only recognizes class file versions up to 52.0
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:495)
[01:09:40] Could not determine version of minced - please install version 2.0 or higher

Downgrading minced to version 0.3 solves this problem.

$conda install minced=0.3

Click here for detailed information.
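
The error message in Q2 reports class file versions rather than Java releases. For Java 5 and later the class file major version is the release number plus 44, so 55.0 corresponds to Java 11 and 52.0 to Java 8: minced was built for a newer Java than the one in the environment. The one-line sketch below only does that arithmetic; `class_version_to_java` is a hypothetical helper.

```shell
# Hypothetical helper: map a Java class file version (e.g. 55.0) to the
# Java release number (major version minus 44, valid for Java 5 and later).
class_version_to_java() {
    major=${1%%.*}
    printf '%s\n' "$((major - 44))"
}

# Example calls:
# class_version_to_java 55.0
# class_version_to_java 52.0
```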

Q3 dyld: Library not loaded: @rpath/libcrypto.1.0.0.dylib

This error may occur when running the “VAR” function on macOS. It is an openssl error and can be solved as follows:

#First, install brew if it is not installed yet
$ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

#Install openssl with brew
$brew install openssl

#Create the soft link for libraries
$ln -s /usr/local/opt/openssl/lib/libcrypto.1.0.0.dylib /usr/local/lib/

$ln -s /usr/local/opt/openssl/lib/libssl.1.0.0.dylib /usr/local/lib/

Click here for more information.

Q4 Use of uninitialized value in require at Encode.pm line 61

This warning may happen when running function “Pan”. It is a warning of Roary software.
The content of line 61 is “require Encode::ConfigLocal;”. Users can ignore the warning.
Click here for details.

Updates

  • V1.0.3

    • Updated ANI function.
  • V1.0.4

    • Added parallel processing for function “COG”.
    • Optimized drawing of ANI heat map.
  • V1.0.5

    • Bug repair for input of gubbins.
  • V1.0.6

    • Modified CoreTree to split protein and SNPs tree constructing.
  • V1.0.7

    • Split Assemble and Annotate into two functions.
    • Added third generation genome assembly function.
    • Changed the default parameters of the CoreTree function (aS 0.8 to 0.7 and aL 0.8 to 0.5).
    • Changed the name of function “COG” to “pCOG”.
    • Fixed the sorting bug for ANI heat map.
  • V1.0.8

    • Add the “MASH” function to compute genome distance and similarity using MinHash.
  • V1.0.9

    • The function of constructing a single-copy core protein phylogenetic tree was added to “Pan”.
    • Fixed a bug of plot_3Dpie.R, Optimized image display, and a fan chart has been added.
    • Fixed a bug for plotting the ANI matrix.
  • V1.0.10

    • Added the “AntiRes” function for screening contigs for antimicrobial resistance and virulence genes.
  • V1.0.11

    • Users can now choose “abyss” or “spades” for Illumina reads assembly.
    • New support for hybrid assembly of paired-end short reads and long reads.
    • Added selection of the best-fit model of evolution for DNA and protein alignments before constructing a phylogenetic tree.
    • Optimized display of help information. Users can check the parameters of each module with the command “pgcgap [Assemble|Annotate|ANI|AntiRes|CoreTree|MASH|OrthoF|Pan|pCOG|VAR]”, and can look up the examples of each module with the command “pgcgap Examples”.
  • V1.0.12

    • Added automatic mode for illumina genome assembly. First, PGCGAP calls “ABySS” for genome assembly. When the assembled N50 is less than 50,000, it automatically calls “SPAdes” to try multiple parameters for assembly.
    • Added ability to filter short sequences of assembled genomes.
    • Added function of genome assembly status assessment.
    • Modified the drawing script of ANI and MASH modules so that it can automatically adjust the font size according to the number of samples.
  • V1.0.13

    • Fixed the “running error” bug of function “Assess” in module “ACC”.
    • Added module “STREE” for constructing a phylogenetic tree based on multiple sequences in one file.
  • V1.0.14

    • The relative_abundances of flags among strains is no longer called when the number of strains is less than two.
  • V1.0.15

    • When the number of threads set by the user exceeds the number of threads available on the system, PGCGAP automatically adjusts the number of threads to avoid a crash.
    • Added a FASTQ preprocessing step before Illumina genome assembly: adapter trimming, polyG tail trimming of Illumina NextSeq/NovaSeq reads, quality filtering (Q value filtering, N base filtering, sliding window filtering), and length filtering.
  • V1.0.16

    • Reduced the number of Racon polishing rounds for better speed when performing genome assembly.
    • The existing output folder is now forcibly overwritten when running the “Annotate” analysis, to avoid a crash.
  • V1.0.17

    • Fixed a bug where the program could not return to the working directory after genome annotation.
    • Added scripts to check whether any single-copy core proteins were found when running module “CoreTree”.
    • Modified the help message.
  • V1.0.18

    • Updated the download link for the COG database.
    • Users can choose the number of threads used for running module “STREE”.
  • V1.0.19

    • Downloading the COG database can now resume from the point of interruption.
    • Fixed a bug where multi-level directories failed to be created.
  • V1.0.20

    • Fixed a minor bug (path error) in module “VAR”.
    • Fixed a minor bug in module “CoreTree” so that special characters in sequence IDs no longer interfere with the program.
  • V1.0.21

    • Changed the default search program of module “OrthoF” from “blast” to “diamond”.
    • Fixed a bug in module “pCOG” so that the correct figure is output.
  • V1.0.22

    • The drawing function of module “ANI” and module “MASH” has been enhanced, including automatic adjustment of font size and legend according to the size of the picture.
    • Fixed a bug in module “ANI”: in previous versions, no heatmap was drawn when the ANI matrix contained “NA”.
    • When the ANI value or genome similarity is greater than 95%, an asterisk (*) will be drawn in the corresponding cell of the heatmap.
  • V1.0.23

    • The “--Assess” function of module “ACC” was enhanced to (1) generate a summary file containing the status of all genomes (before and after the short sequence filtering), and (2) automatically move the low-quality genomes (that is, genomes with an N50 length of less than 50 k) to one directory and the others to another directory.
  • V1.0.24

    • Fixed a minor bug in module “Pan” so that special characters (>) in sequence IDs no longer interfere with the program.
  • V1.0.25

    • Gblocks is now used to eliminate poorly aligned positions and divergent regions of DNA or protein sequence alignments in modules “CoreTree” and “Pan”.
    • The parameter “--identi” was added to module “Pan” to allow users to set the minimum percentage identity for blastp.
  • V1.0.26

    • The font size now adjusts to the number of genomes and the string length of the genome names when plotting the heat maps of modules “ANI” and “MASH”.
    • Two heat maps are now provided when performing the “ANI” and “MASH” analyses: one with stars (a star means the similarity of the two genomes is greater than 95%) and one without.
  • V1.0.27

    • The Amino Acid files are no longer needed when performing the Pan-genome analysis with module Pan.
  • V1.0.28

    • Users can check for and install the latest version of PGCGAP with the command “pgcgap --check-update”.
    • Updated module Assemble to allow polishing after the assembly of PacBio and ONT data.
    • Updated module pCOG to work with the latest COG 2020 database.
    • Optimized the drawing and color scheme of the module pCOG.
    • Fixed the parameter “CoreTree” in the module Pan to avoid program termination caused by the “>” in non-sequence lines.
  • V1.0.29

    • Added a function to module OrthoF: a phylogenetic tree can now be constructed automatically from the Single Copy Orthologue Sequences called by module OrthoF.
    • Fixed the “permission denied” error when moving directories on the WSL platform.
  • V1.0.30

    • Replaced Gblocks with trimAL to trim MSA (modules CoreTree, Pan, STREE, and OrthoF).
    • Replaced Modeltest-ng and Raxml-ng with IQ-TREE (modules CoreTree, Pan, OrthoF, and VAR).
    • Added the option of using FastTree to build phylogenetic trees (modules CoreTree, Pan, and OrthoF).
  • V1.0.31

    • The default number of bootstrap replicates was set to 500.
    • Added a method for constructing phylogenetic trees with the ultrafast bootstrap of IQ-TREE.
    • Prevented the log from being written to the tree file generated by FastTree.
  • V1.0.32

    • A more colorful version, try “pgcgap Examples” to have a look.
    • Updated module AntiRes: the parameter --db has been modified to add the choices “all” and “megares”.
    • Minor optimization of module VAR.
    • Replaced conda with mamba to update PGCGAP more quickly.
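
Two of the behaviors recorded in the changelog, the N50 cutoff of 50,000 (V1.0.12 and V1.0.23) and the automatic thread adjustment (V1.0.15), can be illustrated with a short Python sketch. This is not PGCGAP's own code: the function names `n50`, `is_low_quality`, and `clamp_threads` are hypothetical, the input is assumed to be a plain list of contig lengths in bp, and the N50 definition used here (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembly size) is the common textbook one, which may differ in detail from what the pipeline computes.

```python
import os


def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest and stop once the
    # running total covers at least half of the assembly.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def is_low_quality(contig_lengths, threshold=50_000):
    """Flag an assembly whose N50 is below the threshold (assumed 50,000 bp)."""
    return n50(contig_lengths) < threshold


def clamp_threads(requested, available=None):
    """Limit a requested thread count to what the system offers (at least 1)."""
    if available is None:
        # os.cpu_count() may return None, so fall back to a single thread.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))
```

For example, `is_low_quality([30000, 30000, 30000])` is true under these assumptions because the N50 of that assembly is 30,000, and `clamp_threads(64, available=8)` gives 8.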