SAMBA: Standardized and Automated MetaBarcoding Analyses workflow



Description of the workflow


The SAMBA workflow, developed by SeBiMER (Ifremer's Bioinformatics Core Facility), is an open-source modular workflow for processing eDNA metabarcoding data. SAMBA is developed using the Nextflow workflow manager (Di Tommaso et al., 2017) and is built around three main parts: data integrity checking, bioinformatics processes and statistical analyses. The data integrity check verifies the integrity of the raw data. The bioinformatics processes rely mainly on the next-generation microbiome bioinformatics platform QIIME 2 (Bolyen et al., 2019; version 2020.2) and on grouping sequences into ASVs (Amplicon Sequence Variants) with DADA2 (Callahan et al., 2016). SAMBA also performs extensive analyses of alpha- and beta-diversity using homemade R scripts (R Core Team, 2020). It offers an alternative to chaining a suite of command-line bioinformatics tools by hand, while still providing access to state-of-the-art methods and tools in the field. The SAMBA source code, documentation and installation instructions are freely available on the SeBiMER GitHub.


To carry out the complete analysis, SAMBA uses a range of state-of-the-art software and methods, listed below:

Tools and software      Version
SAMBA                   v3.0.1

Required
  Nextflow              20.04.1

Included in SAMBA
  QIIME 2               2019.10.0
  R                     3.6.1
  DESeq2                1.26.0
  metagenomeSeq         1.28.0
  microbiome            1.8.0
  phyloseq              1.30.0
  vegan                 2.5.6
  UpSetR                1.4.0

Bioinformatic process

Data integrity [optional]


This first step analyzes the integrity of your data in order to identify potential problems related to the sequencing process. It checks:
  • that each read is correctly associated with the proper sample (sequence barcode verification)
  • that all reads come from a single sequencer
  • the efficiency of forward and reverse PCR amplification
This step is carried out by a homemade script.
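The homemade script itself is not reproduced in this report. As a rough illustration of the "single sequencer" check only, the sketch below counts the instrument identifiers found in FASTQ headers. It assumes Illumina Casava 1.8+ style headers (instrument ID as the first colon-separated field) and strict 4-line records; the function names and the example reads are hypothetical and are not taken from SAMBA.

```python
from collections import Counter
from io import StringIO


def instrument_ids(fastq_handle):
    """Count the instrument IDs found in the header lines of a FASTQ stream."""
    counts = Counter()
    for line_number, line in enumerate(fastq_handle):
        # In a strict 4-line FASTQ record, headers are lines 0, 4, 8, ...
        if line_number % 4 == 0:
            if not line.startswith("@"):
                raise ValueError(f"malformed FASTQ header at line {line_number + 1}")
            # Assumed header layout: @instrument:run:flowcell:lane:tile:x:y ...
            counts[line[1:].split(":", 1)[0]] += 1
    return counts


def single_sequencer(fastq_handle):
    """Return True when every read carries the same instrument ID."""
    return len(instrument_ids(fastq_handle)) == 1


# Hypothetical two-read example, both reads from instrument "M00001".
example = StringIO(
    "@M00001:12:000000000-ABCDE:1:1101:15000:1000 1:N:0:1\n"
    "ACGT\n+\nIIII\n"
    "@M00001:12:000000000-ABCDE:1:1101:15001:1001 1:N:0:1\n"
    "TTGA\n+\nIIII\n"
)
print(single_sequencer(example))  # True
```

A file mixing reads from two instruments would make `single_sequencer` return `False`, which is the kind of problem this step is meant to flag.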

Results


Importing raw data


This step imports the sequencing data directly from DATAREF into a QIIME 2 specific format. In addition, descriptive statistics of your data are generated. The SAMBA workflow ran the following commands for you:

Results

Output folder



Primers removal


The third step removes the primers using the cutadapt plugin available in QIIME 2. In addition, descriptive statistics of your data are generated. This step was run with the following commands, using the parameters that you defined in the parameters configuration file:

User-defined Cutadapt parameters
Results

Output folder
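For illustration only, the idea behind primer removal can be sketched in plain Python. This is neither the cutadapt implementation nor the command run by SAMBA: the function name `trim_primer`, the `max_error_rate` parameter and the example sequences are assumptions made for this sketch, and only a primer anchored at the 5' end with substitutions (no insertions or deletions) is handled.

```python
# IUPAC nucleotide codes, so that degenerate primer positions can be matched.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def trim_primer(read, primer, max_error_rate=0.1):
    """Return the read without its 5' primer, or None if the primer is not found."""
    if len(read) < len(primer):
        return None
    # Count positions where the read base is not allowed by the primer code.
    mismatches = sum(
        base not in IUPAC[code]
        for base, code in zip(read.upper(), primer.upper())
    )
    # Allow a number of mismatches proportional to the primer length.
    if mismatches > int(max_error_rate * len(primer)):
        return None
    return read[len(primer):]


primer = "GTGYCAGCMGCCGCGGTAA"  # example degenerate primer (Y = C/T, M = A/C)
read = "GTGCCAGCAGCCGCGGTAA" + "TACGGAGGG"
print(trim_primer(read, primer))  # TACGGAGGG
```

Reads in which the primer cannot be found return `None`, so a caller can choose to discard them rather than keep untrimmed sequences.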




Sequence quality control and feature table construction


The next step of the workflow filters the sequences by quality, according to the parameters you defined after considering the quality of your data in the previous step. It also merges the forward and reverse sequences, and identifies and removes chimeras. To perform this step, the DADA2 R package was used through QIIME 2 with the following commands and your parameters:

User-defined DADA2 parameters
Results

Output folder
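To make the quality filtering more concrete, the sketch below computes the expected number of errors of a read from its Phred scores, EE = sum(10^(-Q/10)), and applies a truncation length and a maximum expected error threshold, which is the kind of criterion DADA2 filtering relies on. It is a simplified illustration, not the DADA2 code: the function names are hypothetical, Phred+33 encoding is assumed, and denoising, merging and chimera removal are not covered.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities, EE = sum(10 ** (-Q / 10))."""
    return sum(10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string)


def passes_filter(quality_string, trunc_len, max_ee):
    """Truncate a read to trunc_len bases, then test its expected errors."""
    # Reads shorter than the truncation length are discarded.
    if len(quality_string) < trunc_len:
        return False
    return expected_errors(quality_string[:trunc_len]) <= max_ee


# Hypothetical read: four Q40 bases ("I") followed by four Q2 bases ("#").
qualities = "IIII####"
print(passes_filter(qualities, trunc_len=4, max_ee=2.0))  # True: low-quality tail removed
print(passes_filter(qualities, trunc_len=8, max_ee=2.0))  # False: tail adds about 2.5 expected errors
```

The example shows why the truncation length matters: cutting the read before the quality drops keeps it under the threshold, whereas keeping the tail causes it to be discarded.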


ASV clustering [optional]


The next step of the workflow clusters ASVs according to sequence similarity and abundance profile. This accounts for the overestimation of diversity produced by DADA2 and reduces the impact of possible PCR errors. To perform this step, the dbOTU3 algorithm was used through QIIME 2 with the following commands and parameters:

User-defined dbOTU3 parameters
Results

Output folder


Taxonomic assignment


The taxonomic assignment step affiliates each ASV to a taxonomy, using the reference database that you defined.

Commands executed

User-defined QIIME 2 parameters
Results

Output folder


Functional predictions (PICRUSt2) [optional]


This optional step predicts the functional potential of the communities using PICRUSt2.

User-defined PICRUSt2 parameters
Results

Output folder


Differential abundance testing using ANCOM


This step tests whether ASVs are differentially abundant according to the variable of interest that you specified, using the following commands:

Results

Output folder

List of all tested variables:

ANCOM analysis based on the sample_species variable:
ANCOM analysis based on the source variable:

Final outputs


This is the last step of the bioinformatic process: the ASV abundance table is merged with the taxonomy file.
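The merge described above can be sketched with the Python standard library. The column names (`ASV_ID`, `Taxon`), the tab-separated layout, the sample names and the "Unassigned" fallback are assumptions made for this example; the actual SAMBA output files may be organized differently.

```python
import csv
from io import StringIO


def merge_table_with_taxonomy(abundance_tsv, taxonomy_tsv):
    """Append a Taxon column to each row of an ASV abundance table."""
    taxonomy = {
        row["ASV_ID"]: row["Taxon"]
        for row in csv.DictReader(taxonomy_tsv, delimiter="\t")
    }
    merged = []
    for row in csv.DictReader(abundance_tsv, delimiter="\t"):
        row = dict(row)
        # ASVs missing from the taxonomy file are kept and flagged.
        row["Taxon"] = taxonomy.get(row["ASV_ID"], "Unassigned")
        merged.append(row)
    return merged


# Hypothetical inputs with two samples and two ASVs.
abundance = StringIO("ASV_ID\tsample_A\tsample_B\nasv1\t12\t0\nasv2\t3\t7\n")
taxonomy = StringIO("ASV_ID\tTaxon\nasv1\tBacteria;Proteobacteria\n")

for row in merge_table_with_taxonomy(abundance, taxonomy):
    print(row)
```

Keeping unmatched ASVs with an explicit label, rather than dropping them, avoids silently losing abundance data when the two files do not list exactly the same ASVs.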

Results


General statistical analyses

Alpha diversity [optional]


  • Tested variables:
  • Diversity indices used: "Observed richness / Chao1 / Shannon / InvSimpson / Pielou"
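As a reminder of what the indices listed above measure, here is a minimal pure-Python sketch computing them for a single sample. It is not the R code used by SAMBA: the function name is hypothetical, natural logarithms are assumed for Shannon, and Chao1 is written in its bias-corrected form, which is an assumption about the variant in use.

```python
import math


def alpha_diversity(counts):
    """Compute common alpha-diversity indices from the ASV counts of one sample."""
    counts = [c for c in counts if c > 0]
    if not counts:
        raise ValueError("the sample contains no reads")
    total = sum(counts)
    observed = len(counts)
    proportions = [c / total for c in counts]
    shannon = -sum(p * math.log(p) for p in proportions)
    inv_simpson = 1 / sum(p * p for p in proportions)
    # Pielou's evenness is undefined when a single ASV is present.
    pielou = shannon / math.log(observed) if observed > 1 else float("nan")
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    # Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).
    chao1 = observed + singletons * (singletons - 1) / (2 * (doubletons + 1))
    return {
        "Observed": observed,
        "Chao1": chao1,
        "Shannon": shannon,
        "InvSimpson": inv_simpson,
        "Pielou": pielou,
    }


# Hypothetical sample with five ASVs, one of which is absent.
print(alpha_diversity([5, 1, 1, 2, 0]))
```

For a perfectly even sample such as `[10, 10, 10, 10]`, the function returns an observed richness of 4, an inverse Simpson index of 4 and a Pielou evenness of 1, which is a quick sanity check of the formulas.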

  • Diversity indices


    sample_species

    source

    Rarefaction curve


    Taxonomic diversity





Beta diversity by index [optional]


  • Normalization method: "None / Rarefaction / DESeq2 / CSS"
  • Tested variables:
  • Distance matrix: "Jaccard / Bray-Curtis / UniFrac / Weighted UniFrac"
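Two of the distances listed above can be written in a few lines, which helps when reading the ordinations. The sketch below is illustrative only and is not the R code used by SAMBA: the function names are hypothetical, Jaccard is computed on presence/absence (an assumption, since abundance-based variants also exist), and the UniFrac distances are left out because they require a phylogenetic tree.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def jaccard(a, b):
    """Jaccard distance on presence/absence of ASVs."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        raise ValueError("both samples are empty")
    return 1 - len(present_a & present_b) / len(union)


# Hypothetical abundances of four ASVs in two samples.
sample_1 = [10, 0, 5, 5]
sample_2 = [5, 5, 0, 10]
print(bray_curtis(sample_1, sample_2))  # 0.5
print(jaccard(sample_1, sample_2))      # 0.5
```

Bray-Curtis is sensitive to abundances, whereas this form of Jaccard only reflects which ASVs are shared, so the two can rank sample pairs differently.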

  • Beta diversity by variable [optional]


    Ordination and hierarchical clustering for variable: source

    [Figures: three sets of NMDS, PCoA and hierarchical clustering (hclustering) plots]





    Ordination and hierarchical clustering for variable: sample_species

    [Figures: three sets of NMDS, PCoA and hierarchical clustering (hclustering) plots]





    Explained variance





Descriptive comparison [optional]


  • Tested variables:

  • Distribution of ASVs across the samples grouped by sample_species:


  • Distribution of ASVs across the samples grouped by source:
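The kind of comparison summarized in these plots can be sketched as set operations: for each combination of groups, list the ASVs found in all of those groups and in no other. This is a hedged illustration in Python, not the UpSetR code used by SAMBA; the function name, the group names and the ASV identifiers are hypothetical, and counting exclusive intersections is an assumption about how the plots are read.

```python
from itertools import combinations


def exclusive_intersections(asvs_by_group):
    """Map each combination of groups to the ASVs found only in those groups."""
    groups = sorted(asvs_by_group)
    result = {}
    for size in range(1, len(groups) + 1):
        for combo in combinations(groups, size):
            # ASVs present in every group of the combination...
            inside = set.intersection(*(asvs_by_group[g] for g in combo))
            # ...minus those also present in any group outside of it.
            outside = set().union(
                *(asvs_by_group[g] for g in groups if g not in combo)
            )
            exclusive = inside - outside
            if exclusive:
                result[combo] = exclusive
    return result


# Hypothetical ASV sets for three values of a grouping variable.
asvs = {
    "water": {"asv1", "asv2", "asv3"},
    "sediment": {"asv2", "asv3", "asv4"},
    "biofilm": {"asv3", "asv5"},
}
for combo, shared in exclusive_intersections(asvs).items():
    print(combo, sorted(shared))
```

Because each ASV falls into exactly one combination, the sizes of the returned sets add up to the total number of distinct ASVs.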



About

    Contributors

    SAMBA is developed by SeBiMER, the Bioinformatics Core Facility of Ifremer (French Research Institute for Exploitation of the Sea).

    Citations

    You can cite the Nextflow publication as follows:

    Paolo Di Tommaso, Maria Chatzou, Evan W. Floden, Pablo Prieto Barja, Emilio Palumbo & Cedric Notredame. (2017). Nextflow enables reproducible computational workflows. Nature biotechnology. 35(4), 316-319.

    You can cite each tool used in SAMBA as follows:

    Benjamin J. Callahan, Paul J. McMurdie, Michael J. Rosen, Andrew W. Han, Amy Jo A. Johnson & Susan P. Holmes. (2016). DADA2: high-resolution sample inference from Illumina amplicon data. Nature methods. 13(7), 581.

    Donald T. McKnight, Roger Huerlimann, Deborah S. Bower, Lin Schwarzkopf, Ross A. Alford & Kyall R. Zenger. (2019). microDecon: A highly accurate read‐subtraction tool for the post‐sequencing removal of contamination in metabarcoding studies. Environmental DNA. 1(1), 14-25.

    Evan Bolyen, Jai Ram Rideout, Matthew R. Dillon, Nicholas A. Bokulich, Christian C. Abnet, Gabriel A. Al-Ghalith, Harriet Alexander, Eric J. Alm, Manimozhiyan Arumugam, Francesco Asnicar, Yang Bai, Jordan E. Bisanz, Kyle Bittinger, Asker Brejnrod, Colin J. Brislawn, C. Titus Brown, Benjamin J. Callahan, Andrés Mauricio Caraballo-Rodríguez, John Chase, Emily K. Cope, Ricardo Da Silva, Christian Diener, Pieter C. Dorrestein, Gavin M. Douglas, Daniel M. Durall, Claire Duvallet, Christian F. Edwardson, Madeleine Ernst, Mehrbod Estaki, Jennifer Fouquier, Julia M. Gauglitz, Sean M. Gibbons, Deanna L. Gibson, Antonio Gonzalez, Kestrel Gorlick, Jiarong Guo, Benjamin Hillmann, Susan Holmes, Hannes Holste, Curtis Huttenhower, Gavin A. Huttley, Stefan Janssen, Alan K. Jarmusch, Lingjing Jiang, Benjamin D. Kaehler, Kyo Bin Kang, Christopher R. Keefe, Paul Keim, Scott T. Kelley, Dan Knights, Irina Koester, Tomasz Kosciolek, Jorden Kreps, Morgan G. I. Langille, Joslynn Lee, Ruth Ley, Yong-Xin Liu, Erikka Loftfield, Catherine Lozupone, Massoud Maher, Clarisse Marotz, Bryan D. Martin, Daniel McDonald, Lauren J. McIver, Alexey V. Melnik, Jessica L. Metcalf, Sydney C. Morgan, Jamie T. Morton, Ahmad Turan Naimey, Jose A. Navas-Molina, Louis Felix Nothias, Stephanie B. Orchanian, Talima Pearson, Samuel L. Peoples, Daniel Petras, Mary Lai Preuss, Elmar Pruesse, Lasse Buur Rasmussen, Adam Rivers, Michael S. Robeson II, Patrick Rosenthal, Nicola Segata, Michael Shaffer, Arron Shiffer, Rashmi Sinha, Se Jin Song, John R. Spear, Austin D. Swafford, Luke R. Thompson, Pedro J. Torres, Pauline Trinh, Anupriya Tripathi, Peter J. Turnbaugh, Sabah Ul-Hasan, Justin J. J. van der Hooft, Fernando Vargas, Yoshiki Vázquez-Baeza, Emily Vogtmann, Max von Hippel, William Walters, Yunhu Wan, Mingxun Wang, Jonathan Warren, Kyle C. Weber, Charles H. D. Williamson, Amy D. Willis, Zhenjiang Zech Xu, Jesse R. Zaneveld, Yilong Zhang, Qiyun Zhu, Rob Knight & J. Gregory Caporaso. (2019). Reproducible, interactive, scalable and extensible microbiome data science using QIIME 2. Nature biotechnology. 37(8), 852-857.

    Gavin M. Douglas, Vincent J. Maffei, Jesse Zaneveld, Svetlana N. Yurgel, James R. Brown, Christopher M. Taylor, Curtis Huttenhower & Morgan G. I. Langille. (2019). PICRUSt2: An improved and extensible approach for metagenome inference. BioRxiv. 672295.

    Marcel Martin. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 17(1), 10-12.

    R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

    Scott W. Olesen, Claire Duvallet & Eric J. Alm. (2017). dbOTU3: A new implementation of distribution-based OTU calling. PloS one. 12(5).

    Siddhartha Mandal, Will Van Treuren, Richard A. White, Merete Eggesbø, Rob Knight & Shyamal D. Peddada. (2015). Analysis of composition of microbiomes: a novel method for studying microbial composition. Microbial ecology in health and disease, 26(1), 27663.