Are there any tools that automatically detect where to truncate forward and reverse reads for 16S (DADA2)?
6 weeks ago
O.rka ▴ 620

I'm about to run DADA2 via QIIME2, and I'm noticing that it's very hands-on and not completely automated.

Are there any reliable tools that can indicate where I should trim my forward and reverse reads based on input fastq?

Here's the documentation for the module I am running: https://docs.qiime2.org/2022.2/plugins/available/dada2/denoise-paired/

Is it not advisable to merge all the reads, run them through something like FastQC, and then find the position where the 25th percentile of quality drops below 30?

16s dada2 illumina metagenomics qiime2
6 weeks ago

Hi,

DADA2, whether run via the Python wrapper scripts in QIIME2 or directly in R, can be as automated as you want. You can write a bash script, or a Snakemake or Nextflow workflow (among many others), that states the order and dependencies of every step. That way, from a given input you get your final results with no manual intervention.
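For what it's worth, "stating the order and dependencies of every step" can be sketched as a minimal Python driver. This is only an illustration: the file names and truncation values are placeholders, not recommendations, and the step list is trimmed to two commands of the QIIME 2 paired-end workflow.

```python
import subprocess

# Ordered pipeline steps; each entry is a complete command line.
# Paths and parameter values below are placeholders.
STEPS = [
    # 1. Import demultiplexed paired-end reads into a QIIME 2 artifact.
    ["qiime", "tools", "import",
     "--type", "SampleData[PairedEndSequencesWithQuality]",
     "--input-path", "manifest.tsv",
     "--input-format", "PairedEndFastqManifestPhred33V2",
     "--output-path", "demux.qza"],
    # 2. Denoise with DADA2; truncation lengths would normally come
    #    from inspecting the quality profiles.
    ["qiime", "dada2", "denoise-paired",
     "--i-demultiplexed-seqs", "demux.qza",
     "--p-trunc-len-f", "240",
     "--p-trunc-len-r", "160",
     "--o-table", "table.qza",
     "--o-representative-sequences", "rep-seqs.qza",
     "--o-denoising-stats", "stats.qza"],
]

def run_pipeline(steps=STEPS):
    """Run each step in order, aborting on the first failure."""
    for cmd in steps:
        subprocess.run(cmd, check=True)
```

A real workflow manager (Snakemake, Nextflow) adds caching and restartability on top of this same idea: an explicit ordered dependency graph.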

An automated pipeline like that works reasonably well for very standardized analyses, where you expect the data and results to behave within the norms. Even then, the pipeline should include QC steps that produce plots, and you should check those plots to confirm the data actually meets your quality expectations.

Are there any reliable tools that can indicate where I should trim my forward and reverse reads based on input fastq?

This depends on your criteria. There are tools that produce QC plots to help you decide which threshold values to use for trimming and truncating your reads, such as FastQC and MultiQC (which aggregates results produced by FastQC). The colors in these plots give you an indication of quality, and there are many guides online explaining how to interpret them.

I would say that people quite often use recommended values which are not "bullet-proof" but work for standard data quality. You can try the values provided by the tutorial/QIIME2 wrapper if you have the same type of data (I believe the values are intended for Illumina data, but I'm not sure) and the same 16S hyper-variable region (if you are using a different hyper-variable region or read length, the values may not be adequate). In general, if you are not very familiar with the data you're working on or with the analyses, I wouldn't recommend this approach, as you may run steps without knowing what they mean or what their implications are. In that case, simple automated wrappers like the one you pointed out can be dangerous, because they "hide" many sequential steps, which makes it difficult to understand their order and function.

I think the plugin you pointed out essentially implements the DADA2 workflow from R (DADA2 is an R package). Therefore, I recommend checking the DADA2 tutorial, which may help you better understand the workflow, the order of steps, their implications, and the options exposed by the QIIME2 wrapper: https://benjjneb.github.io/dada2/tutorial.html

Is it not advisable to merge all the reads, run them through something like FastQC, and then find the position where the 25th percentile of quality drops below 30?

There are different possible approaches. Usually I would trim and truncate the forward and reverse 16S rRNA reads independently before merging them; the DADA2 tutorial follows this approach as well. You then denoise your sequences (after learning the error rates), and only then merge them.

I hope this helps,

António


DADA2, whether run via the Python wrapper scripts in QIIME2 or directly in R, can be as automated as you want. You can write a bash script, or a Snakemake or Nextflow workflow (among many others), that states the order and dependencies of every step. That way, from a given input you get your final results with no manual intervention.

Yes, that's my goal, and I made a workflow package called GenoPype to do this.

An automated pipeline like that works reasonably well for very standardized analyses, where you expect the data and results to behave within the norms. Even then, the pipeline should include QC steps that produce plots, and you should check those plots to confirm the data actually meets your quality expectations.

This is the part I want to automate. In every tutorial I've seen, people follow roughly the same guideline: find where the read quality dips below 30. In "Denoising sequence data with DADA2" on the official QIIME 2 YouTube channel, the tutorial mentions that a good cutoff would be the position where the 25th percentile dips below 30. It concerns me to grab a random sample to estimate quality when selecting these cutoffs. I'm aware that all sequencing from a run should be similar, but I'd rather use all the data and have it automated so it can be easily reproduced. I saw there was a package called Figaro, but I don't think it's under active development, and the conda-installed package was DOA.
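In case it's useful, that 25th-percentile rule is straightforward to compute directly from a FASTQ file. A rough sketch, assuming Phred+33 encoding and NumPy; the function names here are my own, not from any package:

```python
import gzip
import numpy as np

def quality_matrix(fastq_path):
    """Parse Phred+33 quality strings from a FASTQ (optionally gzipped)
    into a reads x positions array, padding short reads with NaN."""
    opener = gzip.open if fastq_path.endswith(".gz") else open
    quals = []
    with opener(fastq_path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 3:  # every 4th line is the quality string
                quals.append([ord(c) - 33 for c in line.strip()])
    length = max(len(q) for q in quals)
    mat = np.full((len(quals), length), np.nan)
    for row, q in zip(mat, quals):
        row[:len(q)] = q
    return mat

def find_trunc_position(mat, percentile=25, threshold=30):
    """Return the first position where the given quality percentile
    drops below the threshold -- a candidate truncation length.
    Falls back to the full read length if no position qualifies."""
    q = np.nanpercentile(mat, percentile, axis=0)
    below = np.nonzero(q < threshold)[0]
    return int(below[0]) if below.size else mat.shape[1]
```

Run it on the forward and reverse FASTQs separately to get candidate values for the two truncation parameters. Since it reads every record rather than a subsample, it addresses the "randomly grab a sample" concern, at the cost of a full pass over the data.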

This depends on your criteria. There are tools that produce QC plots to help you decide which threshold values to use for trimming and truncating your reads, such as FastQC and MultiQC (which aggregates results produced by FastQC). The colors in these plots give you an indication of quality, and there are many guides online explaining how to interpret them.

Yes, I'm familiar with FastQC (not a fan of how clunky it is, to be honest; the output formats are terrible for machine reading), but that's not really what I'm looking for, as I'm trying to remove the qualitative thresholding so the process is 100% reproducible.

I would say that people quite often use recommended values which are not "bullet-proof" but work for standard data quality. You can try the values provided by the tutorial/QIIME2 wrapper if you have the same type of data (I believe the values are intended for Illumina data, but I'm not sure) and the same 16S hyper-variable region (if you are using a different hyper-variable region or read length, the values may not be adequate). In general, if you are not very familiar with the data you're working on or with the analyses, I wouldn't recommend this approach, as you may run steps without knowing what they mean or what their implications are. In that case, simple automated wrappers like the one you pointed out can be dangerous, because they "hide" many sequential steps, which makes it difficult to understand their order and function.

I think if you set some guard-rail parameters, like "reads must be at least this long" or "if more than this much is cut off, throw a fatal error", you can avoid these kinds of mishaps.

There are different possible approaches. Usually I would trim and truncate the forward and reverse 16S rRNA reads independently before merging them; the DADA2 tutorial follows this approach as well. You then denoise your sequences (after learning the error rates), and only then merge them.

Yeah, I don't think concatenation is the best approach either. What I'm going to do is compute stats for each sample, put them in a matrix, visually analyze where I would make the cuts, and then see if I can automate it. If it works for a few datasets, I'll put it up on my GitHub so other people can use it.
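A sketch of that per-sample matrix idea, in case it helps: compute a chosen quality percentile per position for each sample, stack the rows into a samples × positions matrix, and take a conservative consensus cutoff. The function name and the minimum-across-samples rule are my own choices, and equal read lengths across samples are assumed:

```python
import numpy as np

def consensus_trunc(per_sample_quals, percentile=25, threshold=30):
    """per_sample_quals: dict mapping sample name -> (reads x positions)
    array of Phred scores, all with the same read length.
    Builds a samples x positions matrix of the chosen percentile and
    returns the earliest position at which the worst sample's
    percentile falls below the threshold."""
    rows = [np.percentile(q, percentile, axis=0)
            for q in per_sample_quals.values()]
    matrix = np.vstack(rows)        # samples x positions
    worst = matrix.min(axis=0)      # worst sample at each position
    below = np.nonzero(worst < threshold)[0]
    return int(below[0]) if below.size else matrix.shape[1]
```

Taking the minimum across samples is deliberately conservative; taking a median row instead would tolerate one or two bad samples dragging the cutoff down.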

6 weeks ago

I had very good results for 16S analysis using the nf-core ampliseq pipeline at https://github.com/nf-core/ampliseq

There were only a few problems configuring the input, but using an input CSV file solved this easily enough. DADA2 is part of that pipeline.