4.6 years ago

Hello! I am stuck on one thing. I am using QIIME2 for my 16S analysis. I am trying to filter reads in the denoising step, and I am getting a representative sequence set that I am not able to understand. Below are some stats from the denoising step, performed with DADA2:

Trunc-Len               Reads            Non-Chimeric Sequences
0                       420355           1946
40                      52320            1308
100                     455600           4556
200                     104200           3521
300                     2400             8


As far as I understand, it removes the bases beyond the given trunc length.

What I don't understand is why it also discards the reads that are shorter than the given trunc length. It only keeps the reads that are longer than the trunc length provided and truncates the extra bases. Also, I do not understand why the representative sequences are exactly as long as the trunc length. Whatever trunc length I give, the representative sequences all come out at exactly that length.

I don't understand why this is happening. What are the consequences of this for assigning taxonomy, especially with a de novo based method?

Please help me understand this parameter so that I can proceed with enough knowledge to analyse my data correctly.

Thanks to all of you in advance for helping me understand the parameter.

Best Regards, Rahul

QIIME2 • 4.6k views

Thank you so much, gb, for sharing your knowledge.

4.6 years ago
gb ★ 2.2k

All reads need to have the same length. This is needed to know what needs to be "denoised" and needs to be put together.
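To make the truncation behaviour from the question concrete, here is a minimal sketch of what a truncation-length filter does. This is a toy illustration and not DADA2's actual code; the function name `truncate_reads` and the example reads are made up for this post. It assumes that a truncation length of 0 means "do not truncate".

```python
def truncate_reads(reads, trunc_len):
    """Toy truncation filter (hypothetical helper, not DADA2's code).

    Reads shorter than trunc_len are discarded, longer reads are cut
    down to exactly trunc_len bases. trunc_len == 0 disables truncation.
    """
    if trunc_len == 0:
        return list(reads)
    return [read[:trunc_len] for read in reads if len(read) >= trunc_len]


# Three made-up reads of length 19, 20 and 10.
reads = ["AGTAGATGATGATGATATA", "AGTAGATGATGATGATATAA", "AGTAGATGAT"]

kept = truncate_reads(reads, 15)
for read in kept:
    print(len(read), read)
```

With a truncation length of 15, the 10-base read is dropped and the other two reads both end up 15 bases long, which is why every representative sequence has exactly the truncation length.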

This is not exactly how DADA2 works, but it will give you an idea. The first step is dereplication, so all the reads that are 100% identical are "put together". Let's say you have a bunch of the following reads:

>Read1
AGTAGATGATGATGATATA
>Read2
AGTAGATGATGATGATATAA


They are clearly from the same species, but they will not be put together in the same OTU because they are not 100% identical (the length is different).
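The dereplication idea can be sketched in a few lines of Python. This is only an illustration of exact-match grouping, not how DADA2 is implemented; the `dereplicate` helper is hypothetical and the two reads are the ones from the example above.

```python
from collections import Counter


def dereplicate(reads):
    """Group reads that are 100% identical and count them (toy helper)."""
    return Counter(reads)


reads = ["AGTAGATGATGATGATATA", "AGTAGATGATGATGATATAA"]

# Without truncation the one extra base keeps the reads apart.
raw = dereplicate(reads)
print(len(raw), "unique sequences before truncation")

# After cutting both reads to 19 bases they collapse into one sequence.
trimmed = dereplicate([read[:19] for read in reads])
print(len(trimmed), "unique sequence after truncation")
```

Before truncation the two reads count as two different sequences; once both are cut to the same length they become one sequence with a count of 2.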

>Read1:100000
AGTAGATGATGATGATATA
>Read2:2
AGTAGGTGATGATGGTATA


Read1 is present 100000 times in your sample and Read2 only 2 times. After all the sampling and PCR steps you did, you have reads that are 100% identical and you have them many times over. This means the sequence really is present in your sample. A read that you only have 2 times, even after PCR, is probably noise, so it will be removed (denoised). The representative sequences are all the same length because the input reads are all the same length.

Again, the algorithm is a bit more complex, but this may help you with the first steps in understanding it. (DADA2 does not just throw out reads with an abundance of 2; it uses statistical calculations to decide what is noise and what is not.)

One possible problem is that you throw out species with extremely low abundance.

In terms of assigning taxonomy it is not a problem. If you BLAST representative sequences that are still full of PCR errors or chimeras, you don't get a hit and you are left wondering forever what the sequence could be. In theory (only in theory; in practice it is different, so don't take this too seriously) you only have real sequences after denoising, so if you still don't get a significant BLAST hit, it is probably just a bacterium that is not in the reference database yet.

If other people want to write a better explanation, I will move my answer to a comment.