Question

Cause of short sequences in Amplicon Sequencing?

0

14 months ago

Saran ▴ 50

Hello,

Amplicon sequencing was performed across 9 samples, and I then used DADA2 to infer the sequence variants. 83 variants were found; my top 5 are below:

1.AGTGCTTAAAGACTATTGTTTTATTCCCAAATTGTTCTCTTAATTTTATAACTATCTTATTTAAAGTGTCATTCCATTTTGCTCTACTAAGGTTACAATGTGCTTGTCTTATATCTCCTATTATTTCTCCTGTTGTATAAAATGCTCTGCCTGGTCCTATATTTATACTTTTT

2.ATTGCTTAAAGATTATTGTTTTATTATTTCCAAATTGTTCTCTTAATTTGCTAGCTATCTGTTTTAAAGTGGCATTCCATTTTGCTCTACTAATGTTACAATGTGCTTGTCTCATATTTCCTATTTTTCCTATTGTAACAAATGCTCTCCCTGGTCCCCTCTGGATACGGATACTTTTT

3.GTGCTTAAAGACTATTGTTTTATTCCCAAATTGTTCTCTTAATTTTATAACTATCTTATTTAAAGTGTATCTCCTATTATTTCTCCTGTTGTATAAAATGCTCTGCCTGGTCCTATATTTATACTTTTT

4.AGTGCCTGGTCCTATATTTATACTTTTT

5.AGTGCTTAAAGACTATTGTTTTATTAAAATGCTCTGCCTGGTCCTATATTTATACTTTTT

The first three are as expected, but what are these very short sequences and what causes them? I imagine they are sequencing errors, but what exactly is happening here to produce high counts of such short sequences?

Thank You, Sara

Illumina ASV Amplicon DADA2 • 654 views

2

I expect those short snippets really are present in your library, rather than just showing up due to your sequencing or dada2 usage. Take a look at how they compare in an MSA:

>seq1
AGTGCTTAAAGACTATTGTTTTAT---TCCCAAATTGTTCTCTTAATTTTATAACTATCTTATTTAAAGTGTCATTCCATTTTGCTCTACTAAGGTTACAATGTGCTTGTCTTATATCTCCTATTATTTCTCCTGTTGTATAAAATGCTCTGCCTGGTCCTA------TATTTATACTTTTT
>seq2
ATTGCTTAAAGATTATTGTTTTATTATTTCCAAATTGTTCTCTTAATTTGCTAGCTATCTGTTTTAAAGTGGCATTCCATTTTGCTCTACTAATGTTACAATGTGCTTGTCTCATATTTCCT---ATTTTTCCTATTGTAACAAATGCTCTCCCTGGTCCCCTCTGGATACGGATACTTTTT
>seq3
-GTGCTTAAAGACTATTGTTTTAT---TCCCAAATTGTTCTCTTAATTTTATAACTATCTTATTTAAAGTGTA-------------------------------------------TCTCCTATTATTTCTCCTGTTGTATAAAATGCTCTGCCTGGTCCTA------TATTTATACTTTTT
>seq4
----------------------------------------------------------------------------------------------------------------------------------------------------AGTGCCTGGTCCTA------TATTTATACTTTTT
>seq5
AGTGCTTAAAGACTATTGTTTTAT--------------------------------------------------------------------------------------------------------------------TAAAATGCTCTGCCTGGTCCTA------TATTTATACTTTTT

Whenever we do amplicon sequencing we get some fraction of reads that just match one, the other, or both primer sequences, and people have told me that's to be expected due to primer dimer and the like. When all works well it's just a small fraction, but when the starting material is degraded or low abundance we see a lot more of that. Could that explain what you're seeing? For example I notice sequence #4 is nearly identical to the end of most of the others (reverse primer?) and #5 looks like the start and end put together (primer dimer?).
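One rough way to test this idea is a small Python sketch like the one below. It is not part of DADA2, and the function names are made up for illustration. It checks whether a short variant is fully accounted for by the start of a full-length variant joined directly to its end, which is the pattern you would expect from a primer dimer or a similar short artifact. A `True` result is only consistent with that explanation; it does not prove it.

```python
# Sequences #1, #4 and #5 from the question.
seq1 = ("AGTGCTTAAAGACTATTGTTTTATTCCCAAATTGTTCTCTTAATTTTATAACTATCTTATTTAAAGTG"
        "TCATTCCATTTTGCTCTACTAAGGTTACAATGTGCTTGTCTTATATCTCCTATTATTTCTCCTGTTGT"
        "ATAAAATGCTCTGCCTGGTCCTATATTTATACTTTTT")
seq4 = "AGTGCCTGGTCCTATATTTATACTTTTT"
seq5 = "AGTGCTTAAAGACTATTGTTTTATTAAAATGCTCTGCCTGGTCCTATATTTATACTTTTT"


def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def common_suffix_len(a, b):
    """Number of trailing characters that a and b share."""
    return common_prefix_len(a[::-1], b[::-1])


def looks_like_end_to_end_join(short, full):
    """True if `short` is covered by a prefix of `full` plus a suffix of `full`."""
    if len(short) >= len(full):
        return False
    covered = common_prefix_len(short, full) + common_suffix_len(short, full)
    return covered >= len(short)


for name, seq in [("seq4", seq4), ("seq5", seq5)]:
    print(name, len(seq), looks_like_end_to_end_join(seq, seq1))
```

If you know your actual primer sequences, comparing the short variants against those directly would be more informative than comparing against another variant.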

ADD REPLY • link 14 months ago by Jesse ▴ 740

1

Hey Saran

Some kind of quality filtering of the raw reads is necessary for DADA2, because low-quality reads negatively affect the accuracy of its error-model estimates. So the actual number of ASVs can differ considerably depending on how the raw sequencing data were processed.
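To make "quality filtering" a bit more concrete: DADA2's `filterAndTrim` can discard reads by expected errors (its `maxEE` argument). The Python sketch below only illustrates that arithmetic and is not DADA2's own code. It assumes Phred+33 quality strings, and the threshold of 2 and the function names are just example choices.

```python
def expected_errors(quality_string, phred_offset=33):
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(c) - phred_offset) / 10) for c in quality_string)


def passes_max_ee(quality_string, max_ee=2.0):
    """Keep a read only if its expected number of errors is at most max_ee."""
    return expected_errors(quality_string) <= max_ee


# 'I' encodes Q40 and '!' encodes Q0 under Phred+33.
for qual in ["IIIIIIII", "II!!!II"]:
    print(qual, round(expected_errors(qual), 4), passes_max_ee(qual))
```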

That said, I don't know why you get very short ASVs, but the DADA2 community is very active, so you should probably ask there, explaining every step you used to process the raw sequencing data and generate the ASVs.

ADD REPLY • link 14 months ago by andres.firrincieli 3.6k