Question

Is it required for all sequences to be the same length when clustering OTUs in metagenomics (ONT Nanopore)?

20 months ago

O.rka ▴ 710

I am following a pipeline that one of my collaborators created for 16S rRNA OTU clustering of ONT reads. In one of the steps, they truncate all of the reads to the same length (e.g., 1400 bp).

Is it required to have all the sequences at the same length? I feel like I'm arbitrarily throwing away useful information.

otu metagenomics clustering nanopore • 1.2k views

score 1 · Answer 1 · 2022-08-05

20 months ago

antonioggsousa 3.2k

Hi,

In my opinion, yes, it is. Of course, this depends on several variables, such as the primers used, the expected gene length, and the pipeline/alignment method used.

OTUs (Operational Taxonomic Units) are defined by a sequence similarity threshold, typically 97-99%. This means that for a particular OTU, say OTU1, the sequences that make up OTU1 share a sequence similarity at or above that threshold (computed from a sequence alignment).
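To make the threshold idea concrete, here is a minimal Python sketch of greedy, centroid-style clustering. It is a toy under stated assumptions: the function names `identity` and `greedy_otus` are made up for illustration, identity is a simple position-by-position comparison that only works for equal-length sequences, and real OTU tools compute identity from an alignment instead.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences.

    Toy definition (an assumption for this sketch): no alignment is done,
    so both sequences must already have the same length.
    """
    if len(a) != len(b):
        raise ValueError("this toy identity needs equal-length sequences")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_otus(seqs, threshold=0.97):
    """Put each sequence into the first cluster whose centroid it matches."""
    centroids = []  # first sequence seen in each cluster
    clusters = []   # list of member lists, parallel to centroids
    for s in seqs:
        for i, c in enumerate(centroids):
            if identity(s, c) >= threshold:
                clusters[i].append(s)
                break
        else:
            # No centroid was similar enough: start a new OTU.
            centroids.append(s)
            clusters.append([s])
    return clusters


# Hypothetical 100 nt sequences: one with 1 mismatch, one with 5 mismatches.
base = "ACGT" * 25
one_mismatch = "C" + base[1:]
chars = list(base)
for i in (0, 4, 8, 12, 16):
    chars[i] = "C"
five_mismatches = "".join(chars)

otus = greedy_otus([base, one_mismatch, five_mismatches], threshold=0.97)
print(len(otus))
```

With a 97% threshold, the sequence with one mismatch (99% identical) joins the first OTU, while the one with five mismatches (95% identical) starts its own.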

In general, when sequences are the same length, it is easier and faster to find the best alignment between them.

With an alignment algorithm that uses some kind of global (end-to-end) alignment, a shorter sequence will show lower similarity to a longer sequence even if it matches perfectly over the region they share, simply because it does not align across the whole length of the longer sequence, and therefore yields a lower percent identity.
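As a rough illustration of that effect, the Python sketch below compares two ways of computing identity for a short read that matches a longer read perfectly. The function `identities` and the read lengths are hypothetical, and whether end gaps are counted depends on the identity definition of the tool you use, so treat this as arithmetic rather than as any tool's actual behaviour.

```python
def identities(short, long_):
    """Return (end_to_end, overlap_only) identity for a perfect sub-match.

    Toy case: `short` must occur exactly inside `long_`, so every aligned
    base matches and only the unaligned ends differ.
    """
    if long_.find(short) == -1:
        raise ValueError("short must be an exact substring of long_")
    matches = len(short)
    end_to_end = matches / len(long_)    # unaligned ends counted as differences
    overlap_only = matches / len(short)  # unaligned ends ignored
    return end_to_end, overlap_only


# Hypothetical reads: a 1400 nt read and a 1000 nt read covering its start.
long_read = "ACGT" * 350
short_read = long_read[:1000]

end_to_end, overlap_only = identities(short_read, long_read)
print(f"end-to-end: {end_to_end:.3f}, overlap only: {overlap_only:.3f}")
```

Counted end to end, the perfectly matching 1000 nt read scores 1000/1400, far below a 97% threshold, which is the situation truncating to a common length avoids.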

I hope this helps,

António

This actually helps a lot! I guess that's one of the under-the-hood technical differences between ASV and OTU analysis.

20 months ago by O.rka ▴ 710

Absolutely. With ASVs you're working with exact sequences. Even so, you should still check that the ASV sequence lengths fall within the range you expect (based on the primers used; see the DADA2 tutorial): https://benjjneb.github.io/dada2/tutorial.html (quoted below)

Considerations for your own data: Sequences that are much longer or shorter than expected may be the result of non-specific priming. You can remove non-target-length sequences from your sequence table (eg. seqtab2 <- seqtab[,nchar(colnames(seqtab)) %in% 250:256]). This is analogous to “cutting a band” in-silico to get amplicons of the targeted length.
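For readers not working in R, the same "cut a band in silico" idea can be sketched in Python. This is only an analogue of the quoted one-liner, not DADA2 code: a plain dict of sequence to read count stands in for the sequence table, and `filter_by_length` and the toy sequences are hypothetical.

```python
def filter_by_length(seq_counts, min_len, max_len):
    """Keep only sequences whose length is within [min_len, max_len]."""
    return {
        seq: count
        for seq, count in seq_counts.items()
        if min_len <= len(seq) <= max_len
    }


# Toy sequence table (assumption): keys are sequences, values are read counts.
seq_counts = {
    "A" * 253: 120,  # expected amplicon length
    "C" * 254: 45,   # expected amplicon length
    "G" * 180: 3,    # too short, possibly non-specific priming
    "T" * 400: 1,    # too long
}

kept = filter_by_length(seq_counts, 250, 256)
print(sorted(len(s) for s in kept))
```

The 250 to 256 range mirrors the one in the quoted tutorial text; in practice it should come from the primers and target region actually used.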

António

20 months ago by antonioggsousa 3.2k