Question

How to properly identify most likely species from small amplicon sequences?

4.2 years ago

GBC_Zonatos ▴ 10

After sequencing 16S amplicons for various samples that were (supposedly) isolated cultures, I've found several distinct amplicons, with varying frequencies. When BLASTing these amplicons, the most frequent ones return 100% identity and coverage to multiple species in the same genus in NCBI's nucleotide database.

When I BLAST other sequences from the same sample that have fewer counts, they match specific species. Should I try to identify samples using the less frequent sequences? Or simply ignore them and acknowledge all species that matched on BLAST at 100% as 'possible/probable species'?
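One way to make the "possible/probable species" idea concrete is to collect, for each amplicon, every species tied for the best hit and only call a species when the tie has a single member. The sketch below is just an illustration: the tuple layout `(query_id, species_name, percent_identity, query_coverage)`, the function name `best_hits`, and the example values are all assumptions made up for the example, not real BLAST output from this thread.

```python
from collections import defaultdict


def best_hits(hits):
    """Group hits per query and report the species tied for the top score.

    `hits` is an iterable of (query_id, species_name, percent_identity,
    query_coverage) tuples. This layout is a hypothetical one; adapt it
    to whatever columns your BLAST output actually contains.
    """
    by_query = defaultdict(list)
    for query, species, pident, qcov in hits:
        by_query[query].append((pident, qcov, species))

    calls = {}
    for query, rows in by_query.items():
        # Rank by identity first, then by coverage.
        top = max((pident, qcov) for pident, qcov, _ in rows)
        tied = sorted({sp for pident, qcov, sp in rows if (pident, qcov) == top})
        genera = {sp.split()[0] for sp in tied}
        if len(tied) == 1:
            call = tied[0]                 # unambiguous species-level call
        elif len(genera) == 1:
            call = genera.pop() + " sp."   # tie within one genus
        else:
            call = "unresolved"            # tie spans several genera
        calls[query] = (call, tied)
    return calls


# Made-up example values, for illustration only.
example_hits = [
    ("asv1", "Bacillus cereus", 100.0, 100.0),
    ("asv1", "Bacillus thuringiensis", 100.0, 100.0),
    ("asv1", "Bacillus subtilis", 97.2, 100.0),
    ("asv2", "Bacillus subtilis", 100.0, 100.0),
    ("asv2", "Bacillus cereus", 96.5, 100.0),
]
calls = best_hits(example_hits)
for query, (call, tied) in sorted(calls.items()):
    print(query, call, tied)
```

With this kind of summary, an amplicon that ties across several species is reported at genus level together with the list of candidates, rather than being forced to a single species name.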

I've tried building phylogenetic trees from my sequences together with sequences of similar species, but because my amplicons are quite short (283 bp), I can't get significant bootstrap support for most nodes.

16s phylogeny amplicon gene sequence • 812 views

ADD COMMENT • link updated 4.2 years ago by Charles Warden 8.2k • written 4.2 years ago by GBC_Zonatos ▴ 10

score 0 · Answer 1 · 2020-02-04

4.2 years ago

Mensur Dlakic ★ 27k

The reason your bootstrap support is not significant has nothing to do with short sequences, as 283 bp is plenty long. More likely it has to do with uncertainty in their relationships. Maximum likelihood methods struggle with identical or near-identical sequences, because three identical sequences can be arranged in at least three different branch configurations that are all equally likely, and ML programs tend to be bamboozled by that. Sometimes adding several known sequences to the collection will help stabilize the tree. Lastly, you may wish to try Bayesian methods, as they tend to give more significant branch posterior probabilities.
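To illustrate the point about identical sequences, here is a small sketch. It counts the rooted, bifurcating topologies for a given number of taxa using the standard (2n-3)!! formula, and counts the alignment columns that vary at all. The function names and the toy sequences are made up for the example, and the input is assumed to be already aligned and of equal length.

```python
def rooted_topologies(n_taxa):
    """Number of distinct rooted, bifurcating topologies: (2n - 3)!!"""
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


def variable_sites(aligned_seqs):
    """Count alignment columns that contain more than one distinct character."""
    return sum(1 for column in zip(*aligned_seqs) if len(set(column)) > 1)


# Toy alignment: three identical sequences (illustrative only).
identical = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC"]

n_trees = rooted_topologies(len(identical))
n_variable = variable_sites(identical)
print(n_trees, "possible rooted topologies,", n_variable, "variable sites")
```

With zero variable sites, the data contain nothing that favors one of those topologies over another, so each bootstrap replicate can resolve the group differently and support stays low.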

Whether you throw away some of your sequences depends on your goal. I would be more interested in sequences that are not 100% identical to what is already in the database, because that at least has a chance of being something novel. The less frequent sequences are also community members, though likely less abundant. Again, I wouldn't want to discard them because the species criterion should not be something that has 100% coverage and 100% identity to already known species. In fact, in most cases I'd be looking for sequences that do not fulfill the 100% criterion.

ADD COMMENT • link 4.2 years ago by Mensur Dlakic ★ 27k

I have already checked my sequences for identical ones and clustered them at 100% identity, though several of them are indeed very similar, at over 99% identity. As for creating the tree, would you recommend any particular program? I'm currently using IQ-TREE, with automatic model prediction and a high number of iterations.
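For context on what those numbers mean at this amplicon length, here is a rough sketch of collapsing reads at 100% identity and checking pairwise identity between the resulting unique sequences. The helper names, the synthetic 283 bp sequence, and the read counts are all invented for illustration, and the comparison assumes the sequences are already aligned and of equal length.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse exact duplicates; return (sequence, count), most frequent first."""
    return Counter(reads).most_common()


def percent_identity(a, b):
    """Percent identity of two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def mutate(seq, positions):
    """Return a copy of seq with a different base at each given position."""
    bases = list(seq)
    for pos in positions:
        bases[pos] = "A" if bases[pos] != "A" else "C"
    return "".join(bases)


# Synthetic 283 bp sequence and two variants (illustrative only).
reference = "ACGT" * 70 + "ACG"
two_diffs = mutate(reference, [10, 200])
three_diffs = mutate(reference, [10, 100, 200])

unique = dereplicate([reference] * 5 + [two_diffs] * 2 + [three_diffs])
id_two = percent_identity(reference, two_diffs)
id_three = percent_identity(reference, three_diffs)
print(len(unique), "unique sequences")
print(round(id_two, 2), round(id_three, 2))
```

At 283 bp, two mismatches still leave identity above 99% while three drop it just below, so sequences at "over 99%" differ from each other by at most two positions.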

Originally, I was trying to build a tree using all of my sequences, from all samples, which mapped on BLAST to 4 different genera. I've now been trying to make individual trees for each genus, using only an outgroup sequence and reference sequences from that specific genus (focusing on species that matched in the original BLAST analysis). Even with this subset, I am still getting low bootstrap support (under 0.2).

ADD REPLY • link 4.2 years ago by GBC_Zonatos ▴ 10

Not sure that using any particular program will change much in your case, as it is not easy to tease out the evolutionary history of near-identical sequences.

With a caveat that it still may not give you reliable branch support, it is almost a guarantee that Bayesian inference programs will give you somewhat better branch support. Here is a software repository of phylogenetic programs, so try something from the Bayesian section. I like MrBayes and PhyloBayes, but others probably work just as well. You may want to try one of the servers first as it will give you an idea what to expect before spending time on program installation and learning how they work.

ADD REPLY • link 4.2 years ago by Mensur Dlakic ★ 27k

score 0 · Answer 2 · 2020-02-04

I have only worked with a somewhat limited number of larger samples. However, to me, the main difference between PacBio full-length 16S sequences (V1-V9) and shorter MiSeq (V1-V3, V3-V4, V4-V5, etc.) or HiSeq sequences (V3, V4, V5, etc.) was the ability to get genus-level assignments for a larger fraction of total reads.

If you have a similar (or larger) reference set, comparing assignments from BLAST, BWA, or another classifier may give you some sense of how robust the results are. However, I agree that you may need some other marker to be more confident in a species-level assignment. In other words, I don't think it is unusual to have trouble getting a confident species-level assignment from the kind of data you are describing.