Sequence search parameters for annotation of de novo transcriptomes
2.7 years ago
Dunois ★ 2.5k

I've looked at quite a few papers (e.g., Carruthers et al. 2018 and Chabikwa et al. 2020) that have used BLAST searches to annotate protein coding sequences in their de novo assembled transcriptomes. Most just seem to take the best hit at an "appropriate" e-value threshold (e.g., -max_target_seqs 1, -evalue 1e-3) and be done with it. (Note: the annotations are usually done with translated sequences or translated searches against reference protein sets.)

But tools like BLAST expose some other options to the user that are just as relevant as the e-value, namely sequence identity and sequence coverage.

I have been unable to find any consensus on what reasonable values for these parameters are.

For instance, does it really make sense to consider transferring annotations to a short sequence (query) that has under 20% sequence identity, 15% target coverage, and something like 30% query coverage?

Do Biostars users have any takes on this issue of whether it is (in)appropriate to tweak the aforementioned additional parameters, and if yes, what the recommended cutoffs would be? Would something like a minimum of 20% sequence identity, and at least 50% query coverage be agreeable? (No target coverage limits since the query sequences may be fragments given the transcriptome has been assembled de novo.)
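To make the proposed cutoffs concrete, here is a minimal Python sketch of the kind of post-filtering I have in mind. It assumes the search was run with a custom tabular layout (`-outfmt "6 qseqid sseqid pident length qlen slen qstart qend sstart send evalue bitscore"`); that column order, the function name `best_hits`, and the toy hits are all my own illustration, not taken from either paper. Query coverage is computed per HSP from the query coordinates, so hits split across several HSPs would be underestimated. The defaults (20% identity, 50% query coverage, e-value 1e-3) are just the values floated above, not recommendations.

```python
import csv
import io

# Assumed column order; must match the -outfmt string used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qlen", "slen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, min_pident=20.0, min_qcov=50.0, max_evalue=1e-3):
    """Return {query id: best passing HSP} from BLAST tabular output.

    An HSP passes if it meets the identity, query coverage, and e-value
    cutoffs. Among passing HSPs, the one with the highest bit score wins.
    No target coverage filter is applied, since de novo contigs may be
    fragments.
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        pident = float(hit["pident"])
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        qstart, qend, qlen = int(hit["qstart"]), int(hit["qend"]), int(hit["qlen"])
        # abs() handles reverse-frame translated hits where qstart > qend.
        qcov = 100.0 * (abs(qend - qstart) + 1) / qlen
        if pident < min_pident or qcov < min_qcov or evalue > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current["bitscore"]:
            best[hit["qseqid"]] = {
                "sseqid": hit["sseqid"],
                "pident": pident,
                "qcov": round(qcov, 1),
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


# Toy hits (made up): one passes, one fails identity, one fails coverage.
example = (
    "contig1\tprotA\t45.0\t100\t450\t300\t1\t300\t1\t100\t1e-30\t120\n"
    "contig1\tprotB\t18.0\t140\t450\t500\t1\t420\t5\t144\t1e-5\t50\n"
    "contig2\tprotC\t60.0\t30\t300\t200\t10\t99\t1\t30\t1e-4\t40\n"
)
hits = best_hits(io.StringIO(example))
for query, hit in hits.items():
    print(query, hit["sseqid"], hit["pident"], hit["qcov"])
```

With these toy hits only contig1 keeps an annotation (from protA); contig2 is dropped because its single HSP covers just 30% of the query, which is exactly the kind of hit I am unsure about.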

annotation transcriptome search blast • 732 views

This is a really interesting question. Annotation of any sort with short-read sequence data is difficult, and it is more difficult still for non-model organisms. People tend to use BLAST (and other BLAST-like tools) a lot for annotating transcripts or contigs.

Although the following example may sit at the extreme end of the spectrum, it is important not to draw too many conclusions from this sort of analysis:

For instance, does it really make sense to consider transferring annotations to a short sequence (query) that has under 20% sequence identity, 15% target coverage, and something like 30% query coverage?

Moreover, it is worth noting that annotating the genome or transcriptome of a novel species can be a very challenging process, and often the scientists are trying their best to carry out analyses in the context of specific research questions.

Genome annotation is an iterative process. As more good quality data is made available, annotations tend to improve but we have to start somewhere...
