I've looked at quite a few papers (e.g., Carruthers et al. 2018 and Chabikwa et al. 2020) that have used BLAST searches to annotate protein-coding sequences in their de novo assembled transcriptomes. Most just seem to take the best hit at an "appropriate" e-value threshold (e.g., -max_target_seqs 1, -evalue 1e-3) and be done with it. (Note: the annotations are usually done with translated sequences or translated searches against reference protein sets.)
But tools like BLAST expose some other options to the user that are just as relevant as the e-value, namely sequence identity and sequence coverage.
I have been unable to find any consensus on what reasonable values for these parameters are.
For instance, does it really make sense to consider transferring annotations to a short sequence (query) that has under 20% sequence identity, 15% target coverage, and something like 30% query coverage?
Do Biostars users have any takes on whether it is (in)appropriate to tweak these additional parameters and, if so, what the recommended cutoffs would be? Would something like a minimum of 20% sequence identity and at least 50% query coverage be agreeable? (No target coverage limit, since the query sequences may be fragments, given that the transcriptome has been assembled de novo.)
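
To make the proposed cutoffs concrete, here is a minimal sketch of the kind of post-hoc filter I have in mind, applied to BLAST tabular output. It assumes the search was run with a custom column layout (-outfmt "6 qseqid sseqid pident length qlen slen qstart qend sstart send evalue bitscore"); the function names, the example rows, and the default thresholds are purely illustrative, not a recommendation. Query coverage here is computed per HSP from the query coordinates, which is a simplification: it does not merge multiple HSPs against the same subject.

```python
import csv
import io

# Assumed column order; it must match the -outfmt string used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "qlen", "slen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def coverage(start, end, total_len):
    """Percent of a sequence spanned by one HSP (1-based, inclusive coordinates).

    abs() is used because start can exceed end for reverse-frame hits.
    """
    return 100.0 * (abs(end - start) + 1) / total_len


def filter_hits(handle, min_pident=20.0, min_qcov=50.0, max_evalue=1e-3):
    """Return the highest-bitscore hit per query that passes every threshold."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t"):
        pident = float(row["pident"])
        evalue = float(row["evalue"])
        bitscore = float(row["bitscore"])
        qcov = coverage(int(row["qstart"]), int(row["qend"]), int(row["qlen"]))
        # No target-coverage filter: de novo contigs may be fragments.
        if pident < min_pident or qcov < min_qcov or evalue > max_evalue:
            continue
        previous = best.get(row["qseqid"])
        if previous is None or bitscore > previous["bitscore"]:
            best[row["qseqid"]] = {
                "sseqid": row["sseqid"],
                "pident": pident,
                "qcov": qcov,
                "evalue": evalue,
                "bitscore": bitscore,
            }
    return best


# Made-up rows for illustration only (tab-separated, same order as COLUMNS).
EXAMPLE = (
    "tx1\tP1\t45.0\t100\t300\t120\t1\t300\t5\t104\t1e-30\t150\n"
    "tx2\tP2\t18.0\t40\t600\t400\t1\t120\t10\t49\t5e-4\t35\n"
    "tx3\tP3\t60.0\t30\t900\t200\t1\t90\t1\t30\t1e-5\t50\n"
)

if __name__ == "__main__":
    for query, hit in filter_hits(io.StringIO(EXAMPLE)).items():
        print(query, hit["sseqid"], hit["pident"], round(hit["qcov"], 1))
```

With these made-up rows, tx2 is dropped for falling under 20% identity and tx3 for covering only 10% of the query, even though both pass the e-value cutoff. That is exactly the kind of hit I am unsure about transferring an annotation from.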