Question: BLASTX results - some making no sense
4.5 years ago
Biogeek400 wrote:

Hi friends,

I'm annotating a higher plant that sits well within the higher plant lineage; it is a pond species. I ran a local BLASTX against the UniProt Viridiplantae database and I'm getting a lot of top hits to basal lineages such as the unicellular green algae. These results don't make much sense to me, given the evolutionary advancement of this plant compared to the lower plants. When I look further down the list at the next-best hits, there is always a more appropriate hit.
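One way to check how often a more appropriate hit sits just behind the top one is to pull the best and runner-up subject for each transcript from the tabular output. The following is a minimal Python sketch, assuming the search was run with `-outfmt 6` and the default 12-column order. The function names and the 5% score-gap threshold are illustrative choices, not part of BLAST.

```python
from collections import defaultdict

# Default column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def top_hits(lines, n=2):
    """Return {query: [(subject, bitscore, pident), ...]}, best n subjects by bit score.

    Several HSPs to the same subject are collapsed to the highest-scoring one.
    """
    best = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < len(COLUMNS) or line.startswith("#"):
            continue  # skip comments and malformed rows
        rec = dict(zip(COLUMNS, fields))
        query, subject = rec["qseqid"], rec["sseqid"]
        score = float(rec["bitscore"])
        if score > best[query].get(subject, (0.0, 0.0))[0]:
            best[query][subject] = (score, float(rec["pident"]))
    return {
        query: sorted(
            ((s, sc, pid) for s, (sc, pid) in subjects.items()),
            key=lambda hit: hit[1],
            reverse=True,
        )[:n]
        for query, subjects in best.items()
    }


def close_calls(top, max_gap=0.05):
    """List queries whose runner-up bit score is within max_gap (a fraction) of the best."""
    flagged = []
    for query, hits in top.items():
        if len(hits) > 1 and hits[0][1] > 0:
            if (hits[0][1] - hits[1][1]) / hits[0][1] <= max_gap:
                flagged.append(query)
    return flagged
```

Usage would be along the lines of `top = top_hits(open("blastx.tsv"))` followed by `close_calls(top)`, where `blastx.tsv` is a placeholder file name. Transcripts in the returned list have a runner-up that scores almost as well as the best hit, so the top-hit taxon alone says little for them.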

As a result, I then ran a BLASTX against just the embryophyte lineage of UniProt and TrEMBL (land plants and aquatic plants), and the results make much more sense; however, the % identity is low for some transcripts. What are people's takes on this? Is it better to use a more specific database in such a case, given that the evolutionary position of my non-model organism is clear to me and it sits on a distant branch of the higher plants, far from the early branches of the Viridiplantae? Or do I just use the entire Viridiplantae lineage?

I've additionally run a BLASTN to detect any contamination, with a 95% cut-off and a low e-value. The database for this was built from the unicellular microbial algae and eukaryotes that appear in the first set of BLASTX hits. There were very few hits, which I removed.
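A filtering step like this can be sketched in Python as follows. It assumes BLASTN tabular output (`-outfmt 6`, default column order) and reads the 95% cut-off as percent identity. The e-value threshold of 1e-20 and the minimum alignment length of 100 bp are placeholder values chosen for illustration, since the exact settings are not stated here, and the function names are hypothetical.

```python
def contaminant_queries(lines, min_pident=95.0, max_evalue=1e-20, min_length=100):
    """Return the set of query IDs with at least one hit passing all thresholds.

    Columns used (0-based, default -outfmt 6): 0 qseqid, 2 pident, 3 length, 10 evalue.
    The default thresholds are assumptions, not values taken from the thread.
    """
    flagged = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12 or line.startswith("#"):
            continue  # skip comments and malformed rows
        pident = float(fields[2])
        length = int(fields[3])
        evalue = float(fields[10])
        if pident >= min_pident and evalue <= max_evalue and length >= min_length:
            flagged.add(fields[0])
    return flagged


def drop_flagged(fasta_lines, flagged):
    """Yield FASTA lines, skipping every record whose ID is in `flagged`."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            keep = seq_id not in flagged
        if keep:
            yield line
```

The length filter is there so that a short, near-identical stretch (a conserved domain or rRNA fragment, say) does not on its own get a whole transcript discarded. Whether that is appropriate depends on the data.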

What are people's opinions? Thanks.

written 4.5 years ago by Biogeek400
4.5 years ago
Chris Fields
University of Illinois Urbana-Champaign
Chris Fields wrote:

I'm guessing this is from an assembly; is it a transcriptome or full genome?

It might be worth doing an overall non-biased analysis, maybe something like a blobplot against a larger database, just to see if there are any oddities in the data that might indicate problems (e.g. contaminating organisms, which are very common). We've done this using BLASTN and DIAMOND in place of BLASTX and have found this helps considerably (you can also use the results to help identify and filter the problematic sequences). You did say this was a pond plant and your hits are against algae...

written 4.5 years ago by Chris Fields

Hey Chris, thanks for this. It's a de novo transcriptome assembly. Essentially what files are needed for this? Will a fasta of the assembly suffice? Thanks

written 4.5 years ago by Biogeek400

Not sure how well this would work with a transcriptome assembly, as the coverage varies so much (coverage is one critical component of a blobplot). But you could probably look at variation in GC content and at which taxonomic groups are found overall via BLASTX.

EDIT: 'phylogenetic' -> 'taxonomic'
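The GC-content check needs only the assembly FASTA. Here is a minimal, dependency-free Python sketch. The function names are illustrative, and ambiguous bases such as N are simply left out of the denominator, which is one possible convention among several.

```python
def read_fasta(lines):
    """Yield (sequence_id, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G+C among unambiguous bases (A, C, G, T); NaN if there are none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else float("nan")


def gc_table(lines):
    """Return a list of (sequence_id, length, gc_fraction) for every record."""
    return [(name, len(seq), gc_fraction(seq)) for name, seq in read_fasta(lines)]
```

Plotting a histogram of the third column from `gc_table(open("assembly.fasta"))` (file name is a placeholder) and looking for a second peak is a quick first pass. Contigs in an unexpected GC range can then be cross-checked against their BLASTX hit taxa.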

written 4.5 years ago by Chris Fields

An area which I agree needs some improvement ;-). Might stick with the Embryophyta database and then BLASTN. Would someone be criticised for this approach?

written 4.5 years ago by Biogeek400
