Question: How to deal with many uncharacterized proteins in the blastx results?
Asked 4.2 years ago by seta (Sweden):

Hi all,

I recently made a de novo assembled transcriptome of a non-model plant and ran blastx of the assembly against UniProt (Viridiplantae). Although about 70% of the contigs got a hit, most of the best hits were uncharacterized proteins, which are of little interest. I used the command `/blastx -query file1.fasta -db uniprot -out file1_uni.txt -evalue 1e-3 -max_target_seqs 20 -outfmt '6 std sscinames scomnames stitle' -num_threads 9` and then tried to get the best hit per contig with `export LANG=C; export LC_ALL=C; sort -k1,1 -k12,12gr -k11,11g -k3,3gr blastout.txt | sort -u -k1,1 --merge > bestHits`. Could you please give me your opinion on the results and help me reduce the number of uncharacterized protein hits?

sequencing blast rna-seq assembly • 1.5k views
written 4.2 years ago by seta • modified 4.1 years ago by Biostar

In my opinion, `1e-3` is not a very stringent e-value threshold, and the best hit alone is not a suitable basis for annotation transfer. How do you handle a BLAST best hit with an e-value of `0.9e-3`? Do you annotate that transcript with the hit's function/name? Such an e-value can originate from a very short match.
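To make the threshold point concrete, here is a minimal sketch of a stricter post-filter on the tabular output. It assumes the column layout produced by `-outfmt '6 std ...'` as in the question (column 4 = alignment length, column 11 = e-value). The function name `filter_hits` and the default cutoffs (1e-6 and 50 aligned residues) are illustrative assumptions, not recommended values.

```shell
# filter_hits [MAX_EVALUE] [MIN_ALN_LEN] < blast_tabular.tsv
# Hypothetical helper: keeps only rows whose e-value is at most MAX_EVALUE
# and whose alignment length is at least MIN_ALN_LEN.
# Assumes tab-separated BLAST output with the 12 standard columns first.
filter_hits() {
    local max_evalue="${1:-1e-6}"   # assumed default, adjust to taste
    local min_len="${2:-50}"        # assumed default, adjust to taste
    awk -F'\t' -v e="$max_evalue" -v l="$min_len" \
        '($11 + 0) <= (e + 0) && ($4 + 0) >= (l + 0)'
}
```

It could be used as `filter_hits 1e-6 50 < file1_uni.txt > file1_uni.filtered.txt`. Requiring a minimum alignment length in addition to the e-value is one way to guard against the short matches mentioned above.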

For a qualified annotation I would also include, for example, protein domain information, and only transfer the best hit's annotation if the match between query and subject/database sequence spans most of the transcript, etc. But I think there is no clear "best practice" for this, because the annotation process depends on too many variables, e.g., you have a non-model plant, which likely lacks comparative sequences in databases. Such more stringent filtering will reduce the fraction of contigs with an annotation, but if you want high-quality annotations you can be sure about, this is the path I would follow.

However, having a large number of uncharacterized contigs is normal in my opinion. A large number of proteins in public databases are uncharacterized, and that's it. What you probably can do is use an alternative database, e.g., the KEGG Orthology groups (KO). But here you definitely need to use more stringent thresholds (e.g., the query sequence has to match the protein over 80% of its length with 75% positives/identity, or something similar).
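A coverage and similarity filter of this kind could be sketched as follows. This assumes blastx is rerun with two extra output columns, e.g. `-outfmt '6 std slen ppos stitle'`, so that column 13 is the subject (protein) length and column 14 is the percentage of positive-scoring matches. The function name `filter_by_coverage` is hypothetical, "80% length" is interpreted here as coverage of the subject protein, and the check is applied per HSP (per output row), not per merged hit.

```shell
# filter_by_coverage [MIN_SUBJECT_COV_PCT] [MIN_PPOS] < blast_tabular.tsv
# Hypothetical helper. Assumed column layout ('6 std slen ppos ...'):
#   $9 = sstart, $10 = send, $13 = slen, $14 = ppos
filter_by_coverage() {
    local min_cov="${1:-80}"    # percent of the subject protein covered
    local min_ppos="${2:-75}"   # percent positive-scoring matches
    awk -F'\t' -v c="$min_cov" -v p="$min_ppos" '
        ($13 + 0) > 0 {
            d = $10 - $9                 # aligned span on the subject
            if (d < 0) d = -d            # tolerate reversed coordinates
            cov = 100 * (d + 1) / $13    # subject coverage in percent
            if (cov >= c + 0 && ($14 + 0) >= p + 0) print
        }
    '
}
```

For example, `filter_by_coverage 80 75 < hits_with_slen_ppos.tsv` (an illustrative file name) would keep only rows that satisfy both thresholds.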

written 4.1 years ago by Manuel Landesfeind • modified 4.1 years ago

I agree that the e-value threshold is way too relaxed. In my opinion, significant hits start around 1e-6. In another post, I have already explained the shortcomings of sorting blastx results like this. A simple way to reduce "uncharacterized protein" hits is to first filter them out of the blastx output (if the sequence titles are included in the output, then simply with `grep -v -i "uncharacterized protein" input | ..`; the `-i` matters because UniProt titles usually start with a capital "Uncharacterized"). A more sophisticated approach might only consider alternative hits that are, e.g., within 0.X bitscore of the highest-scoring hit.
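The second idea could look roughly like the sketch below. It reads the tabular output from the question (query ID in column 1, bit score in column 12, title somewhere on the line) and, per query, reports the highest-scoring hit unless a hit without "uncharacterized protein" in its line scores within a given fraction of the top bit score. The function name `pick_informative_hit`, the reading of "within 0.X" as a fraction of the top score, and the default of 0.9 are all assumptions made for illustration. `sort -g` is a GNU/BSD extension, as in the original command.

```shell
# pick_informative_hit [MIN_FRACTION_OF_TOP_BITSCORE] < blast_tabular.tsv
# Hypothetical helper: one output row per query.
pick_informative_hit() {
    local frac="${1:-0.9}"   # assumed default, not a recommended value
    # Group rows by query, highest bit score first within each query.
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k12,12gr |
    awk -F'\t' -v f="$frac" '
        $1 != q {                          # top-scoring row of a new query
            if (q != "") print chosen      # emit the previous query
            q = $1; top = $12 + 0; chosen = $0
            named = (tolower($0) !~ /uncharacterized protein/)
            next
        }
        # Top hit was uncharacterized: take the first named hit that is
        # still close enough to the top score.
        !named && ($12 + 0) >= f * top &&
        tolower($0) !~ /uncharacterized protein/ {
            chosen = $0; named = 1
        }
        END { if (q != "") print chosen }
    '
}
```

For example, `pick_informative_hit 0.9 < file1_uni.txt > bestHits`. If no named hit is close enough, the uncharacterized top hit is kept, so no query loses its best hit.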

written 4.1 years ago by 5heikki

Thanks for the comments. I agree with you about the e-value; however, I prefer not to be too strict, as I'm working on a non-model plant with almost no available information. I recently ran blastx against other databases, and many of the "uncharacterized protein" hits turned out to be known proteins in those results, but I'm not sure how an "uncharacterized protein" can logically be replaced with a known protein from another database. Could you please let me know if there is a way to integrate the various blastx results (obtained from several databases) into a single informative result?

written 4.1 years ago by seta