Question: Obtaining BLAST results with only one subject per query and only one query per subject
0
gravatar for tal.shalev
3.1 years ago by
tal.shalev0
tal.shalev0 wrote:

Hi,

I am annotating a gene set and am having some difficulty. Before blasting against larger databases I want to blast my gene set against a small set of genes that have been annotated in my lab first in order to make sure these are given precedence over other hits. For example, If I blast against nr the top hit for a sequence may just be "unknown protein", but I may get a hit which has a slightly lower bitscore or percent homology, but with a more accurate description from my small set. I would like to get the best hit for each sequence and then exclude those sequences when blasting against larger databases.

The problem is, when blasting against the smaller set I sometimes get many sequences matching the same target sequence, with varying levels of homology. I only want to include the best match. Blast+ seems to have options to only include one match for each sequence in my gene set (query), but it will often have multiples of sequences in my target set (subject). This is hard to explain, I know, so I will show an example.

Here is my blast command:

blastp -query my_gene_set.fasta -db manual_protein_annotations.fasta -evalue 1e-10 -outfmt 6 -out blastp.outfmt6 -max_target_seqs 1

And some of my output:

evgvelvLoc913t2 GQ00411_K20.1   76.99   365 77  2   1   359 1   364 0.0 605
evgvelvLoc913t3 GQ00411_K20.1   76.44   365 79  2   1   359 1   364 0.0 605
evgvelvLoc913t5 GQ00411_K20.1   77.84   352 77  1   8   359 14  364 0.0 601
evgvelvLoc913t6 GQ00411_K20.1   74.52   365 86  2   1   359 1   364 0.0 592
evgvelvLoc913t7 GQ00411_K20.1   75.34   365 83  2   1   359 1   364 0.0 596
evgvelvLoc934t15    k60_k66_1849292_rc  71.43   154 40  1   11  164 341 490 6e-77   238
evgvelvLoc934t17    k60_k66_1849292_rc  72.73   154 38  1   11  164 341 490 5e-77   239

As you can see, the queries only show up once each but subjects can match multiple queries. I want to take lower scoring queries and remove the HSP entirely so that it can be 'freed up' to be searched against other databases. I have pretty much exhausted the blast+ options I understand with no luck, eg. max_hsps, culling_limit, window_size, etc.

Does anyone have any idea how to do this? Bonus points if you can figure out how to do it to the blast file in xml or archival format, but just processing the tabular format is fine.

Thanks a lot!

ADD COMMENTlink modified 3.1 years ago by Biostar ♦♦ 20 • written 3.1 years ago by tal.shalev0

Have you tried the -max_target_seqs option ?

EDIT: Yes you did. I didn't scroll far enough to the right to see the end of your command.

You can filter the output to get just the top HSP for each hit (HSPs should be sorted by decreasing score by default). Note that BLAST is a local alignment algorithm so having multiple HSPs is what it's made for. If that's not what you want, maybe you should consider using another alignment algorithm.

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Jean-Karim Heriche18k

Take a look at this thread and see if any options there help: Unexpected BLAST results
I have a feeling that you are going to have to post process the results to achieve what you need. See some options here: http://seqanswers.com/forums/showthread.php?t=27735

ADD REPLYlink written 3.1 years ago by genomax67k

Thanks for that link. I was surprised to see the issue with the "bug" in max_target_seqs. I think I will either not use any flags for my blast searches or just use another alignment algorithm as Jean-Karim suggested.

ADD REPLYlink written 3.1 years ago by tal.shalev0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1168 users visited in the last hour