I am annotating a gene set and am having some difficulty. Before blasting against larger databases, I want to blast my gene set against a small set of genes that have been annotated in my lab, so that these are given precedence over other hits. For example, if I blast against nr, the top hit for a sequence may just be "unknown protein", whereas my small set may give a hit with a slightly lower bitscore or percent identity but a more accurate description. I would like to get the best hit for each sequence and then exclude those sequences when blasting against larger databases.
The problem is that when blasting against the smaller set, I sometimes get many query sequences matching the same target sequence, with varying levels of identity. I only want to keep the best match. BLAST+ seems to have options to report only one match for each sequence in my gene set (query), but the same sequence in my target set (subject) will often appear multiple times. This is hard to explain, I know, so I will show an example.
Here is my blast command:
blastp -query my_gene_set.fasta -db manual_protein_annotations.fasta -evalue 1e-10 -outfmt 6 -out blastp.outfmt6 -max_target_seqs 1
And some of my output:
evgvelvLoc913t2 GQ00411_K20.1 76.99 365 77 2 1 359 1 364 0.0 605
evgvelvLoc913t3 GQ00411_K20.1 76.44 365 79 2 1 359 1 364 0.0 605
evgvelvLoc913t5 GQ00411_K20.1 77.84 352 77 1 8 359 14 364 0.0 601
evgvelvLoc913t6 GQ00411_K20.1 74.52 365 86 2 1 359 1 364 0.0 592
evgvelvLoc913t7 GQ00411_K20.1 75.34 365 83 2 1 359 1 364 0.0 596
evgvelvLoc934t15 k60_k66_1849292_rc 71.43 154 40 1 11 164 341 490 6e-77 238
evgvelvLoc934t17 k60_k66_1849292_rc 72.73 154 38 1 11 164 341 490 5e-77 239
As you can see, each query shows up only once, but a subject can match multiple queries. For each subject, I want to keep only the best-scoring query and remove the hits for the lower-scoring queries entirely, so that those queries are 'freed up' to be searched against other databases. I have pretty much exhausted the BLAST+ options I understand with no luck, e.g. max_hsps, culling_limit, window_size, etc.
Does anyone have any idea how to do this? Bonus points if you can figure out how to do it to the blast file in xml or archival format, but just processing the tabular format is fine.
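To make the filtering I am after concrete, here is a rough Python sketch of what I mean for the tabular output. It is only an illustration under some assumptions: "best" is taken to mean the highest bitscore (column 12 of the default outfmt 6 layout), ties go to whichever row appears first, and the function name best_query_per_subject is just something I made up. It returns the rows to keep (one per subject) and the query IDs that would be freed up for the next search.

```python
def best_query_per_subject(lines):
    """Split outfmt 6 rows into (kept rows, freed query IDs).

    Assumes the default 12-column outfmt 6 layout:
    column 1 = query ID, column 2 = subject ID, column 12 = bitscore.
    """
    best = {}          # subject ID -> (bitscore, fields) of the best row so far
    all_queries = []   # every query ID seen, in input order
    for line in lines:
        fields = line.split()
        if len(fields) < 12:
            continue   # skip blank or malformed lines
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        all_queries.append(query)
        # Strict '>' means the first row wins when bitscores tie (an assumption).
        if subject not in best or bitscore > best[subject][0]:
            best[subject] = (bitscore, fields)

    kept = [fields for _, fields in best.values()]
    kept_queries = {fields[0] for fields in kept}
    # Queries that lost out; dict.fromkeys removes duplicates but keeps order.
    freed = [q for q in dict.fromkeys(all_queries) if q not in kept_queries]
    return kept, freed


SAMPLE = """\
evgvelvLoc913t2 GQ00411_K20.1 76.99 365 77 2 1 359 1 364 0.0 605
evgvelvLoc913t3 GQ00411_K20.1 76.44 365 79 2 1 359 1 364 0.0 605
evgvelvLoc913t5 GQ00411_K20.1 77.84 352 77 1 8 359 14 364 0.0 601
evgvelvLoc913t6 GQ00411_K20.1 74.52 365 86 2 1 359 1 364 0.0 592
evgvelvLoc913t7 GQ00411_K20.1 75.34 365 83 2 1 359 1 364 0.0 596
evgvelvLoc934t15 k60_k66_1849292_rc 71.43 154 40 1 11 164 341 490 6e-77 238
evgvelvLoc934t17 k60_k66_1849292_rc 72.73 154 38 1 11 164 341 490 5e-77 239
"""

if __name__ == "__main__":
    kept, freed = best_query_per_subject(SAMPLE.splitlines())
    for fields in kept:
        print("\t".join(fields))   # rows to keep, one per subject
    print("freed:", ", ".join(freed))
```

On the example above this would keep evgvelvLoc913t2 for GQ00411_K20.1 and evgvelvLoc934t17 for k60_k66_1849292_rc, and the other five queries would go on to the larger databases. For the real file I would pass an open file handle on blastp.outfmt6 instead of SAMPLE.splitlines().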
Thanks a lot!