Get one hit per species in BLASTp
6 weeks ago
Agenor Neto ▴ 10

Hi everyone! I have searched for an answer to this, but it seems to be a problem that is not yet solved: is there any way to run BLAST and get just one hit per species (taxid)?

Right now I am running BLASTp locally and it works fine, but for my current purpose isoforms are not what I want. Instead, I would like to retrieve only the canonical form (I am aware that this information is not available for all proteins). Is there an existing program for this, or something I could do programmatically?

BLAST blastp • 320 views

You will invariably need to post-process (parse) your BLAST output to get this kind of specific result. This is easier if you use one of the tab-delimited output formats.
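As a minimal sketch of that post-processing, assume the search was run with a custom tabular format such as `-outfmt "6 qseqid sseqid pident length evalue bitscore staxids"`. The column order is an assumption, and `staxids` is only filled in when the database carries taxonomy information. The function name `best_hit_per_taxid` is illustrative, and choosing the highest bitscore is just one possible criterion.

```python
import csv

# Assumed column order; it must match the -outfmt string used for the search.
COLUMNS = ["qseqid", "sseqid", "pident", "length", "evalue", "bitscore", "staxids"]

def best_hit_per_taxid(handle):
    """Return {(qseqid, taxid): row}, keeping the highest-bitscore hit."""
    best = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and comment lines (e.g. from outfmt 7)
        row = dict(zip(COLUMNS, fields))
        # staxids may hold several IDs separated by ';'
        for taxid in row["staxids"].split(";"):
            key = (row["qseqid"], taxid)
            if key not in best or float(row["bitscore"]) > float(best[key]["bitscore"]):
                best[key] = row
    return best
```

Called as `best_hit_per_taxid(open("hits.tsv"))` (file name hypothetical), this gives one row per query and taxid. It does not know which isoform is canonical; it only picks the top-scoring hit.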

6 weeks ago

I'm not certain what you mean by "canonical form", but for the sake of discussion I'll assume that you want one orthologue per species from a database such as UniProt. In the BIRCH system, BLAST search output pops up in several windows, one of which is a BioLegato spreadsheet that lets you sort the hits.

In the BioLegato spreadsheet, you could sort first by TaxID (species) and then by a characteristic such as alignment length or E-value. The top row for each species would then be the best hit for that species by the chosen criterion (longest alignment, lowest E-value, etc.). Whichever hits you select could then be retrieved from NCBI to a file or to a new BioLegato object for subsequent analysis. This process is outlined in the tutorial "Searching Local Sequence Databases Using BLAST".
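The same sort-then-pick idea can also be scripted outside a spreadsheet. A minimal sketch, assuming each hit has already been parsed into a dictionary with the hypothetical keys `taxid`, `evalue` and `length`: sort by TaxID, then by ascending E-value, then by descending alignment length, and take the first row of each TaxID group.

```python
from itertools import groupby
from operator import itemgetter

def top_hit_per_species(hits):
    """Sort by taxid, then E-value (lowest first), then length (longest first)."""
    ordered = sorted(hits, key=lambda h: (h["taxid"], h["evalue"], -h["length"]))
    # After sorting, the first row in each taxid group is the preferred hit.
    return [next(group) for _, group in groupby(ordered, key=itemgetter("taxid"))]

# Made-up example rows, for illustration only.
hits = [
    {"sseqid": "A", "taxid": "9606", "evalue": 1e-50, "length": 300},
    {"sseqid": "B", "taxid": "9606", "evalue": 1e-50, "length": 320},
    {"sseqid": "C", "taxid": "10090", "evalue": 1e-20, "length": 280},
]
selected = top_hit_per_species(hits)
```

Here the two hits for taxid 9606 tie on E-value, so the longer alignment (B) is kept. Changing the sort key changes what "best" means.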


That is exactly the conclusion I reached after reading the BLAST documentation: I am going to try filtering my results using features such as E-value, percentage identity, etc. For anyone attempting the same thing, here is a link to the parameters that can be set in command-line BLAST. Something that helped me work out this approach was the rationale behind RefSeq Select. It applies to transcripts and currently covers only human, mouse and prokaryotes, but it serves the same purpose we are discussing here. Once I get this task done, I will post here whether this solution worked. Thanks!
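For illustration, a threshold filter of this kind might look like the sketch below. The cutoff values and the dictionary keys `evalue` and `pident` are placeholders chosen for the example, not recommended settings. An E-value cutoff can also be applied at search time with the `-evalue` option of `blastp`.

```python
def passes_filters(hit, max_evalue=1e-5, min_pident=30.0):
    """Keep a hit only if it meets both (arbitrary, example) thresholds."""
    return hit["evalue"] <= max_evalue and hit["pident"] >= min_pident

# Made-up example rows, for illustration only.
hits = [
    {"sseqid": "A", "evalue": 1e-40, "pident": 85.0},
    {"sseqid": "B", "evalue": 0.5, "pident": 90.0},    # fails the E-value cutoff
    {"sseqid": "C", "evalue": 1e-10, "pident": 22.0},  # fails the identity cutoff
]
kept = [h for h in hits if passes_filters(h)]
```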
