blastdbcmd error: wrongly formatted FASTA headers
2.3 years ago
Agenor Neto ▴ 10

Hi Biostars community! I searched here for this blastdbcmd error and did not find it. I am using blastdbcmd to retrieve the set of proteins returned by a previous BLASTp search. I take all the IDs it gives me, put them in a .file document (which I think works like a .txt file; someone correct me if I am wrong), and pass that file to the -entry_batch parameter of blastdbcmd. The program runs, but I get this kind of FASTA file:

>XP_009183870.1 U4/U6.U5 tri-snRNP-associated protein 1 isoform X2 [Papio anubis] >XP_011719200.1 U4/U6.U5 tri-snRNP-associated protein 1 isoform X2 [Macaca nemestrina] >XP_011820612.1 PREDICTED: U4/U6.U5 tri-snRNP-associated protein 1 [Mandrillus leucophaeus] >XP_015289917.1 PREDICTED: U4/U6.U5 tri-snRNP-associated protein 1 isoform X2 [Macaca fascicularis]
MALRQREELREKLAAAKEKRLLNQKLGKIKTLGEDDPWLDDTAAWIERSRQLQKEKDLAEKRAK...

(I have left out the rest of the sequence for readability.)

Here you can see a clear problem in the FASTA header: it contains FOUR '>' characters, when the usual recommendation is to have only one per header. You can also see that this single header line actually holds FOUR deflines, each with its own ID and species. The last one is the one I actually want (it has the ID that BLAST gave me). I really do not know where the others came from, and they are breaking the downstream analysis I am trying to do.

If you know how to fix this, please let me know. Thanks!

blastdbcmd blast • 884 views
2.3 years ago
GenoMax 141k

Which database are you using for this? Because this sequence is identical in several organisms, the refseq_protein database stores it as a single entry whose FASTA header lists all of them:

$ blastdbcmd -db refseq_protein -entry XP_015289917 -outfmt %f
>XP_009183870.1 U4/U6.U5 tri-snRNP-associated protein 1 isoform X2 [Papio anubis] >XP_011719200.1 U4/U6.U5 tri-snRNP-associated protein 1 isoform X2 [Macaca nemestrina] >XP_011820612.1 PREDICTED: U4/U6.U5 tri-snRNP-associated protein 1 [Mandrillus leucophaeus] >XP_015289917.1 PREDICTED: U4/U6.U5 tri-snRNP-associated protein 1 isoform X2 [Macaca fascicularis]
MALRQREELREKLAAAKEKRLLNQKLGKIKTLGEDDPWLDDTAAWIERSRQLQKEKDLAEKRAKLLEEMDQEFGVSTLVE

If you want one specific entry, it may be better to retrieve it using Entrez Direct:

$ efetch -db protein -id XP_015289917 -format fasta
>XP_015289917.1 U4/U6.U5 tri-snRNP-associated protein 1 isoform X3 [Macaca fascicularis]
MALRQREELREKLAAAKEKRLLNQKLGKIKTLGEDDPWLDDTAAWIERSRQLQKEKDLAEKRAKLLEEMD
QEFGVSTLVEEEFGQRRQDLYSARDLQGLTVEHAIDSFREGETMILTLKDKGVLQEEEDVLVNVNLVDKE
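
If you would rather stay offline, another option is to post-process the blastdbcmd output and keep, in each merged header, only the defline whose accession is in your ID list. Below is a minimal sketch of that idea, not a tested pipeline. It assumes the deflines are joined by " >" exactly as in the output above, that your ID file has one accession per line (with or without the ".1" version suffix), and that a POSIX awk is available. The function name keep_target_defline is hypothetical, not part of BLAST+.

```shell
#!/bin/sh
# keep_target_defline IDS_FILE < merged.fasta > clean.fasta
# Hypothetical helper (not part of BLAST+): for each FASTA header that holds
# several deflines joined by " >", keep only the first defline whose
# accession is listed in IDS_FILE. Headers without a listed accession are
# left unchanged, and sequence lines pass through untouched.
keep_target_defline() {
  awk '
    # ID file: load the wanted accessions, ignoring any ".version" suffix.
    FILENAME == ARGV[1] {
      id = $1
      sub(/\.[0-9]+$/, "", id)
      if (id != "") wanted[id] = 1
      next
    }
    # Header lines: split into individual deflines and look for a wanted one.
    /^>/ {
      n = split(substr($0, 2), defline, / >/)
      for (i = 1; i <= n; i++) {
        split(defline[i], word, " ")
        acc = word[1]
        sub(/\.[0-9]+$/, "", acc)
        if (acc in wanted) { print ">" defline[i]; next }
      }
      print   # no listed accession found: keep the header as it was
      next
    }
    # Everything else is sequence data.
    { print }
  ' "$1" -
}
```

For example, you could pipe the output of your existing blastdbcmd -entry_batch call into keep_target_defline with the same ID file. A description that itself contains " >" would be split incorrectly, so check a few headers by eye. It may also be worth running blastdbcmd -help on your installation: the BLAST+ versions I have seen list a -target_only option ("Definition line should contain target entry only"), which, if present in yours, may make this filter unnecessary.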

Hello! A little late, but I tested your suggestion and it works... just not the way I want. efetch works only via the web, as far as I know and have tested. I really do not want to rely on the web for this task. I am open to any ideas or suggestions. Thank you!


"efetch works only via the web, as far as I know and have tested."

I linked the Unix command-line version of Entrez Utilities in my answer above. The output I posted there came from that utility.
