Question: Error with makeblastdb on the TrEMBL FASTA file?
3.7 years ago by
DavidK0
France
DavidK0 wrote:

I created a BLAST database using the TrEMBL FASTA file from UniProt.

This is the record inside my database (built with the -parse_seqids option), retrieved with the command:

blast/bin/blastdbcmd -db my_blastdb -dbtype prot -entry 'G3S368'

>tr|G3S368|G3S368_GORGO Uncharacterized protein OS=Gorilla gorilla gorilla GN=CTLA4 PE=4 SV=1
RYSVGDNDSNNVSIIDTSTNSVVGTVNVGLSTYNVAFTPDGKKIYATNSRNNTTSVIDVTTNKVTATVPTGDHPTDIAVS
PDGNKVYITNTGSNDLSVIDVTTNKVTATVPVGDGPCGVAVTLDGKKAYVPNKRSNTVSVINATTNTVTATVPVGITPLG
VAVTPDGNKVYVTNAESGNVSVIDTATNKVTATVNTGKYYMNYPVEVVIVPFMDSNMTDQSIGATSNAT

On the UniProt website:
>tr|G3S368|G3S368_GORGO Uncharacterized protein OS=Gorilla gorilla gorilla GN=CTLA4 PE=4 SV=1
MACLGFQRHKAQLNLATRTWPCTLLFFLLFIPVFCKAMHVAQPAVVLASSRGIASFVCEY
ASPGKATEVRVTVLRQADSQVTEVCAATYMMGNELTFLDDSICTGTSSGNQVNLTIQGLR
AMDTGLYICKVELMYPPPYYLGIGNGTQIYVIDPEPCPDSDFLLAFWVFFVKLSQSLFLL
SSIQVGTQYVLSSIMLKKRSPLTTGVYVKMPPTEPECEKQFQPYFIPIN

Any idea? I don't think this is the only record with this issue.

I checked, and the sequence in the FASTA file is the correct one (the same as on the UniProt website).

Update:

I've created the database a second time, and I don't seem to have run into the same issue.

This time the sequence is the same in the database and in the original FASTA file.

The difference is that I didn't use the following option: -max_file_sz 5GB

Is it possible that this option caused the issue?
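To find out whether other records were affected, one option is to dump the database back to FASTA and compare it with the original file, accession by accession. The sketch below is only a rough illustration under some assumptions: the dump is assumed to come from something like `blastdbcmd -db my_blastdb -entry all`, headers are assumed to look like the `tr|G3S368|G3S368_GORGO` lines above (with a fallback to the first token of the header), and the function names `read_fasta` and `mismatched_accessions` are made up for this example.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map accession -> sequence; if an accession repeats, the last record wins."""
    records: Dict[str, List[str]] = {}
    chunks = None  # sequence chunks of the record currently being read
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            tokens = line[1:].split()
            first = tokens[0] if tokens else ""
            fields = first.split("|")
            # Assumption: UniProt-style 'db|accession|name'; otherwise use the first token.
            accession = fields[1] if len(fields) >= 3 else first
            chunks = []
            records[accession] = chunks
        elif chunks is not None and line:
            chunks.append(line)
    return {acc: "".join(parts) for acc, parts in records.items()}


def mismatched_accessions(original: Dict[str, str], dumped: Dict[str, str]) -> List[str]:
    """Accessions in the dump whose sequence is missing from or differs in the original."""
    return sorted(acc for acc, seq in dumped.items() if original.get(acc) != seq)
```

Both files would be passed as open file handles, for example `read_fasta(open(path))` with a path of your own. For a file the size of TrEMBL, keeping every sequence in memory is probably not practical, so storing a hash per sequence or checking only a sample of accessions may be a better fit.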

modified 3.7 years ago • written 3.7 years ago by DavidK0

I don't quite get what you are pointing out as the underlying problem. That the IDs don't match? That the sequences are not the same? And how exactly is this a makeblastdb issue?

written 3.7 years ago by mxs

The sequences are different, but it's the same protein. Why is the sequence in my database not the same as the one on UniProt? Is it an error during makeblastdb? I've checked the file and found the record that UniProt has. I wondered whether the FASTA file might contain two sequences with the same header.

That does not seem to be the case: the FASTA file contains only one sequence for this identifier.
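For reference, a check like this can be done by streaming through the FASTA file and collecting every record whose accession matches. This is a minimal sketch, not necessarily how the check was done here: it assumes UniProt-style headers of the form `>tr|G3S368|G3S368_GORGO ...`, and the helper names `accession_of` and `find_records` are hypothetical.

```python
from typing import Iterable, List, Tuple


def accession_of(header: str) -> str:
    """Return the accession from a FASTA header line.

    Assumes 'db|accession|name' as in UniProt headers; otherwise falls back
    to the first whitespace-delimited token after '>'.
    """
    tokens = header.lstrip(">").split()
    first = tokens[0] if tokens else ""
    fields = first.split("|")
    return fields[1] if len(fields) >= 3 else first


def find_records(lines: Iterable[str], accession: str) -> List[Tuple[int, str]]:
    """Return (header_line_number, sequence) for every record with this accession."""
    hits: List[Tuple[int, str]] = []
    current = None  # (line_number, chunks) while inside a matching record
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                hits.append((current[0], "".join(current[1])))
            current = (line_number, []) if accession_of(line) == accession else None
        elif current is not None and line:
            current[1].append(line)
    if current is not None:
        hits.append((current[0], "".join(current[1])))
    return hits
```

Called on an open handle of the downloaded file with the accession `"G3S368"`, it returns one entry per matching record, so more than one entry would mean the identifier is duplicated. The line numbers also make it easy to inspect each record by hand.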

modified 3.7 years ago • written 3.7 years ago by DavidK0

Can you locate G3S368 in your downloaded TrEMBL file? It is possible that something went wrong during makeblastdb, but that is highly unlikely. My guess is that some kind of mix-up happened between the website and the TrEMBL dump, given that you did not do any pre-processing of the downloaded data yourself.

modified 3.7 years ago • written 3.7 years ago by mxs

In the FASTA file downloaded from UniProt, I found the following entry:

>tr|G3S368|G3S368_GORGO Uncharacterized protein OS=Gorilla gorilla gorilla GN=CTLA4 PE=4 SV=1
MACLGFQRHKAQLNLATRTWPCTLLFFLLFIPVFCKAMHVAQPAVVLASSRGIASFVCEY
ASPGKATEVRVTVLRQADSQVTEVCAATYMMGNELTFLDDSICTGTSSGNQVNLTIQGLR
AMDTGLYICKVELMYPPPYYLGIGNGTQIYVIDPEPCPDSDFLLAFWVFFVKLSQSLFLL
SSIQVGTQYVLSSIMLKKRSPLTTGVYVKMPPTEPECEKQFQPYFIPIN

(~ line 90585230)

You're right, I didn't do anything to the downloaded data. The FASTA file is structured so that I can get the hit_id, hit_def and sequence, but I cannot figure out what went wrong.

written 3.7 years ago by DavidK0