Error with makeblastdb on TREMBL data FASTA file?
8.7 years ago
DavidK • 0

I created a BLAST database from the TrEMBL FASTA file downloaded from UniProt.

This is the record as stored in my database (built with the -parse_seqids option), retrieved with the command blast/bin/blastdbcmd -db my_blastdb -dbtype prot -entry 'G3S368':

>tr|G3S368|G3S368_GORGO Uncharacterized protein OS=Gorilla gorilla gorilla GN=CTLA4 PE=4 SV=1
RYSVGDNDSNNVSIIDTSTNSVVGTVNVGLSTYNVAFTPDGKKIYATNSRNNTTSVIDVTTNKVTATVPTGDHPTDIAVS
PDGNKVYITNTGSNDLSVIDVTTNKVTATVPVGDGPCGVAVTLDGKKAYVPNKRSNTVSVINATTNTVTATVPVGITPLG
VAVTPDGNKVYVTNAESGNVSVIDTATNKVTATVNTGKYYMNYPVEVVIVPFMDSNMTDQSIGATSNAT

The same record on the UniProt website:

>tr|G3S368|G3S368_GORGO Uncharacterized protein OS=Gorilla gorilla gorilla GN=CTLA4 PE=4 SV=1
MACLGFQRHKAQLNLATRTWPCTLLFFLLFIPVFCKAMHVAQPAVVLASSRGIASFVCEY
ASPGKATEVRVTVLRQADSQVTEVCAATYMMGNELTFLDDSICTGTSSGNQVNLTIQGLR
AMDTGLYICKVELMYPPPYYLGIGNGTQIYVIDPEPCPDSDFLLAFWVFFVKLSQSLFLL
SSIQVGTQYVLSSIMLKKRSPLTTGVYVKMPPTEPECEKQFQPYFIPIN

Any idea? I don't think this is the only record affected.

I checked, and the sequence in the FASTA file is the correct one (it matches the UniProt website).

Update

I built the database a second time and did not run into the same issue: the sequence in the database is now identical to the one in the original FASTA file.

This time, I did not use the following option: -max_file_sz 5GB

Could this option have caused the issue?
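To find out how many records are affected, and to confirm that a rebuilt database is clean, the original FASTA can be compared against a FASTA dump of the database. The sketch below is only an illustration: it assumes the dump was produced separately (for example with blastdbcmd -db my_blastdb -entry all), that both files use UniProt-style headers, and that the data fits in memory, which is realistic for a subset rather than for all of TrEMBL. The function names are hypothetical. Wrapped lines are joined before comparing, since the two sources wrap at different widths (60 vs 80 residues in the records above).

```python
import io


def accession(header):
    """Extract the accession from a header like '>tr|G3S368|G3S368_GORGO ...'."""
    fields = header[1:].split(None, 1)
    token = fields[0] if fields else ""
    parts = token.split("|")
    # UniProt headers are db|ACCESSION|ENTRY_NAME; otherwise fall back to the first token.
    return parts[1] if len(parts) >= 3 else token


def read_fasta(handle):
    """Return {accession: sequence}, joining wrapped sequence lines."""
    records, acc, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if acc is not None:
                records[acc] = "".join(chunks)
            acc, chunks = accession(line), []
        else:
            chunks.append(line)
    if acc is not None:
        records[acc] = "".join(chunks)
    return records


def mismatches(original, dumped):
    """Accessions present in both sources whose sequences differ."""
    return sorted(a for a in original if a in dumped and original[a] != dumped[a])


# With real files (paths are placeholders):
#   with open("original.fasta") as a, open("db_dump.fasta") as b:
#       print(mismatches(read_fasta(a), read_fasta(b)))
```

An empty list from mismatches would mean every shared accession carries the same sequence in both files.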

uniprot trembl blast database • 2.1k views

I don't quite get what you are pointing out as the underlying problem. Is it that the IDs don't match? That the sequences are not the same? And how exactly is this a makeblastdb issue?

mxs


The sequences are different, but the header says it is the same protein. Why does my database hold this sequence rather than the one shown on UniProt? Is it an error introduced by makeblastdb? I checked the FASTA file and found the record exactly as UniProt shows it. I wondered whether the file might contain two sequences with the same header.

That does not seem to be the case: the FASTA file contains only one sequence for this identifier.
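For anyone wanting to rule out duplicated identifiers across the whole file rather than for a single record, a minimal sketch follows. It assumes UniProt-style headers (db|ACCESSION|ENTRY_NAME); the function names and the file name in the comment are hypothetical. Only header lines are inspected, so the file is streamed rather than loaded.

```python
import io
from collections import Counter


def accession(header):
    """Extract the accession from a header like '>tr|G3S368|G3S368_GORGO ...'."""
    fields = header[1:].split(None, 1)
    token = fields[0] if fields else ""
    parts = token.split("|")
    # Fall back to the whole first token when the header is not db|ACCESSION|NAME.
    return parts[1] if len(parts) >= 3 else token


def duplicated_accessions(handle):
    """Return {accession: count} for accessions that appear on more than one header."""
    counts = Counter(accession(line.rstrip("\n")) for line in handle if line.startswith(">"))
    return {acc: n for acc, n in counts.items() if n > 1}


# With a real file (name is a placeholder):
#   with open("uniprot_trembl.fasta") as fh:
#       print(duplicated_accessions(fh))
```

An empty dictionary means no accession occurs twice, which is what was observed for G3S368.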


Can you locate G3S368 in your downloaded TrEMBL file? It is possible that something went wrong during makeblastdb, but that is highly unlikely. Given that you did not pre-process the downloaded data yourself, my guess is that some kind of mix-up happened between the website and the TrEMBL dump.


In the FASTA file downloaded from Uniprot, I've found the following entry:

>tr|G3S368|G3S368_GORGO Uncharacterized protein OS=Gorilla gorilla gorilla GN=CTLA4 PE=4 SV=1
MACLGFQRHKAQLNLATRTWPCTLLFFLLFIPVFCKAMHVAQPAVVLASSRGIASFVCEY
ASPGKATEVRVTVLRQADSQVTEVCAATYMMGNELTFLDDSICTGTSSGNQVNLTIQGLR
AMDTGLYICKVELMYPPPYYLGIGNGTQIYVIDPEPCPDSDFLLAFWVFFVKLSQSLFLL
SSIQVGTQYVLSSIMLKKRSPLTTGVYVKMPPTEPECEKQFQPYFIPIN

(at approximately line 90585230 of the file)

You're right, I didn't do anything to the downloaded data. The FASTA file is structured so that I can extract hit_id, hit_def and the sequence, yet I cannot figure out what went wrong.
