Question: makeblastdb problem for bacterial genomes
0
gravatar for niu2rseq
4.4 years ago by
niu2rseq70
United States
niu2rseq70 wrote:

Hi 

 

I am trying to make a blast database for my metatranscriptomic data. 

I downloaded the whole bacterial_genome and bacterial_draft folder from NCBI ftp. Then I merged all faa sequences into one big file all.fasta which contains all the protein sequences from these two folders. It's huge, 11G. 

I was trying to make a prot database since I want to use blastx with my data against this database. 

My command is: 

module load blast+/2.2.30
makeblastdb -in Bacteria_all.fasta -out Bacterial_all_blastDB -dbtype prot -parse_seqids

But the problem is there are redundancy in this big fasta file so I got error for this job:

 

BLAST Database creation error: Error: Duplicate seq_ids are found: 
REF|YP_001740126.1|

I checked the data and find:

$ grep "YP_001740126.1|" Bacteria_genome_all_faa.fasta 
>gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacamonas acidaminovorans]

Did anybody knows any method/tools to solve this problem? Or you would like to suggest download some other built database for my purpose? Thank you very much!

blast makeblastdb • 4.2k views
ADD COMMENTlink modified 4.4 years ago by RamRS20k • written 4.4 years ago by niu2rseq70
2

I would check the following first: every sequence is most likely in the NR database already. You could download NR and provide blastdb_aliastool with a gi list of all sequences to include, that should save you some work, you just need to generate the gi list. 

ADD REPLYlink written 4.4 years ago by Michael Dondrup45k

Thank you! So shall I download the nr database first?

ftp://ftp.ncbi.nlm.nih.gov/blast/db/ shall I download all these NR tgz files and merge all these together?

ADD REPLYlink written 4.4 years ago by niu2rseq70

please use

update_blastdb.pl nr

to download the archives. After download, you simply extract them with

tar xzf 

 

 

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Michael Dondrup45k

where to get this perl script? 

ADD REPLYlink written 4.4 years ago by niu2rseq70

it is included with the blast binaries

ADD REPLYlink written 4.4 years ago by Michael Dondrup45k
1

Is it necessary to retain both/all sequences with matching IDs, or can you afford to de-duplicate them? 

ADD REPLYlink written 4.4 years ago by RamRS20k

I think maybe I can try to get all unique gi and then just get these sequences?

ADD REPLYlink written 4.4 years ago by niu2rseq70

That should work. What you need to figure out if how do you choose which sequence to pick given a GI matches two different sequences.

ADD REPLYlink written 4.4 years ago by RamRS20k

I am getting a count file for all gis here:

 

grep ">.+\|\d+\|\w+\|.+\|\s" Bacteria_genome_all_faa.fasta |cut -d " " -f 1|sort|uniq -d > repeated_BG_gi.t
ADD REPLYlink written 4.4 years ago by niu2rseq70

Should you not be using uniq -c?

Now that you have unique GIs, when a GI matches 2 different sequences, which one are you gonna pick? You might just be better off replacing blank spaces with double underscores (__) in the identifier. That way, you shouldn't face this problem with makeblastdb.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by RamRS20k

HI I tried this 

 

grep --color -F -f repeated_BG_gi.txt Bacteria_genome_all_faa.fasta |sort -k 1

 

And what I got is generally two types of replicates:

>gi|400752905|ref|YP_006529347.1| hypothetical protein USDA257_p05140 [Sinorhizobium fredii USDA 257]

>gi|400752905|ref|YP_006529347.1| hypothetical protein USDA257_p05140 [Sinorhizobium fredii USDA 257]

>gi|400752906|ref|YP_006529348.1| hypothetical protein USDA257_p05150 [Sinorhizobium fredii USDA 257]

>gi|400752906|ref|YP_006529348.1| hypothetical protein USDA257_p05150 [Sinorhizobium fredii USDA 257]

>gi|400752907|ref|YP_006529349.1| hypothetical protein USDA257_p05160 [Sinorhizobium fredii USDA 257]

>gi|400752907|ref|YP_006529349.1| hypothetical protein USDA257_p05160 [Sinorhizobium fredii USDA 257]

 

Another is: 

 

>gi|218962118|ref|YP_001741893.1| hypothetical protein CLOAM1860 [Candidatus Cloacimonas acidaminovorans str. Evry]

>gi|218962119|ref|YP_001741894.1| hypothetical protein; putative signal peptide [Candidatus Cloacamonas acidaminovorans]

>gi|218962119|ref|YP_001741894.1| hypothetical protein; putative signal peptide [Candidatus Cloacimonas acidaminovorans str. Evry]

>gi|218962120|ref|YP_001741895.1| putative Periplasmic serine proteases [Candidatus Cloacamonas acidaminovorans]

>gi|218962120|ref|YP_001741895.1| putative Periplasmic serine proteases [Candidatus Cloacimonas acidaminovorans str. Evry]

>gi|218962121|ref|YP_001741896.1| hypothetical protein CLOAM1863 [Candidatus Cloacamonas acidaminovorans]

>gi|218962121|ref|YP_001741896.1| hypothetical protein CLOAM1863 [Candidatus Cloacimonas acidaminovorans str. Evry]

>gi|218962122|ref|YP_001741897.1| conserved hypothetical protein [Candidatus Cloacimonas acidaminovorans str. Evry]

>gi|218962122|ref|YP_001741897.1| hypothetical protein CLOAM1864 [Candidatus Cloacamonas acidaminovorans]

 

so I think only select one sequence should be fine. What do you think? Thanks!

ADD REPLYlink written 4.4 years ago by niu2rseq70

I think the second set might be different sequences.

For set 1, you might wanna double check they are identical sequences before deduplicating them

For set 2, you're better off substituting the blank spaces in the IDs with a placeholder (say, a double underscore). makeblastdb complains because it splits the ID by white space and takes the second part as sequence description (not ID). This create complications as after the split, you end up with sequences with the same ID.

ADD REPLYlink written 4.4 years ago by RamRS20k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1005 users visited in the last hour