makeblastdb problem for bacterial genomes
0
0
Entering edit mode
9.5 years ago
niu2rseq ▴ 90

Hi

I am trying to make a blast database for my metatranscriptomic data.

I downloaded the whole bacterial_genome and bacterial_draft folder from NCBI ftp. Then I merged all faa sequences into one big file all.fasta which contains all the protein sequences from these two folders. It's huge, 11G.

I was trying to make a prot database since I want to use blastx with my data against this database.

My command is:

module load blast+/2.2.30
makeblastdb -in Bacteria_all.fasta -out Bacterial_all_blastDB -dbtype prot -parse_seqids

But the problem is there are redundancy in this big fasta file so I got error for this job:

BLAST Database creation error: Error: Duplicate seq_ids are found: 
REF|YP_001740126.1|

I checked the data and find:

$ grep "YP_001740126.1|" Bacteria_genome_all_faa.fasta 
>gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacamonas acidaminovorans]

Did anybody knows any method/tools to solve this problem? Or you would like to suggest download some other built database for my purpose? Thank you very much!

blast makeblastdb • 6.5k views
ADD COMMENT
2
Entering edit mode
ADD REPLY
2
Entering edit mode

I would check the following first: every sequence is most likely in the NR database already. You could download NR and provide blastdb_aliastool with a gi list of all sequences to include, that should save you some work, you just need to generate the gi list.

ADD REPLY
0
Entering edit mode

Thank you! So shall I download the nr database first?

ftp://ftp.ncbi.nlm.nih.gov/blast/db/ shall I download all these NR tgz files and merge all these together?

ADD REPLY
0
Entering edit mode

Please use update_blastdb.pl nr to download the archives. After download, you simply extract them with tar xzf.

ADD REPLY
0
Entering edit mode

Where to get this perl script?

ADD REPLY
0
Entering edit mode

it is included with the blast binaries

ADD REPLY
1
Entering edit mode

Is it necessary to retain both/all sequences with matching IDs, or can you afford to de-duplicate them?

ADD REPLY
0
Entering edit mode

I think maybe I can try to get all unique gi and then just get these sequences?

ADD REPLY
0
Entering edit mode

That should work. What you need to figure out if how do you choose which sequence to pick given a GI matches two different sequences.

ADD REPLY
0
Entering edit mode

I am getting a count file for all gis here:

grep ">.+\|\d+\|\w+\|.+\|\s" Bacteria_genome_all_faa.fasta |cut -d " " -f 1|sort|uniq -d > repeated_BG_gi.t
ADD REPLY
0
Entering edit mode

Should you not be using uniq -c?

Now that you have unique GIs, when a GI matches 2 different sequences, which one are you gonna pick? You might just be better off replacing blank spaces with double underscores (__) in the identifier. That way, you shouldn't face this problem with makeblastdb.

ADD REPLY
0
Entering edit mode

HI I tried this

grep --color -F -f repeated_BG_gi.txt Bacteria_genome_all_faa.fasta |sort -k 1

And what I got is generally two types of replicates:

>gi|400752905|ref|YP_006529347.1| hypothetical protein USDA257_p05140 [Sinorhizobium fredii USDA 257]
>gi|400752905|ref|YP_006529347.1| hypothetical protein USDA257_p05140 [Sinorhizobium fredii USDA 257]
>gi|400752906|ref|YP_006529348.1| hypothetical protein USDA257_p05150 [Sinorhizobium fredii USDA 257]
>gi|400752906|ref|YP_006529348.1| hypothetical protein USDA257_p05150 [Sinorhizobium fredii USDA 257]
>gi|400752907|ref|YP_006529349.1| hypothetical protein USDA257_p05160 [Sinorhizobium fredii USDA 257]
>gi|400752907|ref|YP_006529349.1| hypothetical protein USDA257_p05160 [Sinorhizobium fredii USDA 257]

Another is:

>gi|218962118|ref|YP_001741893.1| hypothetical protein CLOAM1860 [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962119|ref|YP_001741894.1| hypothetical protein; putative signal peptide [Candidatus Cloacamonas acidaminovorans]
>gi|218962119|ref|YP_001741894.1| hypothetical protein; putative signal peptide [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962120|ref|YP_001741895.1| putative Periplasmic serine proteases [Candidatus Cloacamonas acidaminovorans]
>gi|218962120|ref|YP_001741895.1| putative Periplasmic serine proteases [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962121|ref|YP_001741896.1| hypothetical protein CLOAM1863 [Candidatus Cloacamonas acidaminovorans]
>gi|218962121|ref|YP_001741896.1| hypothetical protein CLOAM1863 [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962122|ref|YP_001741897.1| conserved hypothetical protein [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962122|ref|YP_001741897.1| hypothetical protein CLOAM1864 [Candidatus Cloacamonas acidaminovorans]

So I think only select one sequence should be fine. What do you think? Thanks!

ADD REPLY
0
Entering edit mode

I think the second set might be different sequences.

For set 1, you might wanna double check they are identical sequences before deduplicating them

For set 2, you're better off substituting the blank spaces in the IDs with a placeholder (say, a double underscore). makeblastdb complains because it splits the ID by white space and takes the second part as sequence description (not ID). This create complications as after the split, you end up with sequences with the same ID.

ADD REPLY

Login before adding your answer.

Traffic: 1545 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6