Question

makeblastdb problem for bacterial genomes

0

Entering edit mode

10.7 years ago

niu2rseq ▴ 90

Hi

I am trying to make a blast database for my metatranscriptomic data.

I downloaded the whole bacterial_genome and bacterial_draft folder from NCBI ftp. Then I merged all faa sequences into one big file all.fasta which contains all the protein sequences from these two folders. It's huge, 11G.

I was trying to make a prot database since I want to use blastx with my data against this database.

My command is:

module load blast+/2.2.30
makeblastdb -in Bacteria_all.fasta -out Bacterial_all_blastDB -dbtype prot -parse_seqids

But the problem is there are redundancy in this big fasta file so I got error for this job:

BLAST Database creation error: Error: Duplicate seq_ids are found: 
REF|YP_001740126.1|

I checked the data and find:

$ grep "YP_001740126.1|" Bacteria_genome_all_faa.fasta 
>gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacamonas acidaminovorans]

Did anybody knows any method/tools to solve this problem? Or you would like to suggest download some other built database for my purpose? Thank you very much!

blast makeblastdb • 7.3k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by niu2rseq ▴ 90

2

Entering edit mode

See How To Remove The Same Sequences In The Fasta Files?, How To Remove The Identical Sequences In The Fasta Files?, ....

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Pierre Lindenbaum 166k

2

Entering edit mode

I would check the following first: every sequence is most likely in the NR database already. You could download NR and provide blastdb_aliastool with a gi list of all sequences to include, that should save you some work, you just need to generate the gi list.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Michael 56k

0

Entering edit mode

Thank you! So shall I download the nr database first?

ftp://ftp.ncbi.nlm.nih.gov/blast/db/ shall I download all these NR tgz files and merge all these together?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by niu2rseq ▴ 90

0

Entering edit mode

Please use update_blastdb.pl nr to download the archives. After download, you simply extract them with tar xzf.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Michael 56k

0

Entering edit mode

Where to get this perl script?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by niu2rseq ▴ 90

0

Entering edit mode

it is included with the blast binaries

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by Michael 56k

1

Entering edit mode

Is it necessary to retain both/all sequences with matching IDs, or can you afford to de-duplicate them?

ADD REPLY • link 3.4 years ago by Ram 45k

0

Entering edit mode

I think maybe I can try to get all unique gi and then just get these sequences?

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by niu2rseq ▴ 90

0

Entering edit mode

That should work. What you need to figure out if how do you choose which sequence to pick given a GI matches two different sequences.

ADD REPLY • link 3.4 years ago by Ram 45k

0

Entering edit mode

I am getting a count file for all gis here:

grep ">.+\|\d+\|\w+\|.+\|\s" Bacteria_genome_all_faa.fasta |cut -d " " -f 1|sort|uniq -d > repeated_BG_gi.t

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by niu2rseq ▴ 90

0

Entering edit mode

Should you not be using uniq -c?

Now that you have unique GIs, when a GI matches 2 different sequences, which one are you gonna pick? You might just be better off replacing blank spaces with double underscores (__) in the identifier. That way, you shouldn't face this problem with makeblastdb.

ADD REPLY • link 3.4 years ago by Ram 45k

0

Entering edit mode

HI I tried this

grep --color -F -f repeated_BG_gi.txt Bacteria_genome_all_faa.fasta |sort -k 1

And what I got is generally two types of replicates:

>gi|400752905|ref|YP_006529347.1| hypothetical protein USDA257_p05140 [Sinorhizobium fredii USDA 257]
>gi|400752905|ref|YP_006529347.1| hypothetical protein USDA257_p05140 [Sinorhizobium fredii USDA 257]
>gi|400752906|ref|YP_006529348.1| hypothetical protein USDA257_p05150 [Sinorhizobium fredii USDA 257]
>gi|400752906|ref|YP_006529348.1| hypothetical protein USDA257_p05150 [Sinorhizobium fredii USDA 257]
>gi|400752907|ref|YP_006529349.1| hypothetical protein USDA257_p05160 [Sinorhizobium fredii USDA 257]
>gi|400752907|ref|YP_006529349.1| hypothetical protein USDA257_p05160 [Sinorhizobium fredii USDA 257]

Another is:

>gi|218962118|ref|YP_001741893.1| hypothetical protein CLOAM1860 [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962119|ref|YP_001741894.1| hypothetical protein; putative signal peptide [Candidatus Cloacamonas acidaminovorans]
>gi|218962119|ref|YP_001741894.1| hypothetical protein; putative signal peptide [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962120|ref|YP_001741895.1| putative Periplasmic serine proteases [Candidatus Cloacamonas acidaminovorans]
>gi|218962120|ref|YP_001741895.1| putative Periplasmic serine proteases [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962121|ref|YP_001741896.1| hypothetical protein CLOAM1863 [Candidatus Cloacamonas acidaminovorans]
>gi|218962121|ref|YP_001741896.1| hypothetical protein CLOAM1863 [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962122|ref|YP_001741897.1| conserved hypothetical protein [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218962122|ref|YP_001741897.1| hypothetical protein CLOAM1864 [Candidatus Cloacamonas acidaminovorans]

So I think only select one sequence should be fine. What do you think? Thanks!

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.7 years ago by niu2rseq ▴ 90

0

Entering edit mode

I think the second set might be different sequences.

For set 1, you might wanna double check they are identical sequences before deduplicating them

For set 2, you're better off substituting the blank spaces in the IDs with a placeholder (say, a double underscore). makeblastdb complains because it splits the ID by white space and takes the second part as sequence description (not ID). This create complications as after the split, you end up with sequences with the same ID.

ADD REPLY • link 3.4 years ago by Ram 45k