I am trying to make a blast database for my metatranscriptomic data.
I downloaded the whole bacterial_genome and bacterial_draft folder from NCBI ftp. Then I merged all faa sequences into one big file all.fasta which contains all the protein sequences from these two folders. It's huge, 11G.
I was trying to make a prot database since I want to use blastx with my data against this database.
My command is:
module load blast+/2.2.30 makeblastdb -in Bacteria_all.fasta -out Bacterial_all_blastDB -dbtype prot -parse_seqids
But the problem is there are redundancy in this big fasta file so I got error for this job:
BLAST Database creation error: Error: Duplicate seq_ids are found: REF|YP_001740126.1|
I checked the data and find:
$ grep "YP_001740126.1|" Bacteria_genome_all_faa.fasta >gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacimonas acidaminovorans str. Evry] >gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacamonas acidaminovorans]
Did anybody knows any method/tools to solve this problem? Or you would like to suggest download some other built database for my purpose? Thank you very much!
See How To Remove The Same Sequences In The Fasta Files?, How To Remove The Identical Sequences In The Fasta Files?, ....
I would check the following first: every sequence is most likely in the NR database already. You could download NR and provide blastdb_aliastool with a gi list of all sequences to include, that should save you some work, you just need to generate the gi list.
Thank you! So shall I download the nr database first?
ftp://ftp.ncbi.nlm.nih.gov/blast/db/ shall I download all these NR tgz files and merge all these together?
update_blastdb.pl nrto download the archives. After download, you simply extract them with
Where to get this perl script?
it is included with the blast binaries
Is it necessary to retain both/all sequences with matching IDs, or can you afford to de-duplicate them?
I think maybe I can try to get all unique gi and then just get these sequences?
That should work. What you need to figure out if how do you choose which sequence to pick given a GI matches two different sequences.
I am getting a count file for all gis here:
Should you not be using
Now that you have unique GIs, when a GI matches 2 different sequences, which one are you gonna pick? You might just be better off replacing blank spaces with double underscores (
__) in the identifier. That way, you shouldn't face this problem with
HI I tried this
And what I got is generally two types of replicates:
So I think only select one sequence should be fine. What do you think? Thanks!
I think the second set might be different sequences.
For set 1, you might wanna double check they are identical sequences before deduplicating them
For set 2, you're better off substituting the blank spaces in the IDs with a placeholder (say, a double underscore). makeblastdb complains because it splits the ID by white space and takes the second part as sequence description (not ID). This create complications as after the split, you end up with sequences with the same ID.