I am trying to make a blast database for my metatranscriptomic data.
I downloaded the whole bacterial_genome and bacterial_draft folders from the NCBI FTP site. Then I merged all the faa sequences into one big file, all.fasta, which contains all the protein sequences from these two folders. It is huge: 11 GB.
I am trying to build a protein database because I want to run blastx with my data against it.
My command is:
module load blast+/2.2.30
makeblastdb -in Bacteria_all.fasta -out Bacterial_all_blastDB -dbtype prot -parse_seqids
The problem is that there is redundancy in this big FASTA file, so the job failed with this error:
BLAST Database creation error: Error: Duplicate seq_ids are found: REF|YP_001740126.1|
I checked the data and found:
$ grep "YP_001740126.1|" Bacteria_genome_all_faa.fasta
>gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacimonas acidaminovorans str. Evry]
>gi|218960351|ref|YP_001740126.1| chromosomal replication initiation protein [Candidatus Cloacamonas acidaminovorans]
Does anybody know a method or tool to solve this problem? Or would you suggest downloading a prebuilt database for my purpose instead? Thank you very much!
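For reference, one possible approach is to keep only the first record for each sequence ID before running makeblastdb. The following is a minimal sketch, not a tested pipeline: it assumes the ID is the first whitespace-delimited token of each header line (for example gi|218960351|ref|YP_001740126.1|) and that records sharing an ID carry the same sequence, so dropping the later copies loses nothing. The function name dedup_fasta is made up for illustration.

```python
import sys


def dedup_fasta(lines):
    """Yield FASTA lines, keeping only the first record seen for each ID.

    Assumption: the ID is the first whitespace-delimited token after '>'.
    Only the IDs are held in memory, so the file itself is streamed.
    """
    seen = set()
    keep = False  # lines before the first header are dropped
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            keep = seq_id not in seen
            seen.add(seq_id)
        if keep:
            yield line


if __name__ == "__main__":
    # Usage: python <this script> input.fasta output.fasta
    if len(sys.argv) == 3:
        with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
            dst.writelines(dedup_fasta(src))
```

If the script were saved as dedup_fasta.py (a hypothetical name), it could be run as `python dedup_fasta.py Bacteria_all.fasta Bacteria_all.dedup.fasta`, and the output file passed to makeblastdb with the same options as above. The output file name is also just an example.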