Question: Change NCBI fasta file headers to makeblastdb format
2.1 years ago by
chland0 wrote:

Hi, I’ve downloaded several assemblies from RefSeq and will be generating a custom database using the makeblastdb command. The file headers need to be in a specific format to subsequently fetch individual sequences from blastn_out search results using blastdbcmd.

The headers for all FASTA and multi-FASTA files need to be formatted something like this:

>gnl|uniqID|seq1

The assembly files downloaded from the NCBI Assembly site have headers like this:

>NZ_BP006234.1 Organism name strain AC2-110 genome
>NZ_BP006234.1 Organism name strain AC2-110 plasmid pxxx5, complete sequence
or like this:
>NZ_BB45345435.1 Organism name 73645 n_819_l_244_c_44.200821, whole genome shotgun sequence
>NZ_BB45345435.1 Organism name 73645  n_773_l_201_c_51.631840, whole genome shotgun sequence
They should look like this:

I think sed could be used here, but I don't know what expression to give it so that it keeps the right part of the header and removes the rest. Thank you for your help.


What have you tried so far?

written 2.1 years ago by Joe16k

What's your makeblastdb cmdline looking like? And while you're at it can you also post part of your fasta file?

written 2.1 years ago by lieven.sterck7.8k

makeblastdb itself isn't the issue; I can get that to run. The problem is that -parse_seqids isn't parsing the files correctly. After much review of the NCBI BLAST documentation, it appears that the issue is the headers: they're a mess when downloaded from the NCBI Assembly site. I need to get rid of the spaces and the long names so that the files are parsed correctly and I can use blastdbcmd. As for the FASTA, it's a straightforward FASTA or multi-FASTA file, e.g. AAACCTCGGCCC, with lengths ranging from 200 bp to whole-genome assemblies of ~4 MB.

written 2.1 years ago by chland0

Which BLAST version are you running? I have never had any trouble building BLAST databases from FASTA files with headers like the ones in your post.

written 2.1 years ago by lieven.sterck7.8k
2.1 years ago by
h.mon29k wrote:

I have no problem with the following:

gunzip GCF_000820525.2_SMSRO_2016_genomic.fna.gz 
makeblastdb -dbtype nucl -in GCF_000820525.2_SMSRO_2016_genomic.fna -out S.poulsonii -parse_seqids
blastdbcmd -db S.poulsonii -entry NZ_JTLV02000002.1

Are the accessions you provided as examples real? I can't find them. I hope they are not made up; made-up accessions do not make good reproducible examples.
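
For reference, if the headers did have to be rewritten into the gnl|uniqID|identifier form described in the question, a minimal sed sketch could look like the one below. The function name rewrite_headers and the tag uniqID are placeholders chosen for illustration, not anything BLAST requires. The sketch assumes that the first whitespace-delimited token of each header is the identifier to keep and that everything after it can be dropped.

```shell
# Hypothetical helper: rewrite every FASTA header line to ">gnl|<tag>|<first token>".
# Reads FASTA on stdin, writes the result to stdout; sequence lines pass through unchanged.
# Assumption: <tag> contains no '/', '&' or '\' (these are special in a sed replacement).
rewrite_headers() {
    db_tag="$1"
    # Capture the run of non-space characters after '>' and discard the rest of the line.
    sed -E "s/^>([^[:space:]]+).*/>gnl|${db_tag}|\\1/"
}

# Small demonstration on one of the header styles from the question.
printf '>NZ_BP006234.1 Organism name strain AC2-110 genome\nACGT\n' | rewrite_headers uniqID
```

On a real file this would be run as rewrite_headers uniqID < input.fna > output.fna (both file names hypothetical), and the rewritten file passed to makeblastdb in place of the original. Identifiers are expected to be unique within a file when -parse_seqids is used, so headers that repeat the same accession, as in the examples in the question, would still need to be made distinct first.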
