Question: Change NCBI fasta file headers to makeblastdb format
13 months ago, chland wrote:

Hi, I’ve downloaded several assemblies from RefSeq and will be generating a custom database using the makeblastdb command. The file headers need to be in a specific format to subsequently fetch individual sequences from blastn_out search results using blastdbcmd.

The headers for all FASTA and multi-FASTA files need to be formatted something like this:

>gnl|uniqID|seq1

The assembly files downloaded from the NCBI Assembly site have headers like this:

>NZ_BP006234.1 Organism name strain AC2-110 genome
>NZ_BP006234.1 Organism name strain AC2-110 plasmid pxxx5, complete sequence
or like this:
>NZ_BB45345435.1 Organism name 73645 n_819_l_244_c_44.200821, whole genome shotgun sequence
>NZ_BB45345435.1 Organism name 73645  n_773_l_201_c_51.631840, whole genome shotgun sequence
They should look like this:
>NZ_BP006234.1|seq1
>NZ_BP006234.1|seq2
>NZ_BP006234.1|seq3

I think sed can be used here, but I don't know what expression to give it so that it removes the unwanted part of the header and keeps the correct part. Thank you for helping.

modified 13 months ago by h.mon • written 13 months ago by chland
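For the header rewrite asked about in the question, a counter is awkward in sed, so the following is a minimal awk sketch instead. It keeps the first whitespace-delimited token of each header (the accession) and appends a running `|seqN` suffix. The function name `rename_headers` and the file names are hypothetical, and whether `makeblastdb -parse_seqids` accepts identifiers of this shape has not been verified here.

```shell
# Rewrite each FASTA header to ">ACCESSION|seqN".
# ACCESSION is the first whitespace-delimited token of the original header;
# N is a counter that increases by one for every header in the input.
# rename_headers is a hypothetical helper name, not part of BLAST+.
rename_headers() {
    awk '
        /^>/ { n++; id = substr($1, 2); print ">" id "|seq" n; next }  # header line
        { print }                                                      # sequence line, unchanged
    ' "$@"
}

# Example usage (placeholder file names):
# rename_headers assembly.fna > assembly.renamed.fna
```

The counter runs over the whole input, so headers that share an accession still end up with distinct identifiers.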

What have you tried so far?

written 13 months ago by jrj.healey

What's your makeblastdb cmdline looking like? And while you're at it can you also post part of your fasta file?

written 13 months ago by lieven.sterck

The makeblastdb command itself isn't the issue, as I can get that to run. It's that -parse_seqids isn't parsing the files correctly. After much review of the NCBI BLAST documentation, it appears that the issue is the headers: they're a mess when downloaded from the NCBI Assembly site. I need to get rid of the spaces and the long names to have the files parsed correctly using blastdbcmd. As for the FASTA, it's a straightforward FASTA or multi-FASTA file, e.g. AAACCTCGGCCC, with lengths ranging from 200 bp to whole-genome assemblies of ~4 MB.

written 13 months ago by chland
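If the goal is only to drop the spaces and long names from the headers, one possible sed sketch is shown below. It keeps everything up to the first whitespace character on header lines and leaves sequence lines alone. The function name `strip_descriptions` and the file names are hypothetical.

```shell
# Trim each FASTA header to its first whitespace-delimited token,
# e.g. ">NZ_BP006234.1 Organism name ..." becomes ">NZ_BP006234.1".
# Lines that do not start with ">" are passed through unchanged.
# strip_descriptions is a hypothetical helper name.
strip_descriptions() {
    sed 's/^\(>[^[:space:]]*\).*/\1/' "$@"
}

# Example usage (placeholder file names):
# strip_descriptions assembly.fna > assembly.short.fna
```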

Which BLAST version are you running? I never seem to have had any trouble formatting BLAST databases from FASTA files with headers like the ones you mention in your post.

written 13 months ago by lieven.sterck
13 months ago, h.mon wrote:

I have no problem with the following:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/820/525/GCF_000820525.2_SMSRO_2016/GCF_000820525.2_SMSRO_2016_genomic.fna.gz
gunzip GCF_000820525.2_SMSRO_2016_genomic.fna.gz 
makeblastdb -dbtype nucl -in GCF_000820525.2_SMSRO_2016_genomic.fna -out S.poulsonii -parse_seqids
blastdbcmd -db S.poulsonii -entry NZ_JTLV02000002.1

Are the accessions you provided as example real? I can't find them. I hope they are not made up - they are not good reproducible examples if they are made up.

written 13 months ago by h.mon
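The example headers in the question repeat the same accession on more than one line. Assuming that reflects the actual files, a quick way to check whether any identifier occurs more than once might look like the sketch below. The function name `find_duplicate_ids` and the file name are hypothetical.

```shell
# Print every sequence identifier that appears on more than one header line.
# The identifier is taken to be the first whitespace-delimited token of the
# header, with the leading ">" removed.
# find_duplicate_ids is a hypothetical helper name.
find_duplicate_ids() {
    awk '/^>/ { print substr($1, 2) }' "$@" | sort | uniq -d
}

# Example usage (placeholder file name):
# find_duplicate_ids assembly.fna
```

An empty result means every header starts with a distinct identifier.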