BLAST reference genome indexing
6 weeks ago
bhumm ▴ 30

For some software, I have had to index a reference genome before running blastn. Here is an example command:

makeblastdb -in mydb.fsa -dbtype nucl -parse_seqids


Whenever I do this I retain the original fasta file plus a bunch of extra files made by the command. For example:

mydb.nhr, mydb.nin, mydb.nsd, mydb.nsi, etc.

I haven't found very clear documentation as to what is happening with this command and what file is the final indexed genome that I should be using. Any links, information, or explanation on this is greatly appreciated.

blastn fasta shell • 365 views
6 weeks ago
GenoMax 127k

makeblastdb creates the index from mydb.fsa (your reference file). This produces the set of files you list above. mydb is the basename of your BLAST database, and it is what you pass to the -db option.

BLAST+ provides a tool called makeblastdb that converts a subject FASTA file into an indexed and quickly searchable (but not human-readable) version of the same information, stored in a set of similarly named files (often at least three ending in .pin, .psq, and .phr for protein sequences, and .nin, .nsq, and .nhr for nucleotide sequences). This set of files represents the “database,” and the database name is the shared file name prefix of these files.
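
As a minimal sketch of how this fits together, assuming BLAST+ is installed and on your PATH: query.fa and hits.tsv are hypothetical file names, and -out mydb is passed explicitly so that the index files carry the mydb prefix (as far as I know, without -out the database name defaults to the full input file name).

```shell
ref="mydb.fsa"      # reference FASTA from the question
db="mydb"           # database name: the shared prefix of the index files
query="query.fa"    # hypothetical query FASTA
hits="hits.tsv"     # hypothetical output file

# Only run if BLAST+ and both input files are actually present.
if command -v makeblastdb >/dev/null 2>&1 && [ -f "$ref" ] && [ -f "$query" ]; then
    # Build the nucleotide database; writes mydb.nhr, mydb.nin, mydb.nsq, ...
    makeblastdb -in "$ref" -dbtype nucl -parse_seqids -out "$db"

    # Search by passing the prefix, not any single index file, to -db.
    blastn -query "$query" -db "$db" -outfmt 6 -out "$hits"
fi
```

Once the database exists, blastn reads only the index files, so -db takes the prefix and never the original FASTA.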

The files contain the following information:

nhr: deflines
nin: indices
nsq: sequence data
nnd: GI data
nni: GI indices
nsd: non-GI data
nsi: non-GI indices
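
To see which of these files exist for a given database, something like the following could be used. list_db_files is a hypothetical helper, not part of BLAST+, and the .n?? pattern is an assumption that every nucleotide index file has a three-letter extension starting with n.

```shell
# Print every file that shares the given database prefix and has a
# three-character extension beginning with "n" (nhr, nin, nsq, ...).
list_db_files() {
    prefix="$1"
    for f in "$prefix".n??; do
        # Skip the unexpanded pattern when nothing matches.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

Calling list_db_files mydb in the directory holding the database prints one index file per line; the original mydb.fsa is not listed because its extension does not match.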


Thanks for the explanation and links. So when I call the database for use, I pass the shared prefix of all the 'subfiles', and that invokes the fully indexed database?


That is correct.