Question

makeblastdb of refseq_genomic for sequences with subspecies

8.0 years ago

Spencer D. • 0

I am running a lot of BLAST+ searches against a local refseq_genomic db that I have set up in a UNIX environment. My big issue right now is that I only want to look at sequences that have subspecies definitions. For example, I'm interested in Cervus nippon taiouanus, but not in a sequence defined just as Cervus nippon. Is there a good way to make a GenBank query that will pull out just the sequences with subspecies values? I'm having trouble finding info on how to make such a query.

Thanks!

blast python unix • 1.4k views

ADD COMMENT • link updated 8.0 years ago by piet ★ 1.8k • written 8.0 years ago by Spencer D. • 0

Did you find taxids in the latest BLAST+ indexes?
When I recently checked nt using blastdbcmd, there were no taxids as far as I could tell.
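
A rough way to repeat that check is sketched below in Python. It assumes `blastdbcmd` is on the PATH and that the `%T` output field prints `0` or `N/A` when a sequence has no taxid; that placeholder convention is an assumption, so adjust `MISSING` to whatever your BLAST+ version prints. The function names are illustrative.

```python
import subprocess
from collections import Counter

# Placeholder values assumed to mean "no taxid recorded" (an assumption;
# check what your BLAST+ version actually prints).
MISSING = {"", "0", "N/A"}

def run_blastdbcmd(db):
    """Return one 'accession taxid' line per sequence in a local BLAST db."""
    result = subprocess.run(
        ["blastdbcmd", "-db", db, "-entry", "all", "-outfmt", "%a %T"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

def taxid_coverage(lines):
    """Count how many 'accession taxid' lines carry a usable taxid."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        taxid = fields[1] if len(fields) > 1 else ""
        key = "without_taxid" if taxid in MISSING else "with_taxid"
        counts[key] += 1
    return counts
```

Calling `taxid_coverage(run_blastdbcmd("refseq_genomic"))` would then show whether the index carries taxids at all. Dumping every entry of a large database this way is slow and memory-hungry, so a small test database is a better first target.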

ADD REPLY • link 8.0 years ago by GenoMax 142k

score 1 · Answer 1 · 2016-05-18

8.0 years ago

piet ★ 1.8k

In my experience, taxonomic metadata on sequences in GenBank is not reliable. Submitters are free to assign any taxon they want to a sequence.

In your case, only a few sequences are linked to Formosan sika deer, see https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=37550&lvl=3&lin=f&keep=1&srchmode=1&unlock

Thus you will not need any sophisticated procedure to retrieve all of them.

ADD COMMENT • link 8.0 years ago by piet ★ 1.8k

Hi,

Thanks a lot for the info. I assumed as much for the taxonomic metadata (I've observed differences at the genus level even between sequence names and GenBank entries). I'm actually looking to iteratively search the entirety of the refseq_genomic DB. I devised a hacky workaround using a Python script. I pulled all of the RefSeq sequences from GenBank and split them into individual files named by sequence name. Then I used os.listdir() to list all the sequences and selected only those that fit my criteria for subspecies (matched by regex against the filename). Finally, I compiled the selected sequences back into a single .fsa file and built a BLAST database from it, which I then used for querying.
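
For anyone wanting to try something similar, here is a minimal sketch of that kind of filter. It is not the script I actually ran: it reads one multi-FASTA file instead of per-sequence files, and it assumes deflines of the form `>accession Genus species subspecies ...`. The function names, the regex, and the `NOT_EPITHETS` stop list are illustrative assumptions, and the heuristic will misclassify some names.

```python
import re

# Words that often follow a binomial in a defline but are not subspecies
# epithets. Illustrative and certainly not exhaustive.
NOT_EPITHETS = {
    "mitochondrion", "chromosome", "isolate", "strain", "breed", "plastid",
    "chloroplast", "plasmid", "unplaced", "genomic", "genome", "haplotype",
    "voucher", "cultivar", "ecotype", "linkage", "clone", "scaffold",
}

# accession, Genus, species, optional "subsp.", then a third lowercase word
TRINOMIAL = re.compile(
    r"^\S+\s+([A-Z][a-z]+)\s+([a-z]+)\s+(?:subsp\.\s+)?([a-z]+)\b"
)

def subspecies_name(defline):
    """Return 'Genus species subspecies' if the defline looks trinomial, else None."""
    m = TRINOMIAL.match(defline.lstrip(">"))
    if m is None or m.group(3) in NOT_EPITHETS:
        return None
    return " ".join(m.groups())

def filter_fasta(in_path, out_path):
    """Copy only records whose defline names a subspecies; return how many were kept."""
    kept = 0
    keep = False
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                keep = subspecies_name(line) is not None
                if keep:
                    kept += 1
            if keep:
                dst.write(line)
    return kept
```

The filtered file can then be turned into a database with something like `makeblastdb -in subspecies.fsa -dbtype nucl -out refseq_subspecies` (file and database names here are placeholders).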

Probably not the most comprehensive way to do it, but it yielded a lot more usable data from a much smaller data set than I was previously working with. I still have to go through the results of my data-crunching to make sure I used the right sequences, but I was able to reduce the number of sequences I was working through from >6000 (mitochondrial sequences only) to <300.

ADD REPLY • link 8.0 years ago by Spencer D. • 0