How can I prepare a local database with sequences from a specific taxon group?
8.6 years ago
sw.knudsen • 0

I would like to prepare a local database that holds sequences from a specific list of species. If possible, I would like to restrict the sequence length to 0-19000 nt, to make sure I mainly get mitochondrial DNA data. Can anyone point me to a script, preferably in Python, that accepts an input file with a list of species names and returns, in FASTA format, the nucleotide sequences for each of those species that fall within the 0-19000 nt length interval?

For example, my input list could look like this:

Gadus morhua
Clupea harengus
Scomber scombrus

Thanks in advance

gene sequence • 1.6k views

Sorry, I should have tried harder to explain what I am after.

I only listed three species in my example above, but my list could easily comprise 10000 species. For each of those 10000 species I wish to search for the available sequences in the 0-19000 nt length interval and get a list of accession numbers back.

My intention is to avoid searching manually through the taxonomy browser for each single species, especially if my list of taxa holds 10000 different species from different orders and families.

Using the tax-id for 'vertebrates' will not help me either, as I don't want every deposited vertebrate sequence in the 0-19000 nt length interval. I only want a hand-picked selection of vertebrates, taken from a list prepared by me.

The overall idea is that with such a database, built from a list prepared by me, it would be possible to blast against specific geographical regions. For example, the genus Clupea holds both the Atlantic herring (Clupea harengus) and the Pacific herring (Clupea pallasii). If the sequence I want to blast originates from the Atlantic, I have no desire to include a Pacific organism, as blasting with a sequence from an Atlantic-caught herring could falsely return a Pacific herring.

I want to limit my database to only what is relevant for my sample, i.e. I want to prepare a very local and specific database from a hand-picked list of organisms. My list of organisms might be very long, so I don't want to look each species up manually and collect sequences individually.


Use NCBI E-utilities (esearch) with db=taxonomy.
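A rough sketch of how that suggestion might look in Python, using only the standard library. The helper names (`read_species`, `taxonomy_lookup_url`, `nuccore_term`, `nuccore_search_url`, `fetch_id_list`) are made up for illustration. The first URL looks a name up in db=taxonomy as suggested; the second is my own assumption about the follow-up step, searching db=nuccore by organism name with a length filter via the `[SLEN]` field. Nothing here was run against a 10000-species list, so treat it as a starting point only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def read_species(text):
    """Return the non-empty lines of a species list, one name per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def taxonomy_lookup_url(species):
    """esearch URL that looks a species name up in db=taxonomy."""
    return ESEARCH + "?" + urlencode(
        {"db": "taxonomy", "term": species, "retmode": "json"})


def nuccore_term(species, min_len, max_len):
    """Query term: records for one organism within a sequence length range."""
    return f'"{species}"[Organism] AND {min_len}:{max_len}[SLEN]'


def nuccore_search_url(species, min_len, max_len, retmax=100):
    """esearch URL against db=nuccore for one species and length range."""
    return ESEARCH + "?" + urlencode({
        "db": "nuccore",
        "term": nuccore_term(species, min_len, max_len),
        "retmax": retmax,
        "retmode": "json",
    })


def fetch_id_list(url):
    """Perform the request (needs network access) and return the ID list."""
    with urlopen(url) as response:
        return json.load(response)["esearchresult"]["idlist"]


species_list = read_species("Gadus morhua\nClupea harengus\nScomber scombrus\n")
urls = [nuccore_search_url(name, 0, 19000) for name in species_list]
```

Calling `fetch_id_list(url)` for each entry in `urls` would return the matching record IDs, which could then be passed to efetch to download FASTA. For a long list it is probably wise to pause between requests, since NCBI limits how many requests per second it accepts.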

8.6 years ago

Go to NCBI and search for the 3 taxa and the length range.

Download the sequences as FASTA, linearize them, and insert them into any database, e.g. SQLite: https://www.sqlite.org/cvstrac/wiki?p=ImportingFiles
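The linearize-and-insert step could be sketched like this in Python with the built-in `sqlite3` module. The table layout, the function names (`linearize_fasta`, `load_records`) and the two short records are all invented for illustration; a real download would be read from a file rather than a string, and the first word of each header is only assumed to be the accession.

```python
import sqlite3


def linearize_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def load_records(conn, records, max_len=19000):
    """Insert records no longer than max_len; return how many were stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences ("
        "accession TEXT PRIMARY KEY, description TEXT, "
        "length INTEGER, sequence TEXT)")
    rows = []
    for header, seq in records:
        if len(seq) > max_len:
            continue
        parts = header.split(None, 1)
        accession = parts[0] if parts else ""  # assumed: first word is the ID
        rows.append((accession, header, len(seq), seq))
    conn.executemany(
        "INSERT OR REPLACE INTO sequences VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Made-up records, only to show the wrapped lines being joined.
example = """>X1 Gadus morhua made-up record
ACGT
ACGT
>X2 Clupea harengus made-up record
GGCC
""".splitlines()

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
stored = load_records(conn, linearize_fasta(example))
```

Keeping the length in its own column makes it easy to filter again later with plain SQL, for example `SELECT accession FROM sequences WHERE length <= 19000`.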
