Question: How do I load more than 200 nucleotide EST sequences from an NCBI search into a FASTA file?
steven70 wrote:

example: http://www.ncbi.nlm.nih.gov/nucest/?term=txid6200[Organism:exp]

I would like to load all of the sequences from that search into one FASTA file. I know that the Entrez utilities exist, but they are not installed on the server I am working on.

Also, how do the Entrez utilities write their output to a file? I know I would want something like the code below, but I don't want to flood the terminal if I do install them.

esearch -db est -query "txid6200[Organism:exp] " | \
efetch -format fasta
modified 5.3 years ago • written 5.3 years ago by steven70
Have a look at this post: "Obtaining sequence from Bioproject IDs using biopython gives unknown sequence". It should help. I can help you with Biopython. Also, your link doesn't go to the actual results page.
written 5.3 years ago by Kirill290

Whoops, sorry about that. I'm familiar with Python, but is there any way to use Biopython for the function I need without using Entrez?

written 5.3 years ago by steven70

You can use Biopython to access Entrez directly:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc108

written 5.3 years ago by anderspitman60

steven70 wrote:

Problem solved with BioPython! Thanks for the help.

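Edit: here's my code:

# Command-line usage: python entrez.py database searchterm output.fasta

from Bio import Entrez, SeqIO
import sys
import os

dataBase = sys.argv[1]
searchTerm = sys.argv[2]
outFile = sys.argv[3]

# NCBI asks for a contact email with Entrez requests.
Entrez.email = "your.email@goes.here"

# Search the database and collect the IDs of up to 100,000 matches.
handle = Entrez.esearch(db=dataBase, retmax=100000, term=searchTerm)
record = Entrez.read(handle)
handle.close()

with open(outFile, 'w') as w:
    for seq_id in record["IdList"]:
        # Fetch one record per request.
        fetch_handle = Entrez.efetch(db=dataBase, id=seq_id, rettype="fasta", retmode="text")
        fetch_record = SeqIO.read(fetch_handle, "fasta")
        fetch_handle.close()
        # Writing to a filename overwrites it, so copy the scratch file into the output each time.
        SeqIO.write(fetch_record, "current_seq.fasta", "fasta")
        with open("current_seq.fasta") as scratch:
            for line in scratch:
                w.write(line)

# Clean up the scratch file (it does not exist if the search returned nothing).
if os.path.exists("current_seq.fasta"):
    os.remove("current_seq.fasta")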
modified 5.3 years ago • written 5.3 years ago by steven70

@Steve Do you mind posting some Python code as your answer? I am personally quite interested to see the code, but more importantly I feel it will benefit the wider community, and you can point people to your answer if they have a similar problem. I believe the key purpose of this and similar forums is to seek answers, and by posting your code you will help others find answers or give them a starting point.

written 5.3 years ago by Kirill290

@Kirill, no problem, I edited in my code. It's a bit dirty: SeqIO overwrites the file on each write, and the quick workaround I thought of was to copy the contents of the working FASTA file to the final output FASTA file after each write. I'm sure there's a more efficient way to do this; it took a couple of hours to download ~100,000 sequences.

modified 5.3 years ago • written 5.3 years ago by steven70
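
On the efficiency point, here is a rough sketch of one possible speed-up; it is not steven's code. It assumes that efetch is given a comma-separated list of IDs, so a single request returns many FASTA records, and that the returned text is written straight to the output file, which removes the need for a scratch file. The names batches and download_fasta and the batch size of 200 are made up for illustration, and the script has not been run against the EST search in the question. (If you want to keep SeqIO in the loop, SeqIO.write also accepts an open file handle instead of a filename.)

```python
import sys


def batches(ids, size):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def download_fasta(database, term, out_path, email, batch_size=200):
    """Write the FASTA records matching `term` to `out_path` (hypothetical helper)."""
    # Imported here so that batches() can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email

    # Same search as in the original script.
    handle = Entrez.esearch(db=database, term=term, retmax=100000)
    id_list = Entrez.read(handle)["IdList"]
    handle.close()

    with open(out_path, "w") as out:
        for batch in batches(id_list, batch_size):
            # One request per batch of IDs instead of one request per ID.
            fetch = Entrez.efetch(db=database, id=",".join(batch),
                                  rettype="fasta", retmode="text")
            # The response is already FASTA text, so copy it as-is.
            out.write(fetch.read())
            fetch.close()


# Usage: python entrez_batch.py database searchterm output.fasta your@email
if __name__ == "__main__" and len(sys.argv) == 5:
    download_fasta(*sys.argv[1:5])
```

With 200 IDs per request, ~100,000 sequences would take about 500 requests rather than 100,000, which is where most of the time should be saved.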

@Steve This is great! I'm sure a lot of other users will find it very useful.

written 5.3 years ago by Kirill290