Question

How do I load more than 200 nucleotide EST sequences from an NCBI search into a FASTA file?

8.9 years ago

steven ▴ 70

Example: http://www.ncbi.nlm.nih.gov/nucest/?term=txid6200[Organism:exp]

I would like to load all of the sequences from that search into one FASTA file. I know that the Entrez utilities exist, but they are not installed on the server I am working on.

Also, how does Entrez output to a file? I know I would want something like the code below, but I don't want to flood the terminal if I do install it.

esearch -db est -query "txid6200[Organism:exp] " | \
efetch -format fasta

ncbi fasta entrez • 4.0k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.9 years ago by steven ▴ 70

Have a look at this post, "C: Obtaining sequence from Bioproject IDs using biopython gives unknown sequence"; it should help. I can help you with Biopython. Also, your link doesn't go to the actual results page.

ADD REPLY • link 8.9 years ago by Kirill Tsyganov ▴ 370

Whoops, sorry about that. I'm familiar with Python, but is there any way to use Biopython for what I need without using Entrez?

ADD REPLY • link 8.9 years ago by steven ▴ 70

You can use Biopython to access Entrez directly:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc108

ADD REPLY • link 8.9 years ago by anderspitman ▴ 70

Ram · Accepted Answer · 2015-06-09

8.9 years ago

steven ▴ 70

Problem solved with BioPython! Thanks for the help.

Edit: here's my code:

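# Command-line usage: python entrez.py database searchterm output.fasta

from Bio import Entrez, SeqIO
import sys
import os

dataBase = sys.argv[1]
searchTerm = sys.argv[2]
outFile = sys.argv[3]

# NCBI asks for a contact email address with every E-utilities request.
Entrez.email = "your.email@goes.here"

# Search the database and collect the IDs of all matching records.
handle = Entrez.esearch(db=dataBase, retmax=100000, term=searchTerm)
record = Entrez.read(handle)
handle.close()

tmpFile = "current_seq.fasta"
with open(outFile, "w") as w:
    for seq_id in record["IdList"]:
        # Fetch one record at a time as FASTA text.
        fetch_handle = Entrez.efetch(db=dataBase, id=seq_id, rettype="fasta", retmode="text")
        fetch_record = SeqIO.read(fetch_handle, "fasta")
        fetch_handle.close()
        # SeqIO.write overwrites the temporary file, so copy it into the output each time.
        SeqIO.write(fetch_record, tmpFile, "fasta")
        with open(tmpFile) as tmp:
            for line in tmp:
                w.write(line)

# Clean up; the temporary file only exists if the search returned at least one ID.
if os.path.exists(tmpFile):
    os.remove(tmpFile)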

ADD COMMENT • link 8.9 years ago by steven ▴ 70

@Steve, do you mind posting your Python code as your answer? I am personally quite interested to see it, but more importantly I feel it will benefit the greater community, and you can point people to your answer if they have a similar problem. I believe the key purpose of this and similar forums is to seek answers, and by posting your code you will help others find answers or give them a starting point.

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.9 years ago by Kirill Tsyganov ▴ 370

@Kirill, no problem, I edited in my code. It's a bit dirty: SeqIO.write overwrites the file on each write when given a filename, so the quick workaround I thought of was to copy the contents of the working FASTA file to the final output FASTA file after each write. I'm sure there's a more efficient way to do this; it took a couple of hours to download ~100,000 sequences.
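
One possible way to cut down on the per-record overhead, sketched under a few assumptions: request the records in batches instead of one ID per call, and write the FASTA text that efetch returns straight to the output file, so no temporary file is needed. The names chunks, download_fasta, and the default batch_size=500 are made up for illustration. retmax=100000 is carried over from the script above, and NCBI may cap how many IDs a single search returns. The sketch also assumes a Biopython version in which efetch hands back a text handle for retmode="text".

```python
def chunks(items, size):
    """Yield successive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def download_fasta(database, search_term, out_path, email, batch_size=500):
    """Search `database` and write every matching record to one FASTA file.

    Hypothetical helper: the name, signature, and batch size are illustrative.
    Returns the number of IDs found by the search.
    """
    # Imported here so that chunks() stays usable without Biopython installed.
    from Bio import Entrez

    Entrez.email = email

    # Collect the IDs of all matching records (same call as the original script).
    handle = Entrez.esearch(db=database, term=search_term, retmax=100000)
    id_list = Entrez.read(handle)["IdList"]
    handle.close()

    with open(out_path, "w") as out:
        for batch in chunks(id_list, batch_size):
            # One request fetches the FASTA text for the whole batch of IDs.
            fetch_handle = Entrez.efetch(
                db=database, id=",".join(batch), rettype="fasta", retmode="text"
            )
            # The response is already FASTA, so it can be written as-is.
            out.write(fetch_handle.read())
            fetch_handle.close()

    return len(id_list)


# Example call (performs network requests, so it is left commented out):
# download_fasta("est", "txid6200[Organism:exp]", "output.fasta", "your.email@goes.here")
```

With a batch size of 500, ~100,000 IDs would need about 200 efetch requests instead of ~100,000. I have not timed this, so treat the batch size as a starting point to tune rather than a recommendation.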