How do I load more than 200 nucleotide EST sequences into fasta files from NCBI search?
6.4 years ago
steven ▴ 70

example: http://www.ncbi.nlm.nih.gov/nucest/?term=txid6200[Organism:exp]

I would like to load all of the sequences from that search into one fasta file. I know that the Entrez utilities exist, but they are not installed on the server I am working on.

Also, how does Entrez write its output to a file? I know I would want something like the code below, but I don't want to flood the terminal if I do install this.

esearch -db est -query "txid6200[Organism:exp] " | \
efetch -format fasta
Whoops, sorry about that. I'm familiar with Python, but is there any way to use Biopython for what I need without installing the Entrez command-line utilities?

You can use Biopython to access Entrez directly:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc108

6.4 years ago
steven ▴ 70

Problem solved with BioPython! Thanks for the help.

Edit, here's my code:

# command line usage: python entrez.py database searchterm output.fasta

from Bio import Entrez, SeqIO
import sys
import os

dataBase = sys.argv[1]
searchTerm = sys.argv[2]
outFile = sys.argv[3]

Entrez.email = "your.email@goes.here"
# Search the database and parse the result to get the list of matching IDs
handle = Entrez.esearch(db = dataBase, retmax = 100000, term = searchTerm)
record = Entrez.read(handle)
handle.close()
with open(outFile, 'w') as w:
    for id in record["IdList"]:
        # Fetch one record at a time and parse it as FASTA
        fetch_handle = Entrez.efetch(db = dataBase, id = id, rettype = "fasta", retmode="text")
        fetch_record = SeqIO.read(fetch_handle, "fasta")
        fetch_handle.close()
        # Write the record to a temporary file, then append that file to the output
        SeqIO.write(fetch_record, "current_seq.fasta", "fasta")
        with open('current_seq.fasta') as tmp:
            for line in tmp:
                w.write(line)
        os.remove("current_seq.fasta")

@Kirill, No problem, I edited in my code. It's a bit dirty - SeqIO.write overwrites the file on each write when given a filename, and a quick workaround I thought of was to copy the contents of the working fasta to a final output fasta after each write. I'm sure there's a more efficient way to do this; it took a couple of hours to download ~100,000 sequences.

@Steve This is great! I'm sure a lot of other users will find it very useful.
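
One way the per-record download above might be sped up is to request several IDs per efetch call and write the returned FASTA text straight to the output file, which also avoids the temporary file. The following is only a sketch under assumptions: the helper names (batches, entrez_fetch_fasta, download_fasta) are made up for illustration, the batch size of 200 is an arbitrary choice, and it relies on Entrez.efetch accepting a comma-separated string of IDs. Entrez.email should be set before calling it, as in the script above.

```python
def batches(ids, size):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def entrez_fetch_fasta(db, id_batch):
    """Fetch one batch of records as FASTA text (needs Biopython and network access)."""
    # Imported here so the helpers in this sketch can be used without Biopython installed.
    from Bio import Entrez
    handle = Entrez.efetch(db=db, id=",".join(id_batch),
                           rettype="fasta", retmode="text")
    try:
        return handle.read()
    finally:
        handle.close()


def download_fasta(db, ids, out_path, fetch=entrez_fetch_fasta, batch_size=200):
    """Write FASTA text for all IDs to out_path, one request per batch.

    `fetch` is any callable taking (db, id_batch) and returning FASTA text;
    it defaults to the Entrez-backed helper above.
    Returns the number of IDs requested.
    """
    requested = 0
    with open(out_path, "w") as out:
        for id_batch in batches(ids, batch_size):
            text = fetch(db, id_batch)
            out.write(text)
            # Make sure the next batch starts on a new line.
            if text and not text.endswith("\n"):
                out.write("\n")
            requested += len(id_batch)
    return requested
```

With the variables from the script above, this could be called as download_fasta(dataBase, record["IdList"], outFile) in place of the for loop. Passing the fetch function as a parameter is just a convenience so the batching logic can be tried out with a stand-in function before hitting NCBI.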