How do I load more than 200 nucleotide EST sequences into fasta files from NCBI search?
8.9 years ago
steven ▴ 70

Example: http://www.ncbi.nlm.nih.gov/nucest/?term=txid6200[Organism:exp]

I would like to load all of the sequences from that search into one fasta file. I know that the Entrez utilities exist, but they are not installed on the server I am working on.

Also, how does Entrez output to a file? I know I would want something like the code below, but I don't want to flood the terminal if I do install this.

esearch -db est -query "txid6200[Organism:exp] " | \
efetch -format fasta
ncbi fasta entrez • 4.0k views
Have a look at this post: "Obtaining sequence from Bioproject IDs using biopython gives unknown sequence". It should help, and I can help you with Biopython. Also, your link doesn't go to the actual results page.

Whoops, sorry about that. I'm familiar with Python, but is there any way to use Biopython for the function I need without using Entrez?


You can use Biopython to access Entrez directly:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc108

8.9 years ago
steven ▴ 70

Problem solved with Biopython! Thanks for the help.

Edit: here's my code:

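# Command-line usage: python entrez.py database searchterm output.fasta

from Bio import Entrez, SeqIO
import sys
import os

dataBase = sys.argv[1]
searchTerm = sys.argv[2]
outFile = sys.argv[3]

# NCBI asks for a contact address with E-utilities requests
Entrez.email = "your.email@goes.here"

# Search for matching record IDs (at most retmax are returned)
handle = Entrez.esearch(db = dataBase, retmax = 100000, term = searchTerm)
record = Entrez.read(handle)
handle.close()

with open(outFile, 'w') as w:
    for seq_id in record["IdList"]:
        # Fetch one record at a time as FASTA text
        fetch_handle = Entrez.efetch(db = dataBase, id = seq_id, rettype = "fasta", retmode="text")
        fetch_record = SeqIO.read(fetch_handle, "fasta")
        fetch_handle.close()
        # SeqIO.write overwrites a file given by name, so write to a scratch
        # file and copy its contents into the final output
        SeqIO.write(fetch_record, "current_seq.fasta", "fasta")
        with open('current_seq.fasta') as current:
            for line in current:
                w.write(line)

# Remove the scratch file (it does not exist if the search returned no IDs)
if os.path.exists("current_seq.fasta"):
    os.remove("current_seq.fasta")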

@Steve Do you mind posting some Python code as your answer? I am personally quite interested to see the code, but more importantly I feel your code will benefit the greater community, and you can point people to your answer if they have a similar problem. I believe the key purpose of this and similar forums is to seek answers, and by posting your code you will enable others to find answers or give them a starting point.


@Kirill, no problem, I edited in my code. It's a bit dirty: SeqIO overwrites the file on each write, and a quick workaround I thought of was to copy the contents of the working fasta to a final output fasta after each write. I'm sure there's a more efficient way to do this; it took a couple of hours to download ~100,000 sequences.
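
One possible way to cut down the run time, offered as a sketch rather than a tested replacement for the script above: request the IDs in batches instead of one at a time, and write the returned FASTA text straight to the already-open output file, so no scratch file is needed. The helper names (chunked, fetch_in_batches, entrez_fetch) and the batch size of 200 are illustrative assumptions, not part of Biopython; only Entrez.efetch is a real Biopython call.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(ids, fetch, out_handle, batch_size=200):
    """Write FASTA text for all ids to out_handle, one request per batch.

    `fetch` is any callable that takes a list of IDs and returns FASTA text.
    Returns the number of batches requested.
    """
    batches = 0
    for batch in chunked(ids, batch_size):
        text = fetch(batch)
        if text:
            # Make sure consecutive batches never run together on one line
            out_handle.write(text if text.endswith("\n") else text + "\n")
        batches += 1
    return batches


def entrez_fetch(db, email):
    """Build a fetch callable backed by Bio.Entrez.efetch (needs Biopython)."""
    from Bio import Entrez  # imported here so the helpers above work without it
    Entrez.email = email

    def fetch(batch):
        # efetch accepts a comma-separated list of IDs in a single request
        handle = Entrez.efetch(db=db, id=",".join(batch),
                               rettype="fasta", retmode="text")
        try:
            return handle.read()
        finally:
            handle.close()

    return fetch


# Hypothetical usage with the variables from the script above:
# with open(outFile, "w") as w:
#     fetch_in_batches(record["IdList"],
#                      entrez_fetch(dataBase, "your.email@goes.here"), w)
```

Because the output file is opened once and only appended to, the overwrite problem does not come up, and 100,000 IDs turn into about 500 requests at this batch size.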


@Steve This is great! I'm sure a lot of other users will find it very useful.
