Question: How to get strain names by searching RefSeq assembly accession numbers (GCF_)
20 months ago, horsedog wrote:

I have a set of RefSeq assembly accession numbers, like GCF_002514765.1, GCF_002485085.1, GCF_002201835.1, GCF_000593305.2, GCF_001887655.1, GCF_000194215.1, GCF_002098145.1, GCF_002807875.1

I want to use these to find out which genome each one is. For example, the first one is Escherichia coli strain MOD1-EC3823. I tried to do this with efetch, but it does not work; it fails with "urllib.error.HTTPError: HTTP Error 400: Bad Request". Here is my Python code:

from Bio import Entrez
Entrez.email = "hulala@gmail.com"
ID = open("assembly_ID").read()
handle = Entrez.efetch(db="assembly", id= ID, rettype="gb")
print(handle.read())

Does anyone have any idea?

efetch python ncbi

20 months ago, Joseph Hughes (Scotland, UK) wrote:

Rewriting the following query in Python should get you what you want:

esearch -db assembly -query "GCF_002514765.1" | esummary | xtract -pattern DocumentSummary -element SpeciesName Sub_type Sub_value

The output is:

Escherichia coli    strain  MOD1-EC3823
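
The pipeline above is a shell command built from NCBI's EDirect tools, so it cannot be pasted into a Python script as it stands. Below is a rough sketch of the same two steps using Biopython's Entrez.esearch and Entrez.esummary. The function names strain_fields and lookup_assembly are made up for illustration, and the nested keys (DocumentSummarySet, Biosource, InfraspeciesList) are assumed to mirror the XML elements that xtract reads, so check them against a real response before relying on them.

```python
def strain_fields(doc):
    """Return (species, sub_type, sub_value) from one assembly DocumentSummary.

    `doc` is assumed to be dict-like, with SpeciesName at the top level and
    Sub_type / Sub_value under Biosource -> InfraspeciesList.
    """
    species = str(doc.get("SpeciesName", ""))
    infra = doc.get("Biosource", {}).get("InfraspeciesList", [])
    if isinstance(infra, list) and infra:
        first = infra[0]
        return species, str(first.get("Sub_type", "")), str(first.get("Sub_value", ""))
    return species, "", ""


def lookup_assembly(accession, email):
    """Look up a single accession with esearch + esummary (needs network access)."""
    # Biopython is imported here so that strain_fields can be used without it.
    from Bio import Entrez

    Entrez.email = email

    # Step 1: esearch turns the GCF_ accession into the database's numeric UID.
    handle = Entrez.esearch(db="assembly", term=accession)
    uids = Entrez.read(handle)["IdList"]
    handle.close()
    if not uids:
        return None  # nothing found for this accession

    # Step 2: esummary returns the DocumentSummary for that UID.
    # validate=False is an assumption: it skips DTD validation of the response.
    handle = Entrez.esummary(db="assembly", id=uids[0])
    summary = Entrez.read(handle, validate=False)
    handle.close()

    doc = summary["DocumentSummarySet"]["DocumentSummary"][0]
    return strain_fields(doc)
```

If the assumptions hold, lookup_assembly("GCF_002514765.1", "you@example.com") should give the same three fields as the xtract output.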

Hi, thanks, but Python reports "SyntaxError: invalid syntax" at Sub_value. Do you mean I should replace

ID = open("assembly_ID").read()
handle = Entrez.efetch(db="assembly", id= ID, rettype="gb")

with your code? Also, in my case the -query is not just one ID; there are thousands of them.

written 20 months ago by horsedog

You will need to loop over the accessions in your Python code and query each one in turn.
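
A minimal sketch of such a loop is given here. It assumes the assembly_ID file holds accessions separated by commas, spaces, or newlines. The names parse_accessions and lookup_all are hypothetical, and lookup stands for whatever function performs the esearch + esummary query for a single accession. The delay default is an assumed value chosen to stay under NCBI's limit of three requests per second without an API key.

```python
import re
import time


def parse_accessions(text):
    """Split raw file contents into clean accessions (commas, spaces or newlines)."""
    return [token for token in re.split(r"[,\s]+", text) if token]


def lookup_all(accessions, lookup, delay=0.4):
    """Call lookup(accession) once per accession and collect the results.

    `lookup` is any callable that queries NCBI for one accession.
    `delay` (seconds) is an assumed pause between consecutive calls.
    """
    results = {}
    for index, accession in enumerate(accessions):
        if index:
            time.sleep(delay)  # pause between queries, not before the first
        try:
            results[accession] = lookup(accession)
        except Exception as err:  # one failed accession should not stop the run
            results[accession] = None
            print(f"{accession}: lookup failed ({err})")
    return results


# Usage sketch (my_lookup is a placeholder for your own single-accession query):
# with open("assembly_ID") as fh:
#     accessions = parse_accessions(fh.read())
# names = lookup_all(accessions, my_lookup)
```

Splitting the file contents first also avoids sending the whole file, newlines included, as a single id value.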

written 20 months ago by Joseph Hughes