Question

using efetch in Python

17 months ago

limitless ▴ 10

Hello everyone,

I sent a question earlier today about finding aliases, and someone was kind enough to provide quite a bit of help that put me on the right track. My only issue is that the help was not in Python, and I have spent all day trying to recreate the script on my system without success. This is the code from the answer I got earlier today.

$ esearch -db gene -query "TP53 [gene]" | efetch -format ft 

1. TP53
Official Symbol: TP53 and Name: tumor protein p53 [Homo sapiens (human)]
Other Aliases: BCC7, BMFS5, LFS1, P53, TRP53
Other Designations: cellular tumor antigen p53; antigen NY-CO-13; mutant tumor protein 53; phosphoprotein p53; transformation-related protein 53; tumor protein 53; tumor supressor p53
Chromosome: 17; Location: 17p13.1
Annotation: Chromosome 17 NC_000017.11 (7668421..7687490, complement)
MIM: 191170
ID: 7157

Since then I have tried multiple different things, here are some examples.

summarysearch = Entrez.esearch(db = "gene", term = "TP53",retmax = "2")
genesummary = Entrez.read(summarysearch)
print(genesummary)


id_list = genesummary['IdList']
handle = Entrez.efetch(db = 'gene', id = id_list, rettype = 'db', retmode = 'text' )
print(handle.readline().strip())

and all I've gotten as output is

LOCUS       ON745601                1503 bp    mRNA    linear   PRI 30-OCT-2022

I really do need the fetch to give me the summary just like in the first example that was provided by GenoMax, and would really appreciate some help. I would also like to move the aliases to their own variable if that is possible as well.

Thank you!

Edit:

I have now tried this, and I am getting a little more information back as a list. However, I am still missing the aliases that I wanted in the first place.

handle = Entrez.esearch(db = "nucleotide", term = 'TP53', retmax = "1")
rec_list = Entrez.read(handle)
handle.close()
print(rec_list['Count'])
print(len(rec_list['IdList']))
print(rec_list)

id_list = rec_list['IdList']
handles = Entrez.efetch(db = 'nucleotide', id = id_list, rettype = 'fasta')

recs = list(SeqIO.parse(handles, 'fasta' ))
handles.close()
print(recs)

ncbi python entrez • 1.7k views

updated 17 months ago by Wayne ★ 2.0k • written 17 months ago by limitless ▴ 10

score 3 · Accepted Answer · 2022-11-04

What GenoMax provided shows you how to do this. However, it seems you aren't taking the time to translate it all to Python code. If you search terms related to the second part of his command, you should find something helpful for understanding it. For example, my search for "efetch format ft" turns up results such as the ones quoted below:

From Entrez Direct: E-utilities on the UNIX Command Line

"by J Kans · Cited by 179 — The actual 5-column feature table representation of any sequence record can be obtained directly by using "efetch -format ft". Advanced Topics. Storing"

From Table 1 – Valid values of &retmode and &rettype for EFetch ...:

"Valid values of &retmode and &rettype for EFetch (null = empty string) ... Feature table, ft, text. ..."

From How do I set filter for NCBI esearch to get fasta for Genes only?:

"Oct 25, 2018 — In a situation like that, you can use the -format ft of efetch to first get the feature table; ..."

Putting that together, you'll find that the second part of the command says to take what the query in the first part yields and use efetch to get the 5-column feature table.
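To make the two steps concrete, here is a rough sketch of the two E-utilities requests that the piped command corresponds to. It only builds the URLs and does not contact NCBI. The helper names `esearch_url` and `efetch_url` are made up for illustration; the parameters `db`, `term`, `retmax`, `id`, and `rettype` are the standard E-utilities ones.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=1):
    """Step 1 (hypothetical helper): the search that turns a query into IDs."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, rettype):
    """Step 2 (hypothetical helper): fetch the records for those IDs."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# The ID 7157 is the one shown for TP53 in the question.
print(esearch_url("gene", "TP53 [gene]"))
print(efetch_url("gene", ["7157"], "ft"))
```

Biopython's Entrez.esearch and Entrez.efetch take the same parameters as keyword arguments, so the same two-step shape carries over directly.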

In fact, on that same search page I saw Is there a built-in parser for the textual feature table that Entrez .... In that post, you can find an example of specifying the format as 'feature table', specifically with rettype='ft'.

And stitching the code together:

from Bio import Entrez
Entrez.email = "<email_goes_here>"  # replace with your own email address
search_term='TP53 [gene]'
handle = Entrez.esearch(db = "gene", term = search_term, retmax = "1")
rec_list = Entrez.read(handle)
handle.close()
print(rec_list['Count'])
print(len(rec_list['IdList']))
print(rec_list['IdList'])


id_list = rec_list['IdList']
#handles = Entrez.efetch(db = 'gene', id = id_list, rettype = 'ft')
#-----Below is building from OP & GenoMax code(above) to use GenoMax pointer to get the feature table & then parse feature table for aliases--------------

with Entrez.efetch(db='gene', id = id_list, rettype='ft') as handle:
    feature_table = handle.read()

#print(type(feature_table))
# OPTIONAL: If you'd like/need feature table saved as a text file include the next two lines:
with open('feature_table.txt', 'wb') as f:
    f.write(feature_table)

feature_table_text = feature_table.decode("utf-8") #based on https://stackoverflow.com/a/606199/8508004   
#print(feature_table_text) 
other_aliases_text = feature_table_text.split('Other Aliases: </dt><dd class="desig">',1)[1].split('</dd>',1)[0]
other_aliases = other_aliases_text.split(", ") #other aliases as a list `other_aliases`
print(f"These are the other aliases of search term '{search_term}':\n{', '.join(other_aliases)}")

You'll see the result:

362
1
['7157']
These are the other aliases of search term 'TP53 [gene]':
BCC7, BMFS5, LFS1, P53, TRP53
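One caveat with the split-based parsing: if a record has no 'Other Aliases' entry, the `[1]` index raises an IndexError. Below is a minimal sketch of a more forgiving version, assuming the returned page still uses the same `<dt>`/`<dd class="desig">` markup as in my code. The function name `extract_aliases` and the sample string are made up for illustration and are not real NCBI output.

```python
def extract_aliases(page_text, marker='Other Aliases: </dt><dd class="desig">'):
    """Return the aliases as a list, or an empty list if the marker is absent."""
    if marker not in page_text:
        return []
    # Keep only the text between the marker and the closing </dd> tag.
    after_marker = page_text.split(marker, 1)[1]
    alias_text = after_marker.split("</dd>", 1)[0]
    # Split on commas and drop surrounding whitespace and empty entries.
    return [alias.strip() for alias in alias_text.split(",") if alias.strip()]


# Hypothetical snippet shaped like the markup the parsing code relies on.
sample = '<dt>Other Aliases: </dt><dd class="desig">BCC7, BMFS5, LFS1, P53, TRP53</dd>'
print(extract_aliases(sample))
print(extract_aliases("<dt>Chromosome: </dt><dd>17</dd>"))
```

In the script, you would call it as `other_aliases = extract_aliases(feature_table_text)` in place of the two split lines.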