I have many gene names that I'm trying to map to Entrez IDs.
Right now I use Entrez.esearch from Biopython to query them one by one, but this takes a long time for 30,000 gene names, and ideally I would like it to be faster. I assume it would be faster if I could submit all 30,000 names at once instead of making 30,000 separate queries.
This is my current implementation:

    from Bio import Entrez

    for line in f:
        fields = [item.strip('"') for item in line.strip().split()]
        gene = fields[0]  # assumption: the gene name is in the first column
        # Search NCBI for existing gene ids
        gene_id = None
        handle = Entrez.esearch(db="gene", term="Homo sapiens[orgn] AND " + gene + "[Gene Name]")
        record = Entrez.read(handle)
        handle.close()
        try:
            gene_id = record["IdList"]
        except KeyError:
            pass

This works, but I would like a better solution. Is there a better way to approach this?
Kind regards, Julian
convert gene name to entrez id
The correct term for the identifiers you're calling "gene names" is HGNC Gene Symbols. HGNC also has "Gene Names", which are more like descriptions than one-word symbols.
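On the batching idea from the question: one possible approach is to OR many symbols together in a single esearch term, then call esummary on the returned IDs to find out which symbol each ID belongs to. The sketch below is a rough illustration, not a tested pipeline. The helper names (chunked, build_term, map_symbols), the batch size of 100, the retmax value and the placeholder email address are all assumptions of mine, and the esummary result layout is what I would expect Biopython to return for db="gene", so check it against a small batch first.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def build_term(symbols, organism="Homo sapiens"):
    """Build one esearch term that ORs several gene symbols together."""
    genes = " OR ".join(f"{s}[Gene Name]" for s in symbols)
    return f"{organism}[orgn] AND ({genes})"


def map_symbols(symbols, batch_size=100, email="you@example.org"):
    """Return {symbol: [entrez_id, ...]} using two requests per batch.

    Hypothetical helper: batch_size and email are placeholders.
    """
    from Bio import Entrez  # imported here so the helpers above work without Biopython

    Entrez.email = email
    mapping = {}
    for batch in chunked(symbols, batch_size):
        # retmax is raised because the esearch default only returns 20 IDs
        handle = Entrez.esearch(db="gene", term=build_term(batch), retmax=10000)
        ids = Entrez.read(handle)["IdList"]
        handle.close()
        if not ids:
            continue
        handle = Entrez.esummary(db="gene", id=",".join(ids))
        summary = Entrez.read(handle)
        handle.close()
        wanted = set(batch)
        # Assumed layout of the parsed gene summary; verify on real output
        for doc in summary["DocumentSummarySet"]["DocumentSummary"]:
            name = str(doc["Name"])
            # [Gene Name] can also match aliases, so keep exact symbol hits only
            if name in wanted:
                mapping.setdefault(name, []).append(doc.attributes["uid"])
    return mapping
```

Calling map_symbols(["TP53", "BRCA1"]) would make one esearch and one esummary request instead of two separate searches. With 30,000 symbols and a batch size of 100 that is roughly 600 requests rather than 30,000. Symbols that NCBI only knows as aliases will be missing from the result and need a separate look.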