I have a long list (roughly 60,000) of Ensembl IDs, and I want to convert them to gene symbols and write them to a text file. It takes a few hours to get through only a little over 20,000 IDs. Can anyone tell me what the problem is? Below is part of my code; the variable lines
in the code is the list of Ensembl IDs, as shown in the picture.
#Coding:
import mygene

mg = mygene.MyGeneInfo()
getgenedata = {}
getgenesymbol = {}
k = 0
while k < 100:
    newfile = open('data collector3.txt', 'a')
    # One request per ID; output: {'_id': '7105', '_version': 2, 'symbol': 'TSPAN6'}
    getgenedata[k] = mg.getgene(lines[k], fields='symbol')
    if getgenedata[k] is not None:  # keep only the symbol; _id and _version are not needed
        getgenesymbol[k] = getgenedata[k].get('symbol')
        newfile.write(str(getgenesymbol[k]))
        newfile.write('\n')
    k += 1
    newfile.close()
Use BioMart to download a file containing the Ensembl gene IDs and their symbols, sort both your file and the Ensembl file on the ID column, and then merge them with join: https://linux.die.net/man/1/join
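If you would rather stay in Python, the same lookup can be sketched roughly as below. This is only an illustration, not the join(1) pipeline itself: it swaps the sort-and-join step for an in-memory dictionary lookup, and it assumes the BioMart export is a two-column, tab-separated table with one header row (Ensembl gene ID, then gene symbol). The function names, the "NA" placeholder and the toy data are made up for the example.

```python
import csv
import io


def load_id_to_symbol(handle, has_header=True):
    """Build {ensembl_id: symbol} from a two-column, tab-separated table.

    Assumed layout (hypothetical): column 1 = Ensembl gene ID,
    column 2 = gene symbol, with one header row.
    """
    mapping = {}
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the header row
    for row in reader:
        if len(row) >= 2 and row[0] and row[1]:
            mapping[row[0]] = row[1]
    return mapping


def convert(ids, mapping, missing="NA"):
    """Look up each ID locally; IDs without a symbol get the placeholder."""
    return [mapping.get(gene_id.strip(), missing) for gene_id in ids]


# Toy stand-in for the BioMart export; a real run would open the downloaded file.
biomart_export = io.StringIO(
    "Gene stable ID\tGene name\n"
    "ENSG00000000003\tTSPAN6\n"
    "ENSG00000000005\tTNMD\n"
)
mapping = load_id_to_symbol(biomart_export)
symbols = convert(["ENSG00000000003", "ENSG00000000005", "ENSG_UNKNOWN"], mapping)
print(symbols)
```

Because every lookup is local, no network request is made per ID, and the result can be written out in a single pass with the output file opened once.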
This has been asked many times before. Please use the search function or Google it, e.g. "Translating gene names to entrez id's".
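For anyone who wants to keep using mygene, a batched query is a likely way around the one-request-per-ID loop in the question. The sketch below is only a rough outline: it assumes the IDs are human Ensembl gene IDs (hence scopes="ensembl.gene" and species="human"), and the helper name symbols_for is made up for the example. It takes the client as an argument so it can be tried without network access.

```python
def symbols_for(client, ensembl_ids):
    """Return {ensembl_id: symbol} using one batched querymany call.

    `client` is expected to be a mygene.MyGeneInfo() instance.
    Assumption: the IDs are human Ensembl gene IDs.
    """
    hits = client.querymany(
        ensembl_ids,
        scopes="ensembl.gene",
        fields="symbol",
        species="human",
    )
    symbols = {}
    for hit in hits:
        # Unmatched IDs come back flagged with 'notfound'; skip those.
        if hit.get("notfound") or "symbol" not in hit:
            continue
        # Keep the first hit if an ID matches more than one gene.
        symbols.setdefault(hit["query"], hit["symbol"])
    return symbols


# Usage with the question's variable `lines` (needs mygene and network access):
#   import mygene
#   mg = mygene.MyGeneInfo()
#   id_to_symbol = symbols_for(mg, lines)
#   with open('data collector3.txt', 'w') as out:
#       for gene_id in lines:
#           out.write(id_to_symbol.get(gene_id, 'NA') + '\n')
```

The output file is opened once and written in input order, with 'NA' as a placeholder for IDs that have no symbol.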