Question: Python code to standardize gene names in a CSV file

Hi all, I have a CSV file that includes gene names in different formats. For example, PD-L1 might be written as PDL1, PD-L1 or PDL-1, but I want to standardize all of these to HGNC symbols using Python code. Can anyone tell me how I should go about it? I know how to do it for a single gene name (I can manually look up the symbol in HGNC and replace it), but my issue is that I might have many gene names. So I want the code to look up each gene name in my CSV file, automatically fetch the HGNC symbol for it, and replace the existing value with the HGNC symbol. Any help would be deeply appreciated. Thank you :)

modified 3 months ago by Biostar • written 5 months ago by Bioinformatician_in_trouble

Sounds like you're just going to need lots of regexes.

modified 5 months ago • written 5 months ago by jrj.healey

To elaborate, I would start broad. It depends on what sort of input you have, as you only gave us one example, but you might be able to do some really simple 'space reduction' first: for example, a regex to remove all hyphens, and a regex to remove any spaces, newlines, punctuation, etc. Once you've got all the gene IDs in some kind of consistent format (e.g. purely alphanumeric), you might be able to handle specific formats more easily.
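A minimal sketch of that 'space reduction' step, using the variants from the question. The function name `normalize_gene_token` is hypothetical, and uppercasing the result is an extra assumption that is not part of the suggestion above:

```python
import re


def normalize_gene_token(raw: str) -> str:
    """Collapse a free-text gene name to a purely alphanumeric key."""
    # Remove everything that is not a letter or digit:
    # hyphens, spaces, newlines and other punctuation.
    alnum_only = re.sub(r"[^A-Za-z0-9]", "", raw)
    # Assumption: case differences should be ignored when comparing names.
    return alnum_only.upper()


variants = ["PD-L1", "PDL1", "PDL-1", " pd l1\n"]
for v in variants:
    print(repr(v), "->", normalize_gene_token(v))
```

All four variants collapse to the same key, which can then be matched against a reference list.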

written 5 months ago by jrj.healey

I can use regexes for the symbols. But in general I want to replace every gene name with the one given in HGNC. PD-L1 was just an example.

written 5 months ago by Bioinformatician_in_trouble

Are you sure you want HGNC symbols? For PD-L1 the HGNC symbol is CD274, not PDL1 or similar. What exactly do you want, and why? Could you please provide more examples of gene names in your input file and the desired output? Which species is this (human?), and do you have any idea where these names are coming from?

written 5 months ago by Petr Ponomarenko

Yes, HGNC symbols. For example, my CSV file may have "TS" as an input, but the HGNC symbol for that is TYMS, so I want TYMS as output. I do not want to replace each gene symbol manually. The species is human and the gene names come from PubMed abstracts (different authors refer to genes differently, and I want all of them to be the HGNC ones). I hope this is clear.
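A rough sketch of what such a replacement could look like once an alias table is available. The table below is hypothetical and contains only the two pairs mentioned in this thread (PD-L1 to CD274, TS to TYMS); a real table would have to be built from HGNC data rather than typed by hand. The names `ALIAS_TO_HGNC`, `LOOKUP` and `to_hgnc` are made up for illustration:

```python
import re

# Hypothetical alias table: only the pairs mentioned in this thread.
ALIAS_TO_HGNC = {
    "PD-L1": "CD274",
    "TS": "TYMS",
}


def _key(name: str) -> str:
    """Alphanumeric, uppercase key so PD-L1, PDL1 and PDL-1 all match."""
    return re.sub(r"[^A-Za-z0-9]", "", name).upper()


# Index the aliases by their normalized key.
LOOKUP = {_key(alias): symbol for alias, symbol in ALIAS_TO_HGNC.items()}
# Approved symbols should map to themselves.
LOOKUP.update({_key(symbol): symbol for symbol in ALIAS_TO_HGNC.values()})


def to_hgnc(name: str):
    """Return the HGNC symbol for a name, or None if it is not in the table."""
    return LOOKUP.get(_key(name))


for name in ["PDL-1", "TS", "tyms", "not-a-gene"]:
    print(name, "->", to_hgnc(name))
```

Returning `None` for unknown names keeps unmatched entries visible for manual review instead of silently passing them through. Short aliases from abstracts may be ambiguous, so a lookup like this is a starting point rather than a complete solution.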

written 5 months ago by Bioinformatician_in_trouble

This is clear. This is an interesting task to solve.

I'm not sure there is a good standard solution if it can be any name from a PubMed abstract.

Could you please upload an input file somewhere and share it here? I will take a look over the weekend.

Do you have any means to validate, at least manually (but by a professional), whether the generated output is correct? Is it possible that some of the genes mentioned come from model organisms rather than human? If so, can you extract from each abstract which species was used for each gene name? Do you want the human ortholog's name, or the original gene's symbol from that model organism? By any chance, do you have access to the full publication text, or to transcript IDs extracted from it?

Is this a research task for a public university/institute or a commercial application?

written 5 months ago by Petr Ponomarenko

Have you tried the symbol checker as a half-way solution? You could screen-scrape this resource with various Python libraries, since I don't think there is a dedicated API.

written 3 months ago by Garan

I did not know about the symbol checker. Thank you for the info.

written 12 weeks ago by Bioinformatician_in_trouble

If your example is at all literal (if the only difference is a dash in different places in the gene name), I would just run that column of the CSV through string.replace('-', ''), so that all of the instances you gave become "PDL1" and you can match your gene to that.

I'm guessing it's probably not that simple, so regexes would probably be the way to go.
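For the simple case, running one CSV column through `str.replace` could look like the sketch below, which uses only the standard library. The column names `pmid` and `gene`, the sample rows, and the function name `strip_hyphens_in_column` are all made up for illustration:

```python
import csv
import io


def strip_hyphens_in_column(csv_text: str, column: str) -> str:
    """Return the CSV text with hyphens removed from one named column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Only the chosen column is touched; other columns pass through unchanged.
        row[column] = row[column].replace("-", "")
        writer.writerow(row)
    return out.getvalue()


# Hypothetical input: dummy IDs plus the three spellings from the question.
sample = "pmid,gene\n1,PD-L1\n2,PDL1\n3,PDL-1\n"
print(strip_hyphens_in_column(sample, "gene"))
```

With a real file, the same loop would read from and write to files opened with `open(..., newline="")` instead of `io.StringIO`.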

written 5 months ago by Bill Wysocki