As you may know, converting gene names between the different nomenclatures sometimes loses information, because not all databases contain all genes, and there are synonyms, new genes, and so on. In my case, I have a list of differentially expressed genes with Ensembl IDs, and I want to convert them to gene symbols so that the list is human readable. I tried two different methods, which gave me a file with three columns: the Ensembl ID and the symbol returned by each method. The file (.csv) could look like this (this is a fake example):
   Ensembl       | Method1 | Method2
1. ENSMUS0000001 | Htt     | NA
2. ENSMUS0000002 | Socs3   | SocsX
3. ENSMUS0000003 | NA      | Jak2
4. ENSMUS0000004 | NA      | NA
Then I would like to merge these into a single column with a script that goes through all the rows. The behavior I expect from the program is: "if the name in the Method1 column is not NA, take it no matter what the other symbol is (cases 1 and 2). If the name in this column is NA and the name in Method2 is not, take the Method2 name (case 3). If both are NA, keep the Ensembl ID (case 4)".
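To make the rule concrete, it can be sketched in Python using only the standard library. This is a minimal sketch, not a definitive implementation. It assumes the file is comma-separated, has a header row with the column names Ensembl, Method1 and Method2, and marks missing values with the literal string NA. Treating empty cells as missing is an extra assumption, and the function names `pick_symbol` and `merge_symbols` are made up for illustration.

```python
import csv
import io


def pick_symbol(ensembl, method1, method2, na="NA"):
    """Return Method1 if present, else Method2, else the Ensembl ID."""
    for candidate in (method1, method2):
        # Assumption: empty or missing cells count as NA too.
        if candidate and candidate.strip() != na:
            return candidate.strip()
    return ensembl


def merge_symbols(lines):
    """Read CSV rows and return a list of (Ensembl ID, chosen name) pairs."""
    reader = csv.DictReader(lines)
    return [
        (row["Ensembl"], pick_symbol(row["Ensembl"], row["Method1"], row["Method2"]))
        for row in reader
    ]


# The fake example from above, written as comma-separated text.
example = """Ensembl,Method1,Method2
ENSMUS0000001,Htt,NA
ENSMUS0000002,Socs3,SocsX
ENSMUS0000003,NA,Jak2
ENSMUS0000004,NA,NA
"""

merged = merge_symbols(io.StringIO(example))
for ensembl, symbol in merged:
    print(ensembl, symbol)
```

For a real file, the same function should accept an open file handle instead of the in-memory example, e.g. `open("genes.csv", newline="")`, where `genes.csv` is a hypothetical file name.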
Could someone shed a little light on how to code this? Is it okay to use a bash script, or should I use another, more powerful language for this?
Thank you in advance.