What is the best way to match gene IDs with gene names?
9.4 years ago
M K ▴ 660

Hi All,

I have a list of gene IDs that I want to match with gene names. Here is a small part of the list:

Gene id             strand
ENSG00000242959     1
ENSG00000160396     -1
ENSG00000229494     1
ENSG00000230262     -1
ENSG00000229240     -1
ENSG00000223569     1
ENSG00000214262     -1
ENSG00000256306     1
ENSG00000260580     1
ENSG00000267466     -1
ENSG00000180878     1
ENSG00000233177     1
ENSG00000240567     1
ENSG00000233436     -1
ENSG00000257826     -1
ENSG00000233115     -1
ENSG00000256642     -1
ENSG00000229438     1
ENSG00000254429     1
ENSG00000184258     -1
RNA-Seq R • 17k views
9.4 years ago
komal.rathi ★ 4.1k

If you know the Ensembl build that you used to get the Gene IDs, you can easily get the corresponding Gene Names from Biomart or even the GTF file of the same build. If you want to do it in R, use the biomaRt package.

Option 1: Using awk to get Gene ID and Name from GTF file:

The command below prints the gene_id and gene_name values from every GTF line, strips the quotes and semicolons, and keeps one sorted line per gene ID/name pair:

awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /gene_id|gene_name/) {
            printf "%s ", $(i+1)
        }
    }
    print ""
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/' | sort -k1,1 | uniq > Homo_sapiens.GRCh37.70.txt

Read this file into R:

> annot = read.delim('Homo_sapiens.GRCh37.70.txt', header=F)

> head(annot)
               V1        V2
1 ENSG00000000003   TSPAN6 
2 ENSG00000000005     TNMD 
3 ENSG00000000419     DPM1 
4 ENSG00000000457    SCYL3 
5 ENSG00000000460 C1orf112 
6 ENSG00000000938      FGR

Merge your existing data with this annotation, assuming your data frame is called existingfile and the column containing the Ensembl Gene IDs is GeneID:

merged.file = merge(existingfile, annot, by.x='GeneID', by.y='V1')
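If the R merge step is a hurdle, the same lookup can be sketched on the command line. This is only an illustration, not part of the recipe above: the file names (annot.txt, ids.txt) and the last gene ID are made up. It loads the two-column annotation into an awk array, then walks the ID list in its original order and prints NA where no name is found.

```shell
# Hypothetical sample files, created here only so the sketch runs on its own.
workdir=$(mktemp -d)

# Two-column annotation (gene ID <tab> gene name), like the output of the awk/sed step.
printf 'ENSG00000160396\tHIPK4\nENSG00000229494\tAC012494.1\nENSG00000242959\tRP4-599G15.3\n' > "$workdir/annot.txt"

# Gene IDs in the order you care about; the last one is invented and absent from annot.txt.
printf 'ENSG00000242959\nENSG00000160396\nENSG00000999999\n' > "$workdir/ids.txt"

# First file (NR == FNR): remember the name for each ID.
# Second file: print each ID with its name, or NA if the ID was never seen.
lookup=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { name[$1] = $2; next }
    { print $1, (($1 in name) ? name[$1] : "NA") }
' "$workdir/annot.txt" "$workdir/ids.txt")

printf '%s\n' "$lookup"
```

Unlike merge(), this keeps the rows in the order of ids.txt, which may matter if your list is sorted by locus.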

Option 2: Using biomaRt in R:

library(biomaRt)
# Get an archived version of ensembl i.e. ensembl 70 in this case
ensembl = useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "jan2013.archive.ensembl.org")
# example list of ENSG ids
ensemblID = c('ENSG00000242959','ENSG00000160396','ENSG00000229494')
# use this list to get corresponding Gene Symbols
results = getBM(attributes = c('hgnc_symbol','ensembl_gene_id'), filters = "ensembl_gene_id", values = ensemblID, mart = ensembl)

# If you have a csv file you can read it like this
ensemblID = read.csv('file.csv')
# get the GeneIDs in the csv file
ensemblID = ensemblID[,1]
# use biomaRt
results = getBM(attributes = c('hgnc_symbol','ensembl_gene_id'), filters = "ensembl_gene_id", values = ensemblID, mart = ensembl)

Comparison of the two methods:

Using Option 1, you get this for the first three Ensembl IDs:

ENSG00000242959    RP4-599G15.3
ENSG00000160396    HIPK4
ENSG00000229494    AC012494.1

Using Option 2:

hgnc_symbol ensembl_gene_id
      HIPK4 ENSG00000160396
            ENSG00000229494
            ENSG00000242959

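Option 2 returns an empty hgnc_symbol for the two genes that have no HGNC symbol, whereas the GTF gene_name is filled in for all three. Therefore, I would recommend you use the first option.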


Hi Komal,

Could you please explain how to do that in R? My data is in a .csv file.


Where did you get this data from? What is the Ensembl build?


I used the Ensembl annotation GTF, GRCh37.70. I am going to identify antisense, and my file includes the gene ID, antisense count, sense count, and strand.


Here is the link to the GTF file:

ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens


I have updated my answer. From here on I will leave things to you; you should really work on your R skills. Read the biomaRt manual, but first learn the R basics, because merging files is a very simple task.

Also, there are many questions on Biostars like this.


Search this site for "biomart"; there are lots of examples.


Hi Komal,

I used the first option; it works very well and I got all the results I need. I would like your help to include the source column (lincRNA, antisense, protein_coding, ...), which is the second column in the Ensembl GTF file, using the awk command above.


I hope you are not talking about the gene_biotype field in the GTF (because that's different than the second column). However, if you want to include the second column, you can get it by modifying the above code like this:

awk '{
    for (i = 1; i <= NF; i++) {
        if (i == 1 || $i ~ /gene_id|gene_name/) {
            printf "%s ", $(i+1)
        }
    }
    print ""
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/' | sort -k1,1 | uniq > Homo_sapiens.GRCh37.70.txt

The second column I mean is shown below:

1     unprocessed_pseudogene
1     unprocessed_pseudogene
1     unprocessed_pseudogene
1     unprocessed_pseudogene
1     unprocessed_pseudogene
1     unprocessed_pseudogene
1     lincRNA
1     lincRNA
1     miRNA
1     lincRNA
1     lincRNA
1     lincRNA
1     protein_coding
1     protein_coding
1     protein_coding
1     processed_transcript
1     protein_coding
1     protein_coding
1     protein_coding
1     processed_transcript

Yes, the awk command above will do that. It will give the second column as well as the columns that you got before.


I got the results, but when I tried to read the file into R, it came out like this:

                       V1                             V2
1 3prime_overlapping_ncrna          ENSG00000039068 CDH1
2 3prime_overlapping_ncrna         ENSG00000103356 EARS2
3 3prime_overlapping_ncrna         ENSG00000166847 DCTN5
4 3prime_overlapping_ncrna ENSG00000234531 RP11-288G11.3
5 3prime_overlapping_ncrna ENSG00000235423 RP11-282O18.3
6 3prime_overlapping_ncrna ENSG00000254652 RP11-678P16.1

So how can I use the merge function to merge my file with this one? As you can see, V2 now includes both the gene ID and the gene name. How can we separate them into V1, V2, and V3 so that merge runs correctly?


Read in the file like this:

annot = read.delim('Homo_sapiens.GRCh37.70.txt',sep = "",header=F)

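With sep = "", any run of whitespace (spaces or tabs) is treated as a separator, so you will now get three separate columns, V1, V2 and V3. Then you can merge like before, this time with by.y='V2' because the gene ID is in the second column.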
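The reason V2 ended up holding two values is that the sed expression s/ /\t/ replaces only the first space on each line. A small sketch of the difference, assuming GNU sed (which understands \t in the replacement) and a sample line with the trailing space omitted for clarity:

```shell
# One line roughly as the awk step prints it: three values separated by single spaces.
line='3prime_overlapping_ncrna ENSG00000039068 CDH1'

# Without the g flag only the first space becomes a tab, leaving two tab-separated columns.
first_only=$(printf '%s\n' "$line" | sed -e 's/ /\t/')

# With the g flag every space becomes a tab, giving three tab-separated columns.
all_spaces=$(printf '%s\n' "$line" | sed -e 's/ /\t/g')

# Count the tab-separated columns in each result.
cols_first=$(printf '%s\n' "$first_only" | awk -F'\t' '{ print NF }')
cols_all=$(printf '%s\n' "$all_spaces" | awk -F'\t' '{ print NF }')

echo "first space only: $cols_first columns; all spaces: $cols_all columns"
```

Either fix works: read the file with sep = "" in R, or add the g flag to the last sed expression before writing the file.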


I did that and it worked well, but when I used the merge function in R as shown below, it gave me more observations than I expected: merging my file with the annot file should give 16442 observations, but it gave 28453. Is there any option in the merge function to match only on identical gene IDs?

merged.file = merge(existingfile, annot, by.x='GeneID', by.y='V2')

Alright here you go,

awk '{                                          
    for (i = 1; i <= NF; i++) {
        if ($i ~ /gene_id|gene_name/) {
            printf "%s %s ", $2, $(i+1)
        }
    }
    print ""
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/g' | cut -f1,2,4 | sort -k1,1 | uniq > Homo_sapiens.GRCh37.70.txt

annot = read.delim('Homo_sapiens.GRCh37.70.txt',header=F)
annot = unique(annot)

# The problem is that many Gene IDs have multiple sources, e.g. Gene ID ENSG00000227070 has two different sources
dups = annot[which(duplicated(annot$V2)),]
annot[grep('ENSG00000227070',annot$V2),]
               V1              V2            V3
64  ambiguous_orf ENSG00000227070 RP11-191G24.1
725     antisense ENSG00000227070 RP11-191G24.1

So you won't get a 1-to-1 relationship if you include the gene source, unfortunately.
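One possible way to keep a single row per gene, assuming your GTF carries the gene_biotype attribute mentioned earlier (check your own file), is to pull that attribute instead of column 2. As far as I know, column 2 in these older Ensembl GTFs follows the transcript, while gene_biotype is assigned per gene. A sketch on two made-up GTF lines for the same gene (coordinates, transcript IDs and the biotype value are invented):

```shell
workdir=$(mktemp -d)

# Two hypothetical GTF lines: same gene, different column-2 source.
{
printf '1\tantisense\texon\t100\t200\t.\t-\t.\tgene_id "ENSG00000227070"; transcript_id "ENST00000000001"; gene_name "RP11-191G24.1"; gene_biotype "antisense";\n'
printf '1\tambiguous_orf\texon\t150\t250\t.\t-\t.\tgene_id "ENSG00000227070"; transcript_id "ENST00000000002"; gene_name "RP11-191G24.1"; gene_biotype "antisense";\n'
} > "$workdir/sample.gtf"

# Scan the attribute fields (field 9 onwards) for the three keys and keep
# the value that follows each one; then strip quotes and semicolons.
biotypes=$(awk '{
    id = name = biotype = ""
    for (i = 9; i < NF; i++) {
        if ($i == "gene_id")      id      = $(i+1)
        if ($i == "gene_name")    name    = $(i+1)
        if ($i == "gene_biotype") biotype = $(i+1)
    }
    gsub(/[";]/, "", id); gsub(/[";]/, "", name); gsub(/[";]/, "", biotype)
    print id "\t" name "\t" biotype
}' "$workdir/sample.gtf" | sort -u)

printf '%s\n' "$biotypes"
```

Both input lines collapse to a single row here because the gene-level values are identical, even though the column-2 sources differ.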


I couldn't see it.


So is there any other way to get a source that matches each gene ID 1-to-1? Or is there a way to remove the undesired sources from the list?


Honestly, you will have to ask your supervisor about that.
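Whichever rule you settle on, one mechanical option is to keep every source but collapse them onto a single row per gene, so that a later merge stays 1-to-1. A sketch with awk on a made-up three-column file (source, gene ID, gene name); the file name and the HIPK4 row are only for illustration:

```shell
workdir=$(mktemp -d)

# Sample rows in the three-column layout: source, gene ID, gene name.
printf 'ambiguous_orf\tENSG00000227070\tRP11-191G24.1\nantisense\tENSG00000227070\tRP11-191G24.1\nprotein_coding\tENSG00000160396\tHIPK4\n' > "$workdir/annot3.txt"

# First time a gene ID is seen: store its name and first source.
# Later rows for the same ID: append the source to a comma-separated list.
# At the end, print one row per gene: gene ID, gene name, all sources.
collapsed=$(awk -F'\t' -v OFS='\t' '
    !($2 in src) { name[$2] = $3; src[$2] = $1; next }
    { src[$2] = src[$2] "," $1 }
    END { for (id in src) print id, name[id], src[id] }
' "$workdir/annot3.txt" | sort -k1,1)

printf '%s\n' "$collapsed"
```

The result has exactly one line per gene ID, so merging on that column no longer multiplies rows.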


Thanks a lot for helping me.

awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /gene_id|gene_name/) {
            printf "%s ", $(i+1)
        }
    }
    print ""
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/' | sort -k1,1

Can anyone explain what the loop and the sed command each do?
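A rough walk-through, using one made-up GTF line so the intermediate results are visible. This is a sketch of how the pipeline behaves, assuming GNU sed for the \t replacement; the coordinates and transcript ID in the sample line are invented.

```shell
# A made-up GTF line: 8 tab-separated columns, then the attribute column.
gtf_line=$(printf '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "ENSG00000000003"; transcript_id "ENST00000000001"; gene_name "TSPAN6";')

# Step 1 (awk): the line is split on whitespace into fields $1..$NF and the
# loop visits each field. Whenever a field matches gene_id or gene_name, the
# NEXT field (its value) is printed followed by a space. print "" ends the line.
step1=$(printf '%s\n' "$gtf_line" | awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /gene_id|gene_name/) {
            printf "%s ", $(i+1)
        }
    }
    print ""
}')

# Step 2 (sed): delete all double quotes, delete all semicolons, then turn
# the first space into a tab so the ID and the name become two columns.
step2=$(printf '%s\n' "$step1" | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/')

printf 'after awk: [%s]\nafter sed: [%s]\n' "$step1" "$step2"
```

The final sort -k1,1 orders the lines by the first column (the gene ID). In the full command, uniq then drops the repeated lines, since every exon and CDS line of a gene prints the same ID/name pair.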

9.4 years ago
Ming Tommy Tang ★ 3.9k

You might be interested in the bioconductor package mygene.


Using the mygene R package to get symbols is very easy:

library(mygene)
gene.list = c('ENSG00000242959', 'ENSG00000160396', 'ENSG00000229494', 'ENSG00000230262', 'ENSG00000229240', 'ENSG00000223569', 'ENSG00000214262', 'ENSG00000256306', 'ENSG00000260580', 'ENSG00000267466', 'ENSG00000180878', 'ENSG00000233177', 'ENSG00000240567', 'ENSG00000233436', 'ENSG00000257826', 'ENSG00000233115', 'ENSG00000256642', 'ENSG00000229438', 'ENSG00000254429', 'ENSG00000184258')
getGenes(gene.list, fields='symbol')

It returns symbols in a DataFrame nicely:

DataFrame with 20 rows and 4 columns
     notfound           query        symbol             _id
    <logical>     <character>   <character>     <character>
1        TRUE ENSG00000242959            NA              NA
2          NA ENSG00000160396         HIPK4          147746
3          NA ENSG00000229494  LOC101927948       101927948
4          NA ENSG00000230262    MIRLET7DHG          158257
5          NA ENSG00000229240     LINC00710 ENSG00000229240
...       ...             ...           ...             ...
16         NA ENSG00000233115     FAM90A11P ENSG00000233115
17         NA ENSG00000256642     LINC00273          649159
18       TRUE ENSG00000229438            NA              NA
19         NA ENSG00000254429 CTD-2562J17.7 ENSG00000254429
20         NA ENSG00000184258          CDR1            1038

Thank you! It looks to me like (at least for these purposes) mygene is clearly preferable to BiomaRt (ordered output and NAs when the value isn't found!)

9.4 years ago
Emily 23k

Use BioMart. Here is a help video. Filter using "ID list limit" with your list of IDs, then select the Ensembl Gene ID and Associated Gene Name as attributes.


Hi Emily,

Thanks for helping me. I followed all the steps in the video, but I noticed two issues. First, I have 16243 gene IDs but BioMart returned only 15596, which means 647 genes are missing. Second, I had ordered my gene IDs by locus, but BioMart returned them in a different order.


If you're using IDs from release 70, you'll want to use BioMart on our archive site. Also, BioMart doesn't sort IDs, it just spits them out randomly. Include the location attributes, then you can sort your table when you're done.
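To see which IDs dropped out, a quick comparison of the two lists can help. The sketch below is not tied to BioMart itself: the file names are made up, and the "returned" list is a hypothetical result built from three IDs in the question.

```shell
workdir=$(mktemp -d)

# Your original ID list, in the order you submitted it.
printf 'ENSG00000242959\nENSG00000160396\nENSG00000229494\n' > "$workdir/my_ids.txt"

# IDs that came back, in whatever order they were returned (hypothetical result).
printf 'ENSG00000229494\nENSG00000160396\n' > "$workdir/biomart_ids.txt"

# Remember every returned ID, then print the original IDs that were never returned.
missing=$(awk 'NR == FNR { seen[$1]; next } !($1 in seen)' "$workdir/biomart_ids.txt" "$workdir/my_ids.txt")

printf 'Missing from the BioMart output:\n%s\n' "$missing"
```

Because the second pass reads my_ids.txt top to bottom, the missing IDs are reported in your original order.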


Aren't you able to get the Gene Names & IDs using awk & merge in R? If you wanted to do it in BioMart, why did you insist on an R-based solution?


Hi Komal,

I did both, using R and BioMart, and I found an option in BioMart that links directly to the Ensembl website, which gives more details on these genes. I ran your awk and R code and they work nicely. Thanks again for your help.


I have updated my answer so that you can use biomart in R using the Ensembl 70 build.
