Question: What is the best way to match gene IDs with gene names?
M K460 wrote:

Hi All,

I have a list of Ensembl gene IDs and I want to match them with gene names. For example, here is a small part of the list:

 

Gene id strand
ENSG00000242959 1
ENSG00000160396 -1
ENSG00000229494 1
ENSG00000230262 -1
ENSG00000229240 -1
ENSG00000223569 1
ENSG00000214262 -1
ENSG00000256306 1
ENSG00000260580 1
ENSG00000267466 -1
ENSG00000180878 1
ENSG00000233177 1
ENSG00000240567 1
ENSG00000233436 -1
ENSG00000257826 -1
ENSG00000233115 -1
ENSG00000256642 -1
ENSG00000229438 1
ENSG00000254429 1
ENSG00000184258 -1

 

Tags: rna-seq, R
modified 4.6 years ago by Emily_Ensembl18k • written 4.6 years ago by M K460
Ming Tang2.5k wrote:

You might be interested in the Bioconductor package mygene: http://www.bioconductor.org/packages/release/bioc/html/mygene.html

written 4.6 years ago by Ming Tang2.5k

Using the mygene R package to get symbols is very easy:

library(mygene)
gene.list = c('ENSG00000242959', 'ENSG00000160396', 'ENSG00000229494', 'ENSG00000230262', 'ENSG00000229240', 'ENSG00000223569', 'ENSG00000214262', 'ENSG00000256306', 'ENSG00000260580', 'ENSG00000267466', 'ENSG00000180878', 'ENSG00000233177', 'ENSG00000240567', 'ENSG00000233436', 'ENSG00000257826', 'ENSG00000233115', 'ENSG00000256642', 'ENSG00000229438', 'ENSG00000254429', 'ENSG00000184258')
getGenes(gene.list, fields='symbol')

It returns symbols in a DataFrame nicely:

DataFrame with 20 rows and 4 columns
     notfound           query        symbol             _id
    <logical>     <character>   <character>     <character>
1        TRUE ENSG00000242959            NA              NA
2          NA ENSG00000160396         HIPK4          147746
3          NA ENSG00000229494  LOC101927948       101927948
4          NA ENSG00000230262    MIRLET7DHG          158257
5          NA ENSG00000229240     LINC00710 ENSG00000229240
...       ...             ...           ...             ...
16         NA ENSG00000233115     FAM90A11P ENSG00000233115
17         NA ENSG00000256642     LINC00273          649159
18       TRUE ENSG00000229438            NA              NA
19         NA ENSG00000254429 CTD-2562J17.7 ENSG00000254429
20         NA ENSG00000184258          CDR1            1038
written 2.9 years ago by Newgene350

Thank you! It looks to me like (at least for these purposes) mygene is clearly preferable to biomaRt: the output keeps the input order, and you get NAs when a value isn't found.

written 2.5 years ago by lm68750
komal.rathi3.4k wrote:

If you know the Ensembl build that you used to get the Gene IDs, you can easily get the corresponding Gene Names from Biomart or even the GTF file of the same build. If you want to do it in R, use the biomaRt package.

Option 1: Using awk to get gene IDs and gene names from the GTF file:

awk '{                                
    for (i = 1; i <= NF; i++) {
        if ($i ~ /gene_id|gene_name/) {
            printf "%s ", $(i+1)
        }
    }
    print ""
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/' | sort -k1,1 | uniq > Homo_sapiens.GRCh37.70.txt

# Read this file in R

annot = read.delim('Homo_sapiens.GRCh37.70.txt', header=F)
> head(annot)
               V1        V2
1 ENSG00000000003   TSPAN6 
2 ENSG00000000005     TNMD 
3 ENSG00000000419     DPM1 
4 ENSG00000000457    SCYL3 
5 ENSG00000000460 C1orf112 
6 ENSG00000000938      FGR

# Merge your existing data frame with this annotation, assuming the data frame is named existingfile and its Ensembl gene ID column is GeneID

merged.file = merge(existingfile, annot, by.x='GeneID', by.y='V1')

Option 2: Using biomaRt in R:

library(biomaRt)
# Get an archived version of ensembl i.e. ensembl 70 in this case
ensembl = useMart("ENSEMBL_MART_ENSEMBL", dataset = "hsapiens_gene_ensembl", host = "jan2013.archive.ensembl.org")
# example list of ENSG ids
ensemblID = c('ENSG00000242959','ENSG00000160396','ENSG00000229494')
# use this list to get corresponding Gene Symbols
results = getBM(attributes = c('hgnc_symbol','ensembl_gene_id'), filters = "ensembl_gene_id", values = ensemblID, mart = ensembl)

# If you have a csv file you can read it like this
ensemblID = read.csv('file.csv')
# get the GeneIDs in the csv file
ensemblID = ensemblID[,1]
# use biomaRt
results = getBM(attributes = c('hgnc_symbol','ensembl_gene_id'), filters = "ensembl_gene_id", values = ensemblID, mart = ensembl)

Comparison of the two methods:

Using Option 1, you get this for the first three Ensembl IDs:

ENSG00000242959    RP4-599G15.3
ENSG00000160396    HIPK4
ENSG00000229494    AC012494.1

Using Option 2:

hgnc_symbol ensembl_gene_id
      HIPK4 ENSG00000160396
            ENSG00000229494
            ENSG00000242959

Therefore, I would recommend you use the first option.

modified 4.6 years ago • written 4.6 years ago by komal.rathi3.4k

Hi Komal,

Could you please explain how to do that in R? My data is in a .csv file.

written 4.6 years ago by M K460

Where did you get this data from? What is the Ensembl build? 

written 4.6 years ago by komal.rathi3.4k

I used the Ensembl annotation GTF (GRCh37, release 70). I am going to identify antisense, and my file includes gene ID, antisense count, sense count, and strand.

modified 4.6 years ago • written 4.6 years ago by M K460

Here is the link to the GTF file:

ftp://ftp.ensembl.org/pub/release-70/gtf/homo_sapiens

modified 4.6 years ago • written 4.6 years ago by M K460

I have updated my answer. From here on, I will leave things to you; you should really work on your R skills. Read the biomaRt manual, but first learn the R basics, because merging files is a very simple task.

Also, there are many questions on Biostars like this.

modified 4.6 years ago • written 4.6 years ago by komal.rathi3.4k

Search this site for "biomart"; there are lots of examples.

written 4.6 years ago by Neilfws48k

Hi Komal, 

I used the first option, and it works very well; I got all the results I need. I would like your help to include the source column (lincRNA, antisense, protein_coding, ...), which is the second column in the Ensembl GTF file, using the awk command above.

written 4.5 years ago by M K460

I hope you are not talking about the gene_biotype field in the GTF (because that's different than the second column). However, if you want to include the second column, you can get it by modifying the above code like this:

awk '{                                
    for (i = 1; i <= NF; i++) {
        if (i == 1 || $i ~ /gene_id|gene_name/) {
            printf "%s ", $(i+1)
        }
    }
    print ""
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/' | sort -k1,1 | uniq > Homo_sapiens.GRCh37.70.txt
modified 4.5 years ago • written 4.5 years ago by komal.rathi3.4k

The second column I mean is the one shown below (the second field on each line):

1 unprocessed_pseudogene
1 unprocessed_pseudogene
1 unprocessed_pseudogene
1 unprocessed_pseudogene
1 unprocessed_pseudogene
1 unprocessed_pseudogene
1 lincRNA
1 lincRNA
1 miRNA
1 lincRNA
1 lincRNA
1 lincRNA
1 protein_coding
1 protein_coding
1 protein_coding
1 processed_transcript
1 protein_coding
1 protein_coding
1 protein_coding
1 processed_transcript
written 4.5 years ago by M K460

Yes, the awk command above will do that. It will give the second column as well as the columns that you got before.

modified 4.5 years ago • written 4.5 years ago by komal.rathi3.4k

I got the results, but when I read the file into R, it looked like this:

                       V1                             V2
1 3prime_overlapping_ncrna          ENSG00000039068 CDH1 
2 3prime_overlapping_ncrna         ENSG00000103356 EARS2 
3 3prime_overlapping_ncrna         ENSG00000166847 DCTN5 
4 3prime_overlapping_ncrna ENSG00000234531 RP11-288G11.3 
5 3prime_overlapping_ncrna ENSG00000235423 RP11-282O18.3 
6 3prime_overlapping_ncrna ENSG00000254652 RP11-678P16.1 

So how can I use the merge function to merge my file with this one? As you can see, V2 now includes both the gene ID and the gene name. How can we separate them into V1, V2, and V3 so that merge runs correctly?

written 4.5 years ago by M K460

Read in the file like this:

annot = read.delim('Homo_sapiens.GRCh37.70.txt',sep = "",header=F)

Now you will get three separate columns, V1, V2 and V3. Then you can merge like before.

written 4.5 years ago by komal.rathi3.4k

I did that and it worked well, but when I used the merge function in R as shown below, it gave me more observations than expected: merging my file with the annot file should give 16442 observations, but it gave 28453. Is there an option in merge to match only on identical gene IDs?

merged.file = merge(existingfile, annot, by.x='GeneID', by.y='V2')
modified 4.5 years ago • written 4.5 years ago by M K460

Alright here you go,

awk '{                                          
    for (i = 1; i <= NF; i++) {
        if ($i ~ /gene_id|gene_name/) {
            printf "%s %s ", $2, $(i+1)
        }
    }
    print ""
}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/g' | cut -f1,2,4 | sort -k1,1 | uniq > Homo_sapiens.GRCh37.70.txt

annot = read.delim('Homo_sapiens.GRCh37.70.txt',header=F)
annot = unique(annot)

# The problem is that many gene IDs have multiple sources, e.g. the gene ID ENSG00000227070 has two different sources
dups = annot[which(duplicated(annot$V2)),]
annot[grep('ENSG00000227070',annot$V2),]
               V1              V2            V3
64  ambiguous_orf ENSG00000227070 RP11-191G24.1
725     antisense ENSG00000227070 RP11-191G24.1

So you won't get a 1-to-1 relationship if you include the gene source, unfortunately.

modified 4.5 years ago • written 4.5 years ago by komal.rathi3.4k

I couldn't see it.

written 4.5 years ago by M K460

So is there any other way to get a source that matches each gene ID 1-to-1? Or is there any way to remove the undesired sources from the list?

written 4.5 years ago by M K460
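
One way to get a 1-to-1 table is to take a gene-level attribute instead of column 2, which can differ between rows of the same gene (as in the ENSG00000227070 example above). The following is a minimal sketch, assuming the attribute column of your GTF carries a `gene_biotype` key (the field mentioned earlier in this thread; check that your release has it). The sample rows, the IDs `GENE0001`/`GENE0002`, and the file names are made up for illustration.

```shell
# Three made-up GTF rows: GENE0001 appears with two different column-2
# sources but a single gene_biotype; GENE0002 appears once.
printf '1\t%s\texon\t100\t200\t.\t+\t.\tgene_id "%s"; gene_name "%s"; gene_biotype "%s";\n' \
    ambiguous_orf  GENE0001 ExampleA antisense \
    antisense      GENE0001 ExampleA antisense \
    protein_coding GENE0002 ExampleB protein_coding > sample_biotype.gtf

# Split on tabs so the whole attribute string is column 9, then pick the
# three keys by name rather than by position.
awk -F'\t' 'BEGIN { OFS = "\t" }
!/^#/ {
    id = name = biotype = "NA"
    n = split($9, attrs, ";")
    for (i = 1; i <= n; i++) {
        a = attrs[i]
        sub(/^ +/, "", a)                 # trim leading spaces
        if (a == "") continue             # skip the empty piece after the last ";"
        key = a; sub(/ .*/, "", key)      # text before the first space
        val = a; sub(/^[^ ]+ +/, "", val) # text after the key
        gsub(/"/, "", val)                # drop the quotes around the value
        if (key == "gene_id") id = val
        else if (key == "gene_name") name = val
        else if (key == "gene_biotype") biotype = val
    }
    print id, name, biotype
}' sample_biotype.gtf | sort -u > gene_biotype.tsv

cat gene_biotype.tsv
```

If each gene really has a single `gene_biotype`, then `cut -f1 gene_biotype.tsv | sort | uniq -d` should print nothing, and the table can be merged on the gene ID without multiplying rows.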

Honestly, you will have to ask your supervisor about that.

written 4.5 years ago by komal.rathi3.4k

Thanks a lot for helping me.

written 4.5 years ago by M K460
awk '{                                
for (i = 1; i <= NF; i++) {
    if ($i ~ /gene_id|gene_name/) {
        printf "%s ", $(i+1)
    }
}
print ""

}' Homo_sapiens.GRCh37.70.gtf | sed -e 's/"//g' -e 's/;//g' -e 's/ /\t/' | sort -k1,1

Can anyone explain what the loop and the sed command do?

written 2.1 years ago by krushnach80500
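
A step-by-step sketch of what the pipeline above does, run on a single GTF-style line. The sample line is made up for illustration (the coordinates and transcript ID are placeholders); only the gene ID and name pair comes from this thread. A literal tab is used in the last sed step because `\t` in a sed replacement is a GNU extension.

```shell
# One made-up GTF row: nine tab-separated columns; the ninth holds
# space-separated key "value"; attribute pairs.
printf '1\tprotein_coding\texon\t100\t200\t.\t+\t.\tgene_id "ENSG00000160396"; transcript_id "ENST_PLACEHOLDER"; gene_name "HIPK4";\n' > sample.gtf

# Step 1 (awk): awk splits each row on whitespace, so every key and every
# value is its own field. The loop visits each field; when a field matches
# gene_id or gene_name, the field right after it (the quoted value) is
# printed followed by a space. print "" then ends the line for that row.
step1=$(awk '{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /gene_id|gene_name/) {
            printf "%s ", $(i+1)
        }
    }
    print ""
}' sample.gtf)
echo "$step1"    # "ENSG00000160396"; "HIPK4"; (with a trailing space)

# Step 2 (sed): delete every double quote and every semicolon.
step2=$(printf '%s\n' "$step1" | sed -e 's/"//g' -e 's/;//g')
echo "$step2"    # ENSG00000160396 HIPK4 (with a trailing space)

# Step 3 (sed): turn the FIRST space on the line into a tab (no g flag).
tab=$(printf '\t')
step3=$(printf '%s\n' "$step2" | sed -e "s/ /${tab}/")
echo "$step3"

# Step 4: sort -k1,1 orders the lines by gene ID so identical lines end up
# next to each other; uniq then collapses adjacent duplicates, which leaves
# one line per distinct ID/name pair instead of one per exon or CDS row.
printf '%s\n%s\n' "$step3" "$step3" | sort -k1,1 | uniq
```

Because the last substitution has no `g` flag, only the first space becomes a tab. With three values per line, the second and third stay together in one column, which is what happened in the three-column output earlier in this thread.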
Emily_Ensembl18k wrote:

Use BioMart. Here's a help video. Filter by "ID list limit" using your list of IDs, and select the Ensembl ID and Associated Gene Name as attributes.

written 4.6 years ago by Emily_Ensembl18k

Hi Emily,

Thanks for helping me. I followed all the steps in the video, but I noticed two issues. First, I have 16243 gene IDs but BioMart returned only 15596, which means 647 genes are missing. Second, I ordered my gene IDs by locus, but BioMart sorted the results differently.

written 4.6 years ago by M K460

If you're using IDs from release 70, you'll want to use BioMart on our archive site. Also, BioMart doesn't sort IDs, it just spits them out randomly. Include the location attributes, then you can sort your table when you're done.

written 4.6 years ago by Emily_Ensembl18k
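
Both issues raised above (results in a different order, and IDs that silently drop out) can also be handled after the export by looking each ID up in the BioMart table while walking through the original list. This is a minimal sketch, assuming a two-column tab-separated export (gene ID, gene name) without a header line and a list with one ID per line. The file names `ids.txt`, `mart_export.txt`, and `ids_with_names.tsv` are hypothetical, and the sample export deliberately omits one ID to show the `NA` case.

```shell
# Original list, in the order you want to keep.
printf 'ENSG00000242959\nENSG00000160396\nENSG00000184258\n' > ids.txt

# Stand-in for a BioMart export: different order, one ID missing.
printf 'ENSG00000184258\tCDR1\nENSG00000160396\tHIPK4\n' > mart_export.txt

awk -F'\t' 'BEGIN { OFS = "\t" }
    # First file (the export): build an ID -> name lookup table.
    NR == FNR { name[$1] = $2; next }
    # Second file (your list): print every ID in its original order,
    # with NA where the export had no row for it.
    { print $1, (($1 in name) ? name[$1] : "NA") }
' mart_export.txt ids.txt > ids_with_names.tsv

cat ids_with_names.tsv

# List the IDs that the export did not return.
awk -F'\t' '$2 == "NA" { print $1 }' ids_with_names.tsv
```

The output has exactly one line per input ID, so it can be pasted back next to the original table, and the `NA` rows show which IDs to retry against the archive site.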

Aren't you able to get the gene names and IDs using awk and merge in R? If you wanted to do it in BioMart, why did you emphasize getting an R-based solution?

modified 4.6 years ago • written 4.6 years ago by komal.rathi3.4k

Hi Komal, 

I did both, using R and BioMart, but I found an option in BioMart that links directly to the Ensembl website, which gives more details on these genes. I ran your awk and R code and they work nicely. Thanks again for your help.

written 4.6 years ago by M K460

I have updated my answer so that you can use biomart in R using the Ensembl 70 build.

written 4.6 years ago by komal.rathi3.4k