BioMart dropping and duplicating Ensembl IDs while retrieving corresponding gene symbols?
4 days ago
skjw1029 ▴ 30

I'm trying to convert Ensembl IDs to gene symbols within a SummarizedExperiment object (more or less an expression matrix) using biomaRt.

library(biomaRt)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ENSEMBL_MART_ENSEMBL"))
genes <- rownames(gse_cellgenefiltered_cohort1)
G_list <- getBM(filters = "ensembl_gene_id", attributes = c("ensembl_gene_id", "hgnc_symbol"), values = genes, mart = mart)

For some reason, there is a discrepancy between the number of Ensembl IDs I supply BioMart with and the number of Ensembl IDs it returns.

length(rownames(gse_cellgenefiltered_cohort1))

[1] 23395

length(G_list$ensembl_gene_id)

[1] 23316

Another thing I noticed is that BioMart returns some of the Ensembl IDs more than once.

length(unique(G_list$ensembl_gene_id))

[1] 23314

I don't think there are any duplicated Ensembl IDs in the expression matrix.

length(unique(rownames(gse_cellgenefiltered_cohort1)))

[1] 23395

Would anyone know why this might be happening?

4 days ago
Mike Smith ★ 1.7k

There are a few things that might be going on, and it's hard to tell exactly without some examples of the missing or duplicated gene IDs. Here are some ideas though.

BioMart will silently drop any elements of values that aren't found in the database. There's no error or warning, you just don't get a hit. That's easy to see with a single value, harder to spot in 23,000:

## query not found in Ensembl
getBM(values = c("ENSG_NOT_REAL"),
      filters = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol"),
      mart = mart)
#> [1] ensembl_gene_id hgnc_symbol    
#> <0 rows> (or 0-length row.names)

You can try to identify which input values aren't returned in the results with something like genes[ !genes %in% G_list$ensembl_gene_id ]. If that finds something, I'd search the Ensembl website manually for a few of the IDs and try to understand why they might be missing from BioMart, e.g. they might be from an old Ensembl version and have since been retired. There are probably many possible reasons.
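As a rough sketch of that check, here is a version that runs without querying BioMart. The genes and G_list objects below are small stand-ins built from the IDs used elsewhere in this answer, not your real data. With the counts in the question (23395 unique inputs, 23314 unique IDs returned), the same check on the real objects should list 23395 - 23314 = 81 missing IDs.

```r
# Toy stand-ins for the real objects; no BioMart query is made here.
genes <- c("ENSG00000010404", "ENSG00000277796", "ENSG_NOT_REAL")
G_list <- data.frame(
  ensembl_gene_id = c("ENSG00000010404", "ENSG00000277796", "ENSG00000277796"),
  hgnc_symbol     = c("IDS", "CCL3L3", "CCL3L1")
)

# Input IDs that never appear in the results
missing_ids <- setdiff(genes, G_list$ensembl_gene_id)
print(missing_ids)
#> [1] "ENSG_NOT_REAL"

# Number of inputs with no hit at all
length(missing_ids)
#> [1] 1
```

setdiff() gives the same IDs as the %in% expression here because the inputs are unique; it would also drop repeats if they weren't.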

For completeness, I'll also point out that Ensembl BioMart will ignore duplicate entries in the values argument, e.g.

## duplicated input values
getBM(values = c("ENSG00000010404", "ENSG00000010404"),
      filters = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol"),
      mart = mart)
#>   ensembl_gene_id hgnc_symbol
#> 1 ENSG00000010404         IDS

However, it looks like you've already checked that this isn't the case in your data.

Regarding the duplicated entries in the results, this can occur if there is a one-to-many mapping between the two ID types you're trying to find e.g.

## one-to-many mapping
getBM(values = "ENSG00000277796",
      filters = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol"),
      mart = mart)
#>   ensembl_gene_id hgnc_symbol
#> 1 ENSG00000277796      CCL3L3
#> 2 ENSG00000277796      CCL3L1

Mapping between IDs from different organisations is never perfect, and it's pretty common to see instances like this, where a single Ensembl ID maps to two HGNC symbols (or vice versa). You could try to identify the duplicated entries with:

G_list[ duplicated(G_list$ensembl_gene_id) | duplicated(G_list$ensembl_gene_id, fromLast = TRUE), ]
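If you then need exactly one row per Ensembl ID, one option (a sketch only, not something BioMart does for you) is to collapse the symbols for each gene into a single string. The G_list below is a toy stand-in built from the examples above, and the ";" separator is an arbitrary choice.

```r
# Toy stand-in for the getBM() result, including the one-to-many case above
G_list <- data.frame(
  ensembl_gene_id = c("ENSG00000010404", "ENSG00000277796", "ENSG00000277796"),
  hgnc_symbol     = c("IDS", "CCL3L3", "CCL3L1")
)

# Paste together all distinct symbols reported for each Ensembl ID
symbols <- tapply(G_list$hgnc_symbol, G_list$ensembl_gene_id,
                  function(x) paste(unique(x), collapse = ";"))

# Rebuild a data frame with one row per Ensembl ID
collapsed <- data.frame(
  ensembl_gene_id = names(symbols),
  hgnc_symbol     = as.vector(symbols)
)
collapsed
```

This keeps every symbol visible rather than silently picking one of them.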
3 days ago

There are some issues to be aware of when using biomaRt to retrieve or convert your IDs. In your case, as Mike replied, you lost some information, possibly because some genes lack an HGNC symbol. To work around this, I suggest also including entrezgene_id and gene_biotype in the attributes you retrieve. In addition, use the left_join() function from dplyr to merge your original gene IDs with the converted IDs, so that every original gene ID is preserved. However, I don't know whether SummarizedExperiment objects allow that kind of operation directly.
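A minimal sketch of that join, using base R's merge() with all.x = TRUE as a stand-in for dplyr's left_join() so it runs without extra packages. The genes and G_list objects are toy data, not real query results; IDs with no hit are kept and get NA as their symbol.

```r
# Toy stand-ins: three input IDs, one of which BioMart would not return
genes <- c("ENSG00000010404", "ENSG00000277796", "ENSG_NOT_REAL")
G_list <- data.frame(
  ensembl_gene_id = c("ENSG00000010404", "ENSG00000277796", "ENSG00000277796"),
  hgnc_symbol     = c("IDS", "CCL3L3", "CCL3L1")
)

# Left join: keep every input ID, even those with no match in G_list
annotated <- merge(data.frame(ensembl_gene_id = genes), G_list,
                   by = "ensembl_gene_id", all.x = TRUE, sort = FALSE)

# Input IDs that received no symbol
annotated[is.na(annotated$hgnc_symbol), ]
```

Note that one-to-many mappings still produce several rows per ID after the join, and merge() does not promise to keep the input row order, so match back on ensembl_gene_id rather than on position before writing anything into the rowData of the experiment object.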

With respect to the duplicated IDs, they could be associated with the biotype of your genes. In my experience, lncRNAs sometimes have two HGNC symbols. Another way to deal with this is to use the distinct() function from dplyr, which keeps only the first row for each Ensembl ID. Here is a snippet of what you can do:

G_list <- distinct(G_list, ensembl_gene_id, .keep_all = TRUE) 

Best regards!
