Single Cell Gene Count Matrix with Ensembl IDs as Rownames. Need to convert to Gene Names.
17 months ago
achamess ▴ 90

I know this question has been asked in various forms before, and it seems straightforward, but I can't figure out how to get it to work. I've tried various things and spent a lot of time on it.

I have a gene count matrix with cells as columns and rownames are Ensembl IDs for mouse.

                [,1]
ENSMUSG00000104352    0
ENSMUSG00000104046    0
ENSMUSG00000102907    0
ENSMUSG00000025905    0
ENSMUSG00000103936    0
ENSMUSG00000093015    0

I tried something like this:

library(org.Mm.eg.db)  # also attaches AnnotationDbi, which provides mapIds()
rownames(counts) <- mapIds(org.Mm.eg.db, keys = rownames(counts), column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")

But the issue I run into is that I get many NAs, because not every Ensembl ID maps to a gene name. Also, some gene names have multiple Ensembl IDs mapping to them.

So if I run the code above, I get this output:

       [,1]
<NA>       0
Gm26206    0
Xkr4       0
Gm18956    0
<NA>       0
<NA>       0
<NA>       0
<NA>       0
<NA>       0
Gm7341     0

I saw this response, to keep Ensembl IDs if NA, but it didn't work because some gene names are duplicated and the matrix can't have duplicate row names.

R: converting Ensembl row names to Symbol ID outputs missing values in 'row.names' are not allowed

Can someone point me in the right direction on how to deal with the NAs and duplicates?

The goal is to replace the rownames with gene names, so that I don't have to keep looking up Ensembl IDs during my downstream Seurat work.

"because not every Ensembl ID is unique."

Do you have duplicate IDs in your matrix?

Sorry. I'll change the phrasing. Every Ensembl ID is unique but multiple Ensembl IDs map to the same gene name.
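
A quick way to see both problems on a toy vector. This is only a sketch: `symbols` is a hypothetical stand-in for the named character vector that mapIds() returns, and the Ensembl-style IDs are made up for illustration.

```r
# Hypothetical stand-in for the mapIds() result:
# names are Ensembl IDs, values are gene symbols (NA where nothing maps).
symbols <- c(ENSMUSG_A = "Xkr4", ENSMUSG_B = NA, ENSMUSG_C = "Gm7341",
             ENSMUSG_D = "Xkr4", ENSMUSG_E = NA)

# The Ensembl IDs themselves are unique ...
ids_unique <- !anyDuplicated(names(symbols))

# ... but some of them have no symbol at all,
n_missing <- sum(is.na(symbols))

# and some symbols are shared by more than one Ensembl ID.
mapped <- symbols[!is.na(symbols)]
shared_symbols <- unique(mapped[duplicated(mapped)])
```

Running the same three checks on the real mapIds() output shows how many row names would need a fallback.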

17 months ago

Assuming you have a data.frame df with gene_name and gene_id columns, you could use the gene name when it exists and is not duplicated, and fall back to the gene ID otherwise.

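# Gene names that occur more than once (table() drops NA by default)
dup_genes <- names(table(df$gene_name)[table(df$gene_name) > 1])

# Use the gene ID where the name is missing or ambiguous
df$feature <- ifelse(is.na(df$gene_name) | df$gene_name %in% dup_genes, df$gene_id, df$gene_name)
rownames(df) <- df$gene_id

# Look up the new label for each row of the count matrix
rownames(counts) <- df[rownames(counts), ]$feature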
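
To make the fallback logic concrete, here is a minimal self-contained sketch of the same approach on toy data. The IDs, symbols, and the small `counts` matrix are invented for illustration; in practice gene_name would come from something like mapIds().

```r
# Toy count matrix with made-up Ensembl-style IDs as row names.
counts <- matrix(1:10, nrow = 5,
                 dimnames = list(c("ENS1", "ENS2", "ENS3", "ENS4", "ENS5"),
                                 c("cell1", "cell2")))

# Hypothetical annotation table: ENS2 has no symbol, Xkr4 appears twice.
df <- data.frame(
  gene_id   = c("ENS1", "ENS2", "ENS3", "ENS4", "ENS5"),
  gene_name = c("Xkr4", NA, "Gm7341", "Xkr4", "Sox17"),
  stringsAsFactors = FALSE
)

# Symbols that occur more than once (table() ignores NA by default).
dup_genes <- names(which(table(df$gene_name) > 1))

# Fall back to the Ensembl ID when the symbol is missing or ambiguous.
df$feature <- ifelse(is.na(df$gene_name) | df$gene_name %in% dup_genes,
                     df$gene_id, df$gene_name)
rownames(df) <- df$gene_id

rownames(counts) <- df[rownames(counts), "feature"]
print(rownames(counts))
```

Here both Xkr4 rows and the unmapped row keep their IDs, while Gm7341 and Sox17 are renamed, so every row name stays unique.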

Thank you for putting me out of my misery :D Good to see the approach. It worked. Here is my complete code. Made a few changes.

counts_df <- as.data.frame(counts)

counts_df$gene_name <- mapIds(org.Mm.eg.db, keys = rownames(counts), column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")

counts_df$gene_id <- rownames(counts_df)

# Collect the duplicated gene names as a character vector; subsetting whole rows of counts_df here would leave %in% below with a data frame to match against instead of names
dup_genes <- unique(counts_df$gene_name[duplicated(counts_df$gene_name)])

counts_df$feature <- ifelse(is.na(counts_df$gene_name) | counts_df$gene_name %in% dup_genes, counts_df$gene_id, counts_df$gene_name)

rownames(counts) <- counts_df[rownames(counts), ]$feature
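
If you would rather keep the gene symbol for duplicated names instead of reverting to the Ensembl ID, one alternative (a different technique from the code above, shown only as a sketch on made-up values) is base R's make.unique(), which appends a numeric suffix to repeats.

```r
# Made-up IDs and symbols for illustration.
gene_id   <- c("ENS1", "ENS2", "ENS3", "ENS4")
gene_name <- c("Xkr4", NA, "Gm7341", "Xkr4")

# Use the Ensembl ID where no symbol was found ...
feature <- ifelse(is.na(gene_name), gene_id, gene_name)

# ... then disambiguate repeated symbols with a numeric suffix.
feature <- make.unique(feature, sep = ".")
print(feature)
```

The trade-off is that a suffixed name such as Xkr4.1 no longer tells you which Ensembl ID it came from, so it is worth keeping the ID-to-label table around.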