Problem with Ensembl version identifiers after running DESeq2
6.7 years ago
Lila M ★ 1.3k

Hi everybody, I have a problem with my Ensembl IDs after running DESeq2 (I'm using the hg38 genome):

dds <- DESeq(ds_matrix)
res <- results(dds)

                   baseMean log2FoldChange     lfcSE      stat       pvalue         padj  
ENSG00000176124.11  168.67880  4.991104 0.2797296 17.842601 3.299728e-71 6.057971e-67

As you can see, the identifiers carry a version suffix (ENSG00000176124.11, for example), so when I tried to annotate the genes using

res$symbol <- mapIds(,
                     keys = row.names(res),
                     column = "SYMBOL",
                     keytype = "ENSEMBL",
                     multiVals = "first")

or using gage, the IDs with the dot and the number after it are not recognized and cannot be matched. Does anyone know how to deal with this problem?


DESEq2 Ensembl identifiers annotation • 2.8k views
6.7 years ago

As far as I'm aware, the number after the . at the end of an Ensembl gene ID denotes the version. If you omit the version and search for ENSG00000176124 instead, that should fix the issue. The bigger question is why you have the version in the gene IDs in the first place...


Because the counts were generated with salmon and the original files included the identifiers with the dot. I don't know whether I have to remove it from the original files or whether there is another way to handle it, because the IDs with the dot are not recognized.


It depends on how the salmon index was generated. Generally you can just strip the .xx version suffix from your IDs to make it work, e.g. keys = gsub("\\..*$", "", row.names(res))
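For illustration, here is a minimal base-R sketch of that stripping step on a small character vector. The helper name strip_version and the second and third IDs are made up for the example; only ENSG00000176124.11 comes from the question. It also checks for duplicates, since two versioned IDs could in principle collapse onto the same key once the suffix is removed.

```r
# Example IDs: the first is from the question, the other two are
# illustrative placeholders (one versioned, one already unversioned).
versioned_ids <- c("ENSG00000176124.11",
                   "ENSG00000000003.15",
                   "ENSG00000000005")

# Hypothetical helper: drop everything from the first dot onwards,
# using the same pattern as suggested above.
strip_version <- function(ids) {
  gsub("\\..*$", "", ids)
}

stripped_ids <- strip_version(versioned_ids)
print(stripped_ids)

# anyDuplicated() returns 0 when all stripped IDs are still unique;
# a non-zero value means some IDs collapsed onto the same key.
n_dup <- anyDuplicated(stripped_ids)
print(n_dup)
```

If the duplicate check comes back clean, the stripped vector can be passed as keys (or assigned back to the row names) before annotation.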


Hey, I ran into the same issue after also using salmon to quantify against a GENCODE index. This seemed to work for me, but a lot of the IDs map to NA instead of a gene symbol. Is there a way to limit the analysis to well-annotated genes? I'm not sure what to make of these results: I see huge fold changes, but mostly for genes that I can't identify.

> row.names(YCNT.05_Subset) = gsub("\\..*", "",row.names(YCNT.05_Subset)) 

> YCNT.05_Subset$genename <- mapIds(,keys = row.names(YCNT.05_Subset), column = "SYMBOL", keytype = "ENSEMBL", multiVals = "first")

'select()' returned 1:1 mapping between keys and columns

> YCNT.05_Subset

log2 fold change (MLE): condition YB6CNT vs YBJCNT 
Wald test p-value: condition YB6CNT vs YBJCNT 
DataFrame with 24 rows and 7 columns
                    baseMean log2FoldChange     lfcSE      stat       pvalue         padj    genename
                   <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric> <character>
ENSMUSG00000110704 22.424026      -45.31601  5.115848 -8.662495 4.615481e-18 8.555948e-14          NA
ENSMUSG00000082016 21.264101      -30.86202  6.149671 -4.855873 1.198576e-06 2.613953e-03          NA
ENSMUSG00000094568  9.902831      -30.47642  6.151707 -4.791584 1.654700e-06 3.228843e-03          NA
ENSMUSG00000103651 52.698629      -29.23280  6.149360 -4.591178 4.407518e-06 7.781368e-03          NA
ENSMUSG00000059195 57.152734      -27.95013  5.706919 -4.722360 2.331231e-06 4.321520e-03          NA
...                      ...            ...       ...       ...          ...          ...         ...
ENSMUSG00000005800  21.84068       22.60947  4.501899  4.800077 1.586044e-06 3.228843e-03        Mmp8
ENSMUSG00000022026  14.05982       26.72935  6.134014  4.194537 2.734298e-05 4.223921e-02       Olfm4
ENSMUSG00000084936  55.66428       28.41237  3.742702  7.324221 2.402904e-13 1.781754e-09          NA
ENSMUSG00000093752  19.27306       33.77075  4.913983  6.668876 2.577693e-11 1.194600e-07          NA
ENSMUSG00000074555  11.50912       36.31998  5.086493  6.943876 3.814853e-12 2.357261e-08     Gm10714
