Issues with adding gene name and entrez ID to DESeq2 result having Ensembl ID.
1
0
Entering edit mode
2.2 years ago

I have added gene name and Entrez ID in my DESeq2 result. The commands I have used are:

res$hgnc_symbol <- convertIDs(gsub("\\..*","", row.names(res)), "ENSEMBL", "SYMBOL", org.Hs.eg.db) res$entrezgene <- convertIDs(gsub("\\..*","", row.names(res)), "ENSEMBL", "ENTREZID", org.Hs.eg.db)

resOrdered <- res[order(res\$pvalue),]


After, checking the object resOrdered:

head(resOrdered)


I am getting like this:

     gene_id             stat               pvalue                 padj
<numeric>            <numeric>            <numeric>
ENSG00000280228.1   -5.9096792673878 3.42774467723773e-09 1.53944643600176e-05
ENSG00000225555.1  -5.88657749721615 3.94274922781857e-09 1.53944643600176e-05
ENSG00000234616.7  -5.77542188212913 7.67605073778235e-09 1.99807600704475e-05
ENSG00000058866.13 -4.88480635581578 1.03530552769589e-06  0.00163524543935493
ENSG00000180152.3   -4.8645382294294 1.14724332367248e-06  0.00163524543935493
ENSG00000244968.5  -4.84652440600431 1.25643137868224e-06  0.00163524543935493
hgnc_symbol  entrezgene
<character> <character>
ENSG00000280228.1           NA          NA
ENSG00000225555.1           NA          NA
ENSG00000234616.7          JRK        8629
ENSG00000058866.13        DGKG        1608
ENSG00000180152.3           NA          NA
ENSG00000244968.5     LIFR-AS1   100506495


Here, I am not getting some gene names and Entrez ID and it is showing NA.

I have aligned my data to GRCh38 (took GTF from same assembly) with STAR and count file were created using htseq-count.

What could be the reason of these. Please put your valuable suggestion how I should go forward with this?

RNA-Seq DESeq2 ensembl • 1.6k views
ADD COMMENT
1
Entering edit mode

Did you check your GTF file? Instead of using gene_id as identifier, you need to use gene_name to get the desired output (gene names) when you do the counting. No mapping would be needed in that case. See an example line of annotation below.

chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.1";


gene_id "ENSG00000223972.5" = gene_name "DDX11L1"

gene_id "ENSG00000280228.1"= gene_name "AC079753.1"

ADD REPLY
0
Entering edit mode

Thank you so much. If I'm not wrong then org.Hs.eg.db package has no record for these IDs. The gene name exist in the GTF files when I checked it manually. Yes that's a great idea. I should Use gene_name instead of id while counting with htseq.

ADD REPLY
2
Entering edit mode
2.2 years ago

The issue, based on your output, is that these Ensembl genes have no equivalent records in Entrez or HGNC. Reasons for this can vary, but it usually implies that the gene is not actually experimentally confirmed to exist, or is novel and has minimal annotation. These databases and the services that interact with them are not updated dynamically, as far as I am aware, and operate on version release cycles.

For example, ENSG00000280228 is a TEC (To be Experimentally Confirmed) gene. The others appear to be novel and have minimal annotation.

Kevin

ADD COMMENT
2
Entering edit mode

In addition, because it may be of value to the OP, I would also add that when performing gene ID conversions sometimes the mappings may not have 1:1 concordance and thus one will have to decide what to with IDs which map to multiple other IDs, and the meaning in his particular experiment, so be on the lookout (maybe for the particular function's output used in the question this is not the case)

ADD REPLY
0
Entering edit mode

Indeed, it can be a tricky procedure (mapping between different annotations). Utmost care must be taken.

ADD REPLY
0
Entering edit mode

Thanks for giving your time. Yes that is true. The name exist in the GTF file. Shall I use BioMart for adding gene names instead of AnnotationDb and org.Hs.eg.db?

ADD REPLY
0
Entering edit mode

You could try the solution mentioned above, i.e., using a different ID during HTseq; however, you will still end up with NA entries. Using biomaRt will also present the same problem. There is not really any comprehensive way around this. Using Ensembl gene IDs is already, perhaps, the best way, at least for performing read count abundance. After, you can then filter down the raw data 'logically' for particular genes, such as protein coding only, and then using these in, e.g., DESeq2 or EdgeR.

ADD REPLY

Login before adding your answer.

Traffic: 1810 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6