I am trying to annotate the results output file from DESeq2 so that it contains gene names and symbols. The RNA-seq count file I used comes from DEXSeq and contains Ensembl transcript IDs:
ENSMUSG00000000001:001
ENSMUSG00000000001:002
ENSMUSG00000000001:003
etc.
I have tried two methods to annotate the results.
1. I downloaded an annotation file from BioMart and merged it with the results:
library(DESeq2)
counts = read.delim("3mTA2.txt", header=T, row.names=1)
sample <- read.delim("~/sample.txt")
count.data.set <- DESeqDataSetFromMatrix(countData=counts, colData=sample,design= ~ genotype)
dds<-DESeq(count.data.set)
res <- results(dds)
annotation <- read.delim("mouse.annt.txt") # load annotation file from Biomart
res$EnsemblID <- row.names(res)
res <- merge(res, annotation, by = 'EnsemblID', all.x = TRUE)
It adds the annotation columns to the output file, but their values are blank.
2. I also tried AnnotationDbi:
library("AnnotationDbi")
library("org.Mmu.eg.db")
res$symbol <- mapIds(org.Mmu.eg.db,
                     keys=row.names(res),
                     column="SYMBOL",
                     keytype="ENSEMBL",
                     multiVals="first")
Error in .testForValidKeys(x, keys, keytype) :
None of the keys entered are valid keys for 'ENSEMBL'. Please use the keys method to see a listing of valid arguments.
Any suggestions?
ENSMUSG00000000001:001; ENSMUSG00000000001:002 - these refer to the different exons of the gene.
So the question, I suppose, is how to combine or merge the different exon counts for the same gene into one count for the gene.
Can this be done in DEXSeq or DESeq2?
You don't want to do that, since a read that overlaps more than one exon bin is counted once per bin, so summing the bins will double count it. Just run either htseq-count or featureCounts (which is much faster) and get gene-level counts directly.
The initial analysis was performed elsewhere, so I only have the DEXSeq count file with the Ensembl IDs of all the different exons of each gene. How can I use this file to proceed: either by collapsing the exon IDs into genes first, or by using the file in DESeq2 as it is and annotating afterwards?
That's unfortunate, particularly if you don't have the BAM or fastq files. In that case, the best you can do is remove the exon suffix (the :E??? part; :001, :002 and so on in your file) from the names, sum the counts per gene, and use that. Note that the results will then be approximate. You could do that with awk.
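Since the rest of the thread is in R, here is a minimal sketch of the same strip-and-sum step in base R rather than awk. The toy matrix and the sample names (sampleA, sampleB) are made up for illustration; with the real data you would start from the `counts` object read with read.delim() in the question. It assumes every row name has the form geneID:exon, so any summary rows a DEXSeq count file may carry (for example names starting with an underscore) would need to be dropped first.

```r
# Toy exon-level count matrix; row names mimic the IDs in the question.
# The values and the sample names are invented for illustration only.
counts <- matrix(
  c(10, 5, 0, 7,    # sampleA
    12, 3, 1, 9),   # sampleB
  ncol = 2,
  dimnames = list(
    c("ENSMUSG00000000001:001", "ENSMUSG00000000001:002",
      "ENSMUSG00000000001:003", "ENSMUSG00000000003:001"),
    c("sampleA", "sampleB")
  )
)

# Drop everything from the colon onwards to recover the gene ID.
gene_id <- sub(":.*$", "", rownames(counts))

# Sum all exon rows that share a gene ID; one row per gene remains.
gene_counts <- rowsum(counts, group = gene_id)

print(gene_counts)
```

The row names of `gene_counts` are then plain Ensembl gene IDs, which is the form that keytype="ENSEMBL" expects, so the matrix can go into DESeqDataSetFromMatrix() and the results can be annotated afterwards. Note also that the mouse annotation package is org.Mm.eg.db; org.Mmu.eg.db is the rhesus macaque package, so ENSMUSG IDs would not be found in it even without the exon suffix. As said above, the summed counts are only approximate.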