Hi,
I have run my BAM files through Rsubread (ie using featurecounts) and then through DESeq2 for read normalisation, however, when I choose to count by features (ie by exon not gene) I get multiple rows returned for the same transcript ID:
So, taking the ensembl gene IDs from my normalised counts table as the keys for mapID function in annotationdbi to return the gene symbol, txid, txbiotype, etc:
normalized_counts_new2$SYMBOL <- mapIds(EnsDb.Hsapiens.v86, keys = normalized_counts_new2$GeneID, column = "SYMBOL", keytype = "GENEID", multiVals = "first")
normalized_counts_new2$TXID <- mapIds(EnsDb.Hsapiens.v86, keys = normalized_counts_new2$GeneID, column = "TXID", keytype = "GENEID", multiVals = "first")
The results of the first gene in my table shows that this function has called all the rows the same ensembl transcript IDs, but for this gene we would expect two transcripts (DDX11L1 ensembl page). I guess to be expected using multivals = "first" but I can't set this to list or characterlist without getting an error for mismatch between number of rows between replacement and data
So, then I tried using a select function in place of multiple mapIDs. This gives me the result I would expect, but I can't directly write this output to multiple new columns in the counts table. Of course I thought about joining the normalised counts table and the newly generated annotation table but the problem is that the number of rows in the counts table is 1182163 and in the select function results is 198002, so I wouldn't know which annotations join to which rows:
normalized_counts_new3 <- normalized_counts_new
normalized_counts_new3_annotations <- AnnotationDbi::select(EnsDb.Hsapiens.v86, keys=normalized_counts_new3$GeneID, columns = c("SYMBOL", "GENEBIOTYPE", "GENENAME", "TXID", "TXBIOTYPE"), keytype = "GENEID", multiVals = "list")
SO, my Qs:
Why am I getting multiple rows for the same transcript generated during subread?
Am I supposed to consolidate the counts on different rows with duplicate transcriptIDs...? seems odd if so
How can I annotate these data properly? The select function gives the expected different transcripts but can't be joined to the counts table as the number of rows is 6 fold less
update: I've read some more and now understand that Rsubread by features is actually counting at exon level, so the multiple rows represent the different exons (can be annotated to show exonID) Answer in this post from Rsubread co-author was v helpful How to get read counts on transcript level using featurecounts?
So new Q is as such:
Are you completely sure that you want transcript level analysis, and not gene level analysis? I don't think DESeq is designed for transcript level analysis.