DESeq2 + Rsubread: multiple rows for the same transcript ID?? Annotationdbi mapID and select give different results??
Entering edit mode
12 months ago
grt9 • 0


I have run my BAM files through Rsubread (ie using featurecounts) and then through DESeq2 for read normalisation, however, when I choose to count by features (ie by exon not gene) I get multiple rows returned for the same transcript ID:

So, taking the ensembl gene IDs from my normalised counts table as the keys for mapID function in annotationdbi to return the gene symbol, txid, txbiotype, etc:

normalized_counts_new2$SYMBOL <- mapIds(EnsDb.Hsapiens.v86, keys = normalized_counts_new2$GeneID, column = "SYMBOL", keytype = "GENEID", multiVals = "first")
normalized_counts_new2$TXID <- mapIds(EnsDb.Hsapiens.v86, keys = normalized_counts_new2$GeneID, column = "TXID", keytype = "GENEID", multiVals = "first")

The results of the first gene in my table shows that this function has called all the rows the same ensembl transcript IDs, but for this gene we would expect two transcripts (DDX11L1 ensembl page). I guess to be expected using multivals = "first" but I can't set this to list or characterlist without getting an error for mismatch between number of rows between replacement and data mapID function results

So, then I tried using a select function in place of multiple mapIDs. This gives me the result I would expect, but I can't directly write this output to multiple new columns in the counts table. Of course I thought about joining the normalised counts table and the newly generated annotation table but the problem is that the number of rows in the counts table is 1182163 and in the select function results is 198002, so I wouldn't know which annotations join to which rows:

normalized_counts_new3 <- normalized_counts_new
normalized_counts_new3_annotations <- AnnotationDbi::select(EnsDb.Hsapiens.v86, keys=normalized_counts_new3$GeneID, columns = c("SYMBOL", "GENEBIOTYPE", "GENENAME", "TXID", "TXBIOTYPE"), keytype = "GENEID", multiVals = "list")

select function results

SO, my Qs:

  • Why am I getting multiple rows for the same transcript generated during subread?

  • Am I supposed to consolidate the counts on different rows with duplicate transcriptIDs...? seems odd if so

  • How can I annotate these data properly? The select function gives the expected different transcripts but can't be joined to the counts table as the number of rows is 6 fold less

rsubread ens.db deseq2 transcriptid annotationdbi • 665 views
Entering edit mode

update: I've read some more and now understand that Rsubread by features is actually counting at exon level, so the multiple rows represent the different exons (can be annotated to show exonID) Answer in this post from Rsubread co-author was v helpful How to get read counts on transcript level using featurecounts?

So new Q is as such:

  • if transcriptID is mapped using ensID and Rsubread isn't designed for transcript level analysis, then how do I know whether my gene expression is from a protein coding transcript or not?
Entering edit mode

Are you completely sure that you want transcript level analysis, and not gene level analysis? I don't think DESeq is designed for transcript level analysis.


Login before adding your answer.

Traffic: 861 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6