I've done denovo transcriptome assembly using Stringtie with paired end RNA-seq reads coming from a mid/large size experiment (60+ samples) conducted on rabbit: I've followed developers' protocol, so alignment+assembly+merging. Now, from the GTF, I thought about two possible ways:
- Quantification via Stringtie using the denovo merged GTF file: then using authors' script, convert samples quantification into a single gene-level matrix to use with DESeq2.
- Using the merged GTF, building a transcriptome using gffread and then do alignment/mapping against the transcriptome: import counts and analyze with DESeq2
I want to ask if both seems reasonable and, if not, which is better. I've got some trouble when trying to implement them:
With the first approach: the same gene was present multiple times while doing DESeq2::results
> n_occur <- data.frame(table(rownames(countData)))
> dim(n_occur)
[1] 39722 2
>n_occur[n_occur$Freq > 1,]
[1] Var1 Freq
<0 rows> (or 0-length row.names)
So no duplicates seems to be present.
dds <- DESeqDataSetFromMatrix(countData, colData, design=~0 + group)
res <- results(dds, contrast = c("group", "D31", "NT0"), alpha = 0.05,
pAdjustMethod = "BH")
# lfc shrinkage to have best results.
res_lfcsh <- lfcShrink(dds, contrast = c('group', 'D31', 'NTO'), res = res, type = "ashr")
Visualizing results
"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj"
"ENSOCUG00000034460",4.82720066318387,4.93460839115356,1.98406628126073,2.48711872065986,0.0128782414726579,NA
"ENSOCUG00000034460.1",4.82720066318387,4.93460839115356,1.98406628126073,2.48711872065986,0.0128782414726579,NA
"ENSOCUG00000034460.2",4.82720066318387,4.93460839115356,1.98406628126073,2.48711872065986,0.0128782414726579,NA
These "genes", like ENSOCUG00000034460.1
, are duplications and are not present in original gene count matrix obtained via prepDE. Where does they come from? They appear many times (200-300 and more).
With the second: is it possible to create a linkedTxome to use with Salmon with the denovo GTF and the transcriptome generated via gffread? Is it necessary to build a TxDb to analyze Salmon data?
Thanks and sorry for the multiple questions.