I've done denovo transcriptome assembly using Stringtie with paired end RNA-seq reads coming from a mid/large size experiment (60+ samples) conducted on rabbit: I've followed developers' protocol, so alignment+assembly+merging. Now, from the GTF, I thought about two possible ways:
- Quantification via Stringtie using the denovo merged GTF file: then using authors' script, convert samples quantification into a single gene-level matrix to use with DESeq2.
- Using the merged GTF, building a transcriptome using gffread and then do alignment/mapping against the transcriptome: import counts and analyze with DESeq2
I want to ask if both seems reasonable and, if not, which is better. I've got some trouble when trying to implement them:
With the first approach: the same gene was present multiple times while doing DESeq2::results
> n_occur <- data.frame(table(rownames(countData))) > dim(n_occur)  39722 2 >n_occur[n_occur$Freq > 1,]  Var1 Freq <0 rows> (or 0-length row.names)
So no duplicates seems to be present.
dds <- DESeqDataSetFromMatrix(countData, colData, design=~0 + group) res <- results(dds, contrast = c("group", "D31", "NT0"), alpha = 0.05, pAdjustMethod = "BH") # lfc shrinkage to have best results. res_lfcsh <- lfcShrink(dds, contrast = c('group', 'D31', 'NTO'), res = res, type = "ashr")
"","baseMean","log2FoldChange","lfcSE","stat","pvalue","padj" "ENSOCUG00000034460",4.82720066318387,4.93460839115356,1.98406628126073,2.48711872065986,0.0128782414726579,NA "ENSOCUG00000034460.1",4.82720066318387,4.93460839115356,1.98406628126073,2.48711872065986,0.0128782414726579,NA "ENSOCUG00000034460.2",4.82720066318387,4.93460839115356,1.98406628126073,2.48711872065986,0.0128782414726579,NA
These "genes", like
ENSOCUG00000034460.1, are duplications and are not present in original gene count matrix obtained via prepDE. Where does they come from? They appear many times (200-300 and more).
With the second: is it possible to create a linkedTxome to use with Salmon with the denovo GTF and the transcriptome generated via gffread? Is it necessary to build a TxDb to analyze Salmon data?
Thanks and sorry for the multiple questions.