Hi all I have some doubts on rna-seq analysis. I have rna-seq data (paired end using trueseq) from patients - 4 tumor and one normal (not matched normal). I am trying to find differentially expressed genes. So far so good - aligned with tophat2, estimated expression with cufflinks, got counts with ht-seq count, did differential expression using DESeq2. But here are my concerns -
1. While counting with htseq-count I am getting very low counts assigned to features. On an average only 30% of the fragments are assigned to features (refseq gtf). Maximum of the remaining 70% fragments are ambiguous or no feature. Is it common to find such low percentage of counts ? I read one could get around 60%. I tried with all modes in htseq (reverse union, stranded union, unstranded union and with intersection nonempty) and the best result was with unstranded union mode.
2. For some of the features ht-seq count assigned zero counts. For example between region chr20:3776385-3786768 htseq gives -
3919 SAM alignment pairs processed. NM_001287516 0 NM_001287517 0 NM_001287518 0 NM_001287519 0 NM_001287520 0 NM_001287522 0 NM_001287524 5 NM_004358 0 NM_021872 0 NM_021873 0 __no_feature 82 __ambiguous 3683 __too_low_aQual 0 __not_aligned 0 __alignment_not_unique 149
Of ~3900 pairs 3683 are ambiguous and only 5 are assigned to one of the feature. I get this, since there are so many overlapping genes within 10kb, there is no way to tell which gene a fragment originated from, hence so many ambiguous. But when I used cufflinks for the same (some of) genes I get high FPKM values -
NM_001287516 1.69492e-71 NM_001287517 9.62116e-75 NM_001287518 1.05537e-103 NM_001287519 0.000182958 NM_001287520 31.2244 NM_001287522 6.82458 NM_001287524 1.72693 NM_004358 1.47296 NM_021872 0.017841 NM_021873 17.0096
Does this mean cufflinks does not account for ambiguous reads while calculating FPKM ?
3. Some of the threads I saw, say that for trueseq libraries we should use
fr-firststrand as library type during tophat2 alignment. But here it is metioned to use
fr-unstranded (I guess it should also be considered while using htseq). Which is the right way ?
4. Since I do not have replicates/samples for normal tissue, using DESeq2 is advisable or should I be better off with fold changes from FPKM?