I'm confused on how should I call htseq-count for my dataset concerning option --stranded [yes or no].
I'm analysing paired-end RNASeq data, at the alignment step with tophat I produced one single bam file from the two reads (R1, R2). So I have one bam file with both strands, this file is sorted by name with samtools and when calling htseq-count I firstly assumed I had to say stranded=yes, I got my counts and there were many zeros, so I run again with stranded=no and got many more read counts, less genes with zeros and genes generally with more counts. These are the numbers for both performances:
stranded=yes no feature = 49 258 937 ambiguous = 8 393
stranded=no no feature = 4 654 795 ambiguous = 1781 755
Since the number of "no feature" is considerably lower in stranded=no, should I use this tag? Is this a proof that this is right call?
So I don't know if my logic is right, I had strand specific data but since I merged them (R1 and R2) in the alignment step, I have everything together in the bam and I have to tell this to htseq-count by setting the -stranded=yes even if this is not completely right, any sense??
As another clue I run infer_experiment.py with the merged-sorted bam file from the RSeQC package and the output was:
This is PairEnd Data Fraction of reads failed to determine: 0.0012 Fraction of reads explained by "1++,1--,2+-,2-+": 0.0827 Fraction of reads explained by "1+-,1-+,2++,2--": 0.9161
This tells me I have strand-specific again ... so maybe I should come back to htseq-count --stranded=yes??
Any help is very welcome!! Thanks a lot!