Dear Community,
I am trying to generate RNAseq "feature count" tables with htseq-count, and I have run into an issue regarding how to properly deal with stranded sequence data.
Background: Paired-end RNA libraries were prepared using a stranded protocol (Illumina TruSeq Stranded mRNA Sample Preparation Kit), and the reads were mapped to the reference genome using tophat2. I was provided the tophat2 accepted-hits.bam files, and asked to conduct differential expression analysis, which I was hoping to do via DESeq2 after first generating feature count tables using htseq-count.
Problem: I've learned that the tophat2 mapping was simply conducted using default settings, which means that the library type was assumed to be unstranded (--library-type fr-unstranded
), and not stranded (--library-type fr-firststrand
).
Questions: Since the mappings were performed under the assumption that reads were unstranded, should I continue to assume likewise while running htseq-count (via --stranded=no
)? Or should I request that the data be remapped using the correct library type in tophat2, and only then proceed with htseq-count (using, I believe, --stranded=reverse
)?
Many thanks!
Thank you for the reply. So, if I'm understanding correctly, it should be possible to assume strandedness while applying htseq-count because, presumably, htseq-count is actually inspecting each read to determine strand, rather than relying on the tophat2 tag. Is that indeed what you're saying?
Thanks again!
in the strand specific protocol reads are separated into file1 or file2 based on their strand. For example all reads in file2 are in the same orientation as the original transcript. So the tool will use only one of the files, the one that corresponds to the correct strand.