Question: Taking into account antinsense reads when analyzing stranded Illumina data with DESeq2
0
gravatar for snamjoshi87
2.7 years ago by
snamjoshi8740
snamjoshi8740 wrote:

I have just finished running count commands using HTSeq-Count and am about to begin DGE analysis with DESeq2 using the vignette as a guide. However, I found this blog post online which indicates that there may be issues if you do not take into account the fact that Illumina data produces antisense reads.

Having read the blog post a few times, I'm still a little confused. It seems that there is now an argument to the summarizeOverlap() function preprocess.reads=invertStrand that allows you to take strandedness into account. Is this all I need to alter for my data? Or do I need to invert the strand using GenomicRanges? Outside of this, I did not take the strandedness into account with my alignment or counts so I am concerned that my raw count data may also be incorrect.

Here is what I ran for Tophat2 alignment:

tophat2 -G <GTF file> -p 4 -o <output folder> <index> <FASTQ file>

And here is what I ran for HTSeq-Count:

python -m HTSeq.scripts.count -s yes -a 10 <name sorted SAM> <GTF file> > <output>

As you can see, I did not take into account the library type for my tophat2 run. For HTSeq-Count I specified it was stranded but not to reverse it. So I really am not sure where this leaves me...

According the facility that did the sequencing, they used the TruSeq Stranded Total RNA kit. This experiment is single-read from mouse neural tissue (RIPSeq). Let me know if any more information is needed.

In summary, my questions:

  1. Is my count data okay as is?
  2. What do I need to do in DESeq2 to correct for this antisense issue?

Versions: Tophat 2.1.1 with Bowtie 2.2.9 and Samtools 0.1.19; HTSeq 0.6.1

rna-seq deseq2 • 1.4k views
ADD COMMENTlink modified 2.7 years ago by Devon Ryan89k • written 2.7 years ago by snamjoshi8740
3
gravatar for Devon Ryan
2.7 years ago by
Devon Ryan89k
Freiburg, Germany
Devon Ryan89k wrote:

The blog post you linked to seems to have been written by someone without much experience in RNAseq (while the DESeq2 tutorial demonstrates how to do counting in R, I don't know of many people who would actually do that (if they did, they'd use subRead anyway)). I can assure you that the DESeq2 authors have plenty of experience with both stranded and unstranded datasets and you don't need to do anything in DESeq2 for dealing with strand-specific datasets (you're just giving it counts, it doesn't matter what the underlying library type is at that point). Note that with anything recent, you'll want -s reverse with htseq-count rather than -s yes. This is for historical reasons (early on the common stranded protocols worked different).

Regarding tophat2, you might get slightly higher alignment rates if you tell it your data is stranded. You're unlikely to get completely different results by not telling the aligner this, you'll just have a bit higher false negative rate.

ADD COMMENTlink written 2.7 years ago by Devon Ryan89k

By using the command -s reverse I got a lot more counts than what I originally had. The strandedness issue is very confusing to me but I think I figured it out after digging through old forum posts. Thanks for your help.

ADD REPLYlink written 2.7 years ago by snamjoshi8740
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 896 users visited in the last hour