Question: RNA-seq High Variation Between and Within Groups
gravatar for jacobsen.jeremy
2.2 years ago by
United States
jacobsen.jeremy40 wrote:

I have a simple two group experiment. Treatment Vs. Control, 4 biological replicates. The design is 150bp. Read depth is 100 million reads per sample, paired end. Each sample has a ERCC spike-in for post processing quality assessment. Pipeline is as follows->

  1. Run FastQC on each fastq file.
  2. Quality looks excellent, but Illumina adapters are present.
  3. Paired-End trimmomatic run to remove adapter sequences.
  4. FastQC is run again to verify adapter removal.
  5. Lanes are merged using bash cat.
  6. Tophat2 is run in paired end mode to align reads.

Command used:['tophat2','-p','64','-r','10','--mate-std-dev','51','--mate-inner-dist','-101','--b2-very-sensitive','--no-novel-juncs','--microexon-search', '--coverage-search','--library-type','fr-secondstrand','--output-dir',output_folder,'--GTF',gtf_file,bowtie2_builds,leftfileString, rightfileString])

Summary files look okay with ~70-75% concordant alignment.

HTseq is then used to count reads. So now I construct a read count table and the data are highly variable both within and between groups.

Green (Vehicle treated), Orange (Drug Treated), scale is log2!

Does anyone have an idea about what could cause this? Thanks!

For context, here are the box plots from a different experiment I ran a few months back:

rna-seq tophat2 htseq • 672 views
ADD COMMENTlink modified 2.2 years ago by h.mon31k • written 2.2 years ago by jacobsen.jeremy40

See: How to add images to a Biostars post

You need the direct link to the image, not the link to the page that hosts the image.

ADD REPLYlink written 2.2 years ago by RamRS30k

What do the box-plots show? You didn't tell us.

As you are finding the results strange, I would advise to not concatenate the lanes, but rather map and count them separately. Have all samples been sequenced to approximately the same depth? Besides lane, are there other batch effects?

On the other hand, this may be true biological variance, some experiments have higher variances and lower fold-changes than others.

ADD REPLYlink written 2.2 years ago by h.mon31k

Sorry. The box plots show the distribution of read counts per replicate. The first four are the vehicle treated replicates and the last four are drug treated. The desired read depth was 100M but what we received varied between 65M and 120M per replicate. There doesn't appear to be any correlation between read depth and variation.

ADD REPLYlink written 2.2 years ago by jacobsen.jeremy40

Just a side comment: are yo sure that your data is forward-stranded? option --library-type fr-secondstrand. Most popular RNAseq lib. prep. protocols today are reverse-stranded.

ADD REPLYlink written 2.2 years ago by grant.hovhannisyan2.0k

I was originally using QoRTs for read counting and it complained that the strandedness might be different than specified so I used Chipster to verify that it is fr-secondstranded. The runs I had done in the past were firststranded, but I think the sequencing facility has upgraded their platform.

ADD REPLYlink written 2.2 years ago by jacobsen.jeremy40

What is the y-axis? Please provide some more details.

ADD REPLYlink written 2.2 years ago by ATpoint40k

Y axis is (RLE = log-ratio of read count to median read count across sample). These were generated using RUVseq which is software that uses the ERCC spike-ins to nomalize the data. The data in the figure is unnormalized.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by jacobsen.jeremy40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1728 users visited in the last hour