I have a simple two-group experiment: treatment vs. control, with 4 biological replicates. Reads are 150 bp, paired-end, at a depth of 100 million reads per sample. Each sample has an ERCC spike-in for post-processing quality assessment. The pipeline is as follows:
- Run FastQC on each fastq file.
- Quality looks excellent, but Illumina adapters are present.
- Trimmomatic is run in paired-end mode to remove adapter sequences.
- FastQC is run again to verify adapter removal.
- Lanes are merged using bash cat.
- TopHat2 is run in paired-end mode to align reads:
subprocess.call(['tophat2','-p','64','-r','10','--mate-std-dev','51','--mate-inner-dist','-101','--b2-very-sensitive','--no-novel-juncs','--microexon-search', '--coverage-search','--library-type','fr-secondstrand','--output-dir',output_folder,'--GTF',gtf_file,bowtie2_builds,leftfileString, rightfileString])
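For reference, the TopHat manual describes the mate inner distance as the mean fragment length minus the two read lengths, so a negative value means the mates overlap. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the 199 bp fragment length is not a measured value, just the number that reproduces the -101 above for untrimmed 150 bp reads.

```python
def mate_inner_distance(mean_fragment_len, read_len):
    """Expected gap between the two mates; negative when the mates overlap."""
    return mean_fragment_len - 2 * read_len


# Hypothetical 199 bp fragments with 150 bp reads: the mates overlap by 101 bp.
print(mate_inner_distance(199, 150))  # -101

# The example from the TopHat manual: 300 bp fragments, 50 bp reads.
print(mate_inner_distance(300, 50))   # 200
```

After adapter trimming the reads are no longer all 150 bp, so the value derived this way is only approximate.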
The TopHat2 alignment summaries look okay, with ~70-75% concordant alignment.
HTSeq is then used to count reads. When I build the read count table from the HTSeq output, the counts are highly variable both within and between groups.
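To make the table-building step concrete, here is a rough sketch of how the per-sample count files could be combined and sanity-checked. It assumes the usual htseq-count output (one `gene_id<TAB>count` line per feature, with summary rows such as `__no_feature` prefixed by `__`) and that the spike-in IDs start with `ERCC-`. The function names and sample names are made up for illustration; this is not the exact script I ran.

```python
import csv
import io


def read_htseq_counts(handle):
    """Parse one htseq-count output into {gene_id: count}."""
    counts = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank or malformed lines and summary rows such as __no_feature.
        if len(row) != 2 or row[0].startswith("__"):
            continue
        counts[row[0]] = int(row[1])
    return counts


def build_count_table(per_sample):
    """Turn {sample: {gene: count}} into (genes, samples, rows), one row per gene."""
    samples = sorted(per_sample)
    genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    rows = [[per_sample[s].get(g, 0) for s in samples] for g in genes]
    return genes, samples, rows


def library_sizes(samples, rows):
    """Total counted reads per sample (column sums of the table)."""
    return {s: sum(r[i] for r in rows) for i, s in enumerate(samples)}


def ercc_fraction(counts, prefix="ERCC-"):
    """Share of counted reads assigned to spike-ins (assumed ID prefix)."""
    total = sum(counts.values())
    ercc = sum(n for g, n in counts.items() if g.startswith(prefix))
    return ercc / total if total else 0.0


if __name__ == "__main__":
    # Tiny in-memory stand-ins for real htseq-count output files.
    demo = {
        "control_1": read_htseq_counts(
            io.StringIO("ERCC-00002\t10\ngeneA\t90\n__no_feature\t5\n")),
        "treated_1": read_htseq_counts(
            io.StringIO("ERCC-00002\t50\ngeneA\t150\n__no_feature\t7\n")),
    }
    genes, samples, rows = build_count_table(demo)
    print(library_sizes(samples, rows))
    for name in samples:
        print(name, ercc_fraction(demo[name]))
```

Comparing library sizes and the ERCC fraction across samples is one quick way to see whether the variability is already present in the raw totals, before any normalization.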
Does anyone have an idea about what could cause this? Thanks!
For context, here are the box plots from a different experiment I ran a few months back: