Question

batch effect in RNAseq analysis using tophat cufflinks pipeline

5.8 years ago

yff • 0

Hi everyone,

I am using the tophat and cufflinks pipeline to do RNA-seq analysis on my data, and I am new to RNA-seq. I suspect there is a batch effect in my RNA-seq data, but I am not sure how to detect and correct for it. Do you have any ideas about this?

Thanks,

RNA-Seq tophat batch-effect cufflinks • 2.3k views


See comment from the Tophat author

[Image: screenshot of the Tophat author's comment]

Comment by andrew.j.skelton73, 5.8 years ago

score 2 · Answer 1 · 2019-01-08

Tophat is deprecated, avoid if possible.

Firstly, visualise your problem with a PCA - this will give you clues as to where your primary variation is coming from, and if a batch effect is clearly present. Try one of the following methodologies:

Stringtie and DESeq2

Stringtie is an alternative to Cufflinks. It has a mode, and a script called prepDE.py, that will prepare Stringtie quantifications for analysis in DESeq2 (another popular analysis framework that can deal with batch effects). See the DESeq2 Quick Start section for a note on batch effects.

Quantify Known Annotation:

1. Install Salmon (I'd recommend via Bioconda)
2. Create a Salmon index from the transcript sequences (Salmon indexes a transcripts Fasta file, not a GTF; a human example is here)
3. Quantify your fastq with Salmon
4. Using RStudio, import your counts using the tximport package
5. Voom-transform your data using limma (see section 15, page 69 here)
6. Visualise your data using PCA and check known variables
7. Account for your batch effect using an additive model design (see chapter 9, page 40)
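
To make the additive model idea in the last step more concrete, here is a minimal sketch in Python with NumPy. It is not limma (which fits this kind of model per gene in R, with a design such as `model.matrix(~ batch + condition)`); it only shows how including a batch column lets the condition effect be estimated separately from the batch shift. The sample sheet, the effect sizes and all variable names are made up for illustration.

```python
import numpy as np

# Hypothetical sample sheet: 8 samples, condition and batch are NOT confounded
# (each batch contains both conditions).
condition = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # 0 = control, 1 = treated
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])      # 0 = batch A, 1 = batch B

# Additive design, i.e. the matrix form of the formula ~ batch + condition:
# columns are intercept, batch indicator, condition indicator.
design = np.column_stack([np.ones(8), batch, condition])

# Made-up log-expression values for one gene:
# baseline 5, batch shift 2, treatment effect 1, plus a little noise.
rng = np.random.default_rng(1)
y = 5.0 + 2.0 * batch + 1.0 * condition + rng.normal(0, 0.05, size=8)

# Ordinary least squares fit of the additive model.
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, batch_effect, condition_effect = coef

# The model is only estimable if batch and condition are not perfectly confounded,
# which shows up as a design matrix with full column rank.
full_rank = np.linalg.matrix_rank(design) == design.shape[1]

print(f"batch effect: {batch_effect:.2f}, condition effect: {condition_effect:.2f}")
print("design is full rank:", full_rank)
```

If every treated sample had been run in one batch and every control in the other, the two indicator columns would be identical and no model could separate them, which is why the rank check is worth doing before any fitting.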

Novel Transcript Discovery (Gene Level Analysis)

Note: There is a lot of noise when performing this, so be prepared to implement several heuristic filters on top to get plausible transcripts

1. Install Stringtie (available in Bioconda here) (or Cufflinks)
2. Install STAR (available on Bioconda here)
3. Build a STAR index (see documentation here)
4. Align each sample using STAR
5. Run Stringtie on each sample (results in a GTF file per sample) (or Cufflinks)
6. Run Stringtie Merge on the GTF files to get an experiment-wide GTF
7. Using your input genome Fasta file and newly merged GTF, extract the transcript sequences. The easiest way is gffread, which is now built in to stringtie.
8. Build a Salmon index from your new transcripts Fasta file
9. Quantify your fastq with Salmon
10. Follow step 4 onwards from the Quantify Known Annotation workflow above (importing with tximport).
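
The command-line steps above could be strung together roughly as follows. This is a dry-run sketch in Python that only builds and prints the commands, nothing is executed. The file names, output names, thread count and the `build_commands` helper are all hypothetical; it assumes paired-end gzipped fastq files and unstranded libraries (hence `--outSAMstrandField intronMotif`), so check each tool's documentation before running anything.

```python
import shlex

def build_commands(samples, genome_fa, ref_gtf, threads=8,
                   star_index="star_index", salmon_index="salmon_index"):
    """Return the pipeline as a list of argument lists (hypothetical helper)."""
    # Build the STAR genome index, using the known annotation for splice junctions.
    cmds = [["STAR", "--runMode", "genomeGenerate", "--runThreadN", str(threads),
             "--genomeDir", star_index, "--genomeFastaFiles", genome_fa,
             "--sjdbGTFfile", ref_gtf]]
    gtfs = []
    for name, (r1, r2) in samples.items():
        # Align each sample; STAR writes <prefix>Aligned.sortedByCoord.out.bam.
        cmds.append(["STAR", "--runThreadN", str(threads),
                     "--genomeDir", star_index,
                     "--readFilesIn", r1, r2,
                     "--readFilesCommand", "zcat",
                     "--outSAMtype", "BAM", "SortedByCoordinate",
                     "--outSAMstrandField", "intronMotif",
                     "--outFileNamePrefix", f"{name}."])
        bam = f"{name}.Aligned.sortedByCoord.out.bam"
        gtf = f"{name}.gtf"
        # Assemble transcripts per sample, guided by the reference annotation.
        cmds.append(["stringtie", bam, "-G", ref_gtf, "-p", str(threads), "-o", gtf])
        gtfs.append(gtf)
    # Merge per-sample assemblies into one experiment-wide GTF.
    cmds.append(["stringtie", "--merge", "-G", ref_gtf, "-o", "merged.gtf", *gtfs])
    # Extract transcript sequences from the genome using the merged GTF.
    cmds.append(["gffread", "-w", "transcripts.fa", "-g", genome_fa, "merged.gtf"])
    # Index the new transcriptome and quantify each sample against it.
    cmds.append(["salmon", "index", "-t", "transcripts.fa", "-i", salmon_index])
    for name, (r1, r2) in samples.items():
        cmds.append(["salmon", "quant", "-i", salmon_index, "-l", "A",
                     "-1", r1, "-2", r2, "-p", str(threads),
                     "-o", f"{name}_salmon"])
    return cmds

# Made-up sample names and paths.
samples = {"s1": ("s1_R1.fastq.gz", "s1_R2.fastq.gz"),
           "s2": ("s2_R1.fastq.gz", "s2_R2.fastq.gz")}
commands = build_commands(samples, "genome.fa", "annotation.gtf")
for cmd in commands:
    print(shlex.join(cmd))  # dry run: print only
```

Keeping the commands as lists means they could later be passed to `subprocess.run(cmd, check=True)` one at a time, once the paths and options have been adapted to the real data.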

Note2: I like Taco for refining my novel set of transcripts to find possible protein coding subsets, but that has some drawbacks

Edit: Cufflinks is not deprecated, I should have made that clear

score 2 · Answer 2 · 2019-01-08

Batch effects are typically found via MDS or PCA plots and (hierarchical) clustering, where samples cluster differently than you would expect.
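
As a rough sketch of the PCA check, assuming you already have a samples-by-genes matrix of log-transformed expression values, something like the following Python/NumPy code shows the idea. The data here are simulated with a deliberately strong batch shift, and `pca_scores` is a hypothetical helper, not part of any RNA-seq package. In practice you would plot the scores and colour them by every known variable (batch, condition, library prep date and so on).

```python
import numpy as np

def pca_scores(log_expr, n_components=2):
    """Return per-sample PC scores and the fraction of variance each PC explains.

    log_expr: array of shape (n_samples, n_genes), already log-transformed.
    """
    centered = log_expr - log_expr.mean(axis=0)  # centre each gene
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)
    return scores, explained[:n_components]

# Simulated data: 8 samples x 200 genes, two conditions, two batches.
rng = np.random.default_rng(0)
condition = np.array([0, 0, 1, 1, 0, 0, 1, 1])
batch = np.array([0, 0, 0, 0, 1, 1, 1, 1])
expr = rng.normal(5.0, 0.2, size=(8, 200))
expr[:, :100] += 2.0 * batch[:, None]          # strong batch shift in 100 genes
expr[:, 100:120] += 1.0 * condition[:, None]   # weaker biological signal in 20 genes

scores, explained = pca_scores(expr)
pc1 = scores[:, 0]

# If PC1 separates samples by batch rather than by condition, suspect a batch effect.
batch_gap = abs(pc1[batch == 0].mean() - pc1[batch == 1].mean())
condition_gap = abs(pc1[condition == 0].mean() - pc1[condition == 1].mean())
print(f"PC1 explains {explained[0]:.0%} of the variance")
print(f"PC1 gap between batches: {batch_gap:.2f}, between conditions: {condition_gap:.2f}")
```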
To my knowledge, batch effects cannot be corrected for within Cufflinks/Cuffdiff, so you would need to re-quantify the merged GTF file with a tool such as Kallisto or Salmon (I have written about RNAseq quantification choices, including all appropriate links, here) and then do your analysis with another DE tool such as DESeq2 or edgeR, as this tutorial describes.

Please note:

If you use Cufflinks/Cuffdiff you need to have a very good reason (see quantification discussed here (again))
Isoform level analysis such as the one you have with Cufflinks also allows you to do analysis of e.g. isoform switches - something my R package IsoformSwitchAnalyzeR can help you with. You can find examples of what type of analysis you can do in this section of the vignette.