batch effect in RNAseq analysis using tophat cufflinks pipeline
2
0
Entering edit mode
3.5 years ago
yff • 0

Hi everyone,

I am using tophat and cufflinks pipeline to do RNA seq analysis on my data. I am new to RNA seq. I guess there might be some batch effect in my RNa seq data. I am not sure how can I detect and correct for batch effect? Do you have any idea about this?

Thanks,

RNA-Seq batch effect tophat cufflink • 1.5k views
0
Entering edit mode
2
Entering edit mode
3.5 years ago

Tophat is deprecated, avoid if possible.

Firstly, visualise your problem with a PCA - this will give you clues as to where your primary variation is coming from, and if a batch effect is clearly present. Try one of the following methodologies:

Stringtie and DESeq2

Stringtie is an alternative to Cufflinks and has a mode, and script called prepde.py that will prepare Stringtie quantifications for analysis in DESeq2 (another popular analysis framework that can deal with batch effects. See Quick Start section for a note on batch effects.

Quantify Known Annotation:

1. Install Salmon (I'd recommend via Bioconda)
2. Create a Salmon Index from the transcripts GTF (a human example is here)
3. Quantify your fastq with Salmon
4. Using RStudio, import your counts using the tximport package
5. Voom transform your data using Limma (see section 15, page 69 here)
6. Visualise your data using PCA and check known variables
7. Account for your batch effect using an additive model design (see chapter 9, page 40)

Novel Transcript Discovery (Gene Level Analysis)

Note: There is a lot of noise when performing this, so be prepared to implement several heuristic filters on top to get plausible transcripts

1. Install Stringtie (available in Bioconda here) (or Cufflinks)
2. Install STAR (available on Bioconda here)
3. Build a STAR index (see documentation here)
4. Align each sample using STAR
5. Run Stringtie on each sample (results in a GTF file per sample) (or Cufflinks)
6. Run Stringtie Merge on the GTF files to get an experiment wide GTF
7. Using your input Fasta file and newly merged GTF, extract the transcript sequences. The easiest way is gffread, which is now built in to stringtie.
8. Build a Salmon index from your new transcripts Fasta file
9. Quantify your fastq with Salmon
10. Follow step 4 onwards from above.

Note2: I like Taco for refining my novel set of transcripts to find possible protein coding subsets, but that has some drawbacks

0
Entering edit mode

I thought it was only Tophat and not Cufflinks/Cuffdiff that was deprecated? Cole Trapnell (the first author of Cufflinks still uses it frequently). You can simply sup TopHat for Hisat.

1
Entering edit mode

That's fair, Cufflinks (Faux Tiling) and StringTie (Network Flow) use different methodologies, so you're right that TopHat can be substituted with HISAT2 or STAR. Bit of an oversight on my part.

2
Entering edit mode
3.5 years ago
1. Batch effects are typically found via MDS or PCA plots and (hierarchical) clustering where samples clusters differently than you would expect.
2. Batch effects can (to my knowledge) not be corrected for within Cufflinks/CuffDiff so you would need to re-quantify the merged gtf file with tools such as Kallisto or Salmon (I have written about RNAseq quantification choices including all appropriate links here) and then do your analysis with another DE tool such as DESeq2 or edgeR such as this tutorial describes.

1. If you use Cufflinks/Cuffdiff you need to have a very good reason (see quantification discussed here (again))
2. Isoform level analysis such as the one you have with Cufflinks also allows you to do analysis of e.g. isoform switches - something my R package IsoformSwitchAnalyzeR can help you with. You can find examples of what type of analysis you can do in this section of the vignette.