Using TopHat/Bowtie, I'm producing an RNA expression (FPKM) plot of a paralog in three different samples. The problem is that the first sample (blue bar in the image) has a huge error bar (over 12 FPKM). I'm trying to figure out why it is so large and how, if possible, to reduce it.
I've been trying many different TopHat parameters to see if there is any change, e.g. --b2-sensitive, --transcriptome-only, --no-novel-juncs, etc. Does anyone have suggestions for which parameters might help (scoring or alignment options)?
Here is my pipeline. I'm then using the R library cummeRbund to produce the expression bar plot.
bowtie2-build quiver.fa quiver
tophat -p 8 -G quiver.gtf -o tophat_out_1 quiver PK_RNA.fq
tophat -p 8 -G quiver.gtf -o tophat_out_2 quiver PK_flower_RNA.fq
tophat -p 8 -G quiver.gtf -o tophat_out_3 quiver PK_root_RNA.fq
cufflinks -p 8 -o cufflinks_out_1 tophat_out_1/accepted_hits.bam
cufflinks -p 8 -o cufflinks_out_2 tophat_out_2/accepted_hits.bam
cufflinks -p 8 -o cufflinks_out_3 tophat_out_3/accepted_hits.bam
cuffmerge -g quiver.gtf -s quiver.fa -p 8 assemblies.txt
cuffdiff -o diff_out -b quiver.fa -p 8 -L pk_1,pk_2,pk_3 -u merged_asm/merged.gtf tophat_out_1/accepted_hits.bam tophat_out_2/accepted_hits.bam tophat_out_3/accepted_hits.bam
Thanks for your help!
Not the answer you are looking for, but you should know that the old 'Tuxedo' pipeline of TopHat and Cufflinks is no longer the advisable tool for RNA-seq analysis. The software is deprecated or in low maintenance and should be replaced by HISAT2, StringTie and Ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. (If you can't get access to that publication, let me know and I'll -cough- help you.) There are also other alternatives, including alignment with STAR or bbmap, or pseudo-alignment using salmon.
In Cufflinks, you may be able to control this with the --max-bundle-frags command line parameter. However, as Wouter mentions, TopHat/Cufflinks are in the past and one should move to HISAT2/StringTie.
In addition, FPKM generally does not deal very well with high counts and produces extraordinary fold change values as a result. It also does not normalise across samples, so it is unreliable to statistically compare samples on FPKM-normalised counts. Unless you're analysing just a single sample, I would recommend geometric normalisation, if that's available in StringTie.
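To make the normalisation point concrete, here is a rough Python sketch comparing plain FPKM with a median-of-ratios ("geometric") size factor. The gene names, counts and lengths are made-up toy values, and the helpers `fpkm` and `size_factors` are hypothetical names for illustration only. They are not part of Cufflinks or StringTie, and this is not claimed to be exactly what those tools compute internally.

```python
import math
from statistics import median


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())
    return {g: c * 1e9 / (lengths_bp[g] * total) for g, c in counts.items()}


def size_factors(samples):
    """Median-of-ratios size factors, one per sample.

    Each gene's count is compared to its geometric mean across samples;
    the per-sample median of those ratios is the size factor.
    Genes with a zero count in any sample are skipped.
    """
    genes = [g for g in samples[0] if all(s[g] > 0 for s in samples)]
    log_geo_mean = {
        g: sum(math.log(s[g]) for s in samples) / len(samples) for g in genes
    }
    return [
        math.exp(median(math.log(s[g]) - log_geo_mean[g] for g in genes))
        for s in samples
    ]


# Toy data (assumed): identical counts except one extremely expressed gene.
lengths = {"geneA": 1000, "geneB": 1000, "geneC": 1000, "geneD": 1000}
sample_1 = {"geneA": 100, "geneB": 200, "geneC": 300, "geneD": 10}
sample_2 = {"geneA": 100, "geneB": 200, "geneC": 300, "geneD": 10000}

fpkm_1 = fpkm(sample_1, lengths)
fpkm_2 = fpkm(sample_2, lengths)
factors = size_factors([sample_1, sample_2])

print("geneA FPKM, sample 1:", round(fpkm_1["geneA"], 1))
print("geneA FPKM, sample 2:", round(fpkm_2["geneA"], 1))
print("size factors:", [round(f, 3) for f in factors])
```

In this toy case geneA has the same raw count in both samples, yet its FPKM is much lower in the second sample because one highly expressed gene inflates the total fragment count. The median-based size factors are barely affected by that single gene, which is the reason for preferring this kind of normalisation when comparing samples.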
Did you look at the extreme expression values before analysis?