Question: tophat-cufflinks command pipeline
gravatar for biolab
5.3 years ago by
biolab1.2k wrote:

Hi, everyone,

I am a newhand working on RNA-seq data.  Tophat-cufflinks method is widely used in analyzing RNA-seq data.  I have installed these tools, and now need to run them.  From tophat and cufflinks websites, I learned some basics as below, but I am not sure if my commands are right.   I am wondering could anyone offers a pipeline or gives me a guidance, particularly the command options?  THANKS a lot!

step1: generate a tophat_out folder with bam files

tophat  <index>  sample1_1.fq  sample1_2.fq    

tophat  <index>  sample2_1.fq  sample2_2.fq

step2: generate .gtf files

cufflinks sample1/accepted_hits.bam     

cufflinks sample2/accepted_hits.bam

step3: prepare a text file named assemblies.txt with following gtf files



step4: run cuffmerge to generate merged.gtf

cuffmerge assemblies.txt

step5: discovery of novel transcript

cuffcompare merged.gtf

step6: compare gene expressions of two samples

cuffdiff merged.gtf  sample1/accepted_hits.bam  sample2/accepted_hits.bam



cufflinks tophat • 22k views
ADD COMMENTlink modified 3.8 years ago by pjmaguire380 • written 5.3 years ago by biolab1.2k

HERE IS MY PIPELINE FOR ucsc.hg19 reference genome and ucsc refgene gtf file:

./tophat2 -o tp_out --library-type fr-unstranded  -r 260 --fusion-search -G /home/refgene.gtf -p8 --b2-very-sensitive /home/ucsc.hg19.fasta  /home/R1_001.fastq /home/R2_001.fastq --rg-id Sample --rg-sample sample --rg-library rna-seq --rg-platform Illumina

./cufflinks -p 8 --library-type fr-unstranded -o cufflinks_out_gtf -G /home/refgene.gtf /home/tophat-/accepted_hits.bam

This I repeat for all samples.

./cuffmerge -o all_samples_merged -s /home/ucsc.hg19.fasta assemblies.txt -g /home/refgene.gtf -p 8 assemblies.txt

And finally cuffdiff.

./cuffdiff -o all_samples_diff -L 1,2,3,4,5,6,7,8,9,10 -b /home/ucsc.hg19.fasta -u all_samples_merged/merged.gtf -p 8 --library-type fr-unstranded -c 5 PATH/TO/ALL/MY/BAMS

And finally I run R - CummeRbund and have issue explained above . Sorry maybe for stupid questions.


ADD REPLYlink written 5.3 years ago by Paul1.4k

I'm trying to accomplish the same task, but I'm new at this field, so, how do I get the refgene gtf file and the ucsc.hg19 reference genome? 

ADD REPLYlink written 4.9 years ago by Ana20

What organism?

ADD REPLYlink written 4.9 years ago by Parham1.4k

For homo sapiens too.

ADD REPLYlink written 4.9 years ago by Ana20

Have you done a bit of Googling?!

ADD REPLYlink written 4.9 years ago by Parham1.4k

Yes, I did, but I was unable to find the .gtf file and also, at the beginning I thought it would be some specific kind of file other then the "normal" reference genome of human. Now I understood what it takes, thank you, and I'm sorry for my lack of knowledge at this point. 

ADD REPLYlink written 4.9 years ago by Ana20

No worries. Hope it helped =)

ADD REPLYlink written 4.9 years ago by Parham1.4k

You should really add some information on your data, in order to give proper guidance on what you need for your analysis.

ADD REPLYlink written 5.3 years ago by biogirl190
gravatar for pjmaguire3
3.8 years ago by
pjmaguire380 wrote:

I wouldn't recommend using TopHat/Cufflinks. While they used to be the go-to option for RNA-Seq analysis, now a days there are much better alternatives out there, particularly STAR. Conveniently I just made a figure showing a short overview of my protocol, so I linked that below. Do note that RSEM might be a better option for getting read counts, but it does require that you run your STAR mapping through its wrapper (which is why I went with a more general approach).


  1. Check quality of reads with FastQC. If they are good then proceed, but if the quality is low consider trimming the ends and then re-evaluating.
  2. Using your genome and gene annotation file, create a reference genome using STAR.
  3. Map your reads against the annotated reference genome (output to BAM to avoid needing to convert things later).
  4. Use featureCounts to get the read counts.
  5. Do a TMM transformation on the read data using Limma (this is far superior to FPKMs in normalizing the data, especially if the main focus is to identify differentially expressed genes).
  6. Use Limma to calculate the differentially expressed genes.

enter image description here

Hope that helps you, and feel free to shoot me some questions. Best of luck!

ADD COMMENTlink written 3.8 years ago by pjmaguire380


Could you please share your opinion about HISAT-Stringtie2-Ballgown pipeline? I am going to follow this pipeline and still in a little confusion regarding how it performs compared to others.


ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by pbigbig200

As far as I am aware, there has not been a published study done comparing HISAT2 to STAR (HISAT came out in 2015, so it is relatively new overall), so I can't really appropriately compare them in a non-biased manner. But I imagine that HISAT2 wouldn't perform as well when it comes to sliced alignments since it uses the BWA algorithm to map reads. One of the things that is nice about HISAT2 is that it has a much smaller footprint than STAR, so if you're doing a lot of RNA-Seq analytics it could be very beneficial to decreased the required storage space for each sample run. HISAT2 also reports that it is slightly faster than STAR, but I would take that with a grain of salt since the author's were the ones who selected what datsets to use when bench marking the different mappers. Moral of the story, I still recommend STAR as the first choice but I consider HISAT2 to be a reasonably good second place holder. As for the rest of the pipeline I have no idea, I have never even heard of those two(2) software suites so it wouldn't be fair for me to comment one way or the other on their abilities.

This is a good read, if you haven't gone through it yet: Current state-of-the-art for RNA-Seq gene / transcript expression quantification OR just the tools of preference

I also recommend you read this paper if you are interested in comparing the different read normalization methods: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis.

ADD REPLYlink written 3.8 years ago by pjmaguire380

Thank you very much for your detail explanation (and your recommended article)!

ADD REPLYlink written 3.8 years ago by pbigbig200

Hi, It's an old thread but I found a paper talking about this recently: Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. In its conclution part, it said "Importantly, our analysis indicates that the explicit quantification of transcript isoforms leads to more accurate estimates of gene expression levels compared to the ‘count-based’ methods that are broadly used currently". But they evaluated the count-based method with "bedtools intersect", not htseq-count or featureCounts, so I am a bit skeptical of this conclusion. What do you think about this?


ADD REPLYlink written 2.2 years ago by niuyw30
gravatar for Parham
5.3 years ago by
Parham1.4k wrote:

Have you seen this protocol? It is very helpful.

ADD COMMENTlink written 5.3 years ago by Parham1.4k

Thank you very much, Paul and Parham.

ADD REPLYlink written 5.3 years ago by biolab1.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1087 users visited in the last hour