Question: RNA-Seq Pipeline
brentp (22k), Salt Lake City, UT, wrote 7.0 years ago:

So, there are papers on designing an RNA-seq experiment and on normalizing the data (Bullard et al. and the recent Genetics paper are good reads), but what do folks do for the actual pipeline?

I'm looking at

  1. filter on quality. (what are your quality/parameter cutoffs?)
  2. any other pre-processing?
  3. tophat
  4. cufflinks
  5. repeat 1-4 for a different set of reads and find differentially expressed genes (cuffdiff)

First, any steps I should add?

Second, there doesn't seem to be much written about how to do this. I can read the manuals and execute the commands (steps 3 and 4 seem to be no problem), but I'm looking for any pointers to either:

  1. fully documented pipelines with an explanation of the processing at each step
  2. shell script(s) for going from reads to differentially expressed genes.
  3. pubs where this is documented.

I realize each set of data will be different, but it'd be nice to base it on something.
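As a starting point, steps 3-5 above can be sketched as a small Python script that only builds the command lines, so they can be inspected before anything is run. This is a minimal sketch under stated assumptions, not an established pipeline: the index name, GTF file, FASTQ names and output layout are hypothetical, it assumes a TopHat version that writes `accepted_hits.bam`, and it passes a reference annotation straight to cuffdiff instead of merging per-sample assemblies first.

```python
import shlex
from pathlib import Path


def tophat_cmd(index, reads, out_dir, threads=4):
    """Step 3: spliced alignment of one read set against a Bowtie index."""
    return ["tophat", "-p", str(threads), "-o", str(out_dir), index, str(reads)]


def cufflinks_cmd(bam, out_dir, threads=4):
    """Step 4: transcript assembly and abundance estimation for one sample."""
    return ["cufflinks", "-p", str(threads), "-o", str(out_dir), str(bam)]


def cuffdiff_cmd(gtf, bams_by_label, out_dir, threads=4):
    """Step 5: differential expression between the labelled samples."""
    labels = ",".join(bams_by_label)
    bams = [str(bam) for bam in bams_by_label.values()]
    return ["cuffdiff", "-p", str(threads), "-o", str(out_dir),
            "-L", labels, str(gtf)] + bams


def build_pipeline(index, gtf, reads_by_label, work_dir="rnaseq_out"):
    """Return the ordered list of commands for all samples (nothing is run)."""
    work = Path(work_dir)
    commands = []
    bams = {}
    for label, reads in reads_by_label.items():
        align_dir = work / label / "tophat"
        commands.append(tophat_cmd(index, reads, align_dir))
        # Assumption: TopHat writes its alignments to accepted_hits.bam.
        bams[label] = align_dir / "accepted_hits.bam"
        commands.append(cufflinks_cmd(bams[label], work / label / "cufflinks"))
    commands.append(cuffdiff_cmd(gtf, bams, work / "cuffdiff"))
    return commands


if __name__ == "__main__":
    # Hypothetical inputs; replace with real paths.
    plan = build_pipeline(
        index="genome_index",
        gtf="genes.gtf",
        reads_by_label={"control": "control.fastq", "treated": "treated.fastq"},
    )
    for cmd in plan:
        print(shlex.join(cmd))  # requires Python 3.8+
```

Each command could be handed to `subprocess.run(cmd, check=True)` once the printed plan looks right; keeping construction and execution separate makes it easy to log exactly what was run for each data set.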

modified 20 months ago by Malachi Griffith (15k) • written 7.0 years ago by brentp (22k)
Dstan (140), Provo, Utah, USA, wrote 7.0 years ago:

We're getting ready to publish a study in which we use RNA-seq, and we used a piece of software called GNUMAP. We did not apply any filtering on the read qualities, as we found that lower-quality reads simply didn't map as well. As for the post-mapping analysis, we're still waiting to hear back from our statistics colleagues on the model they've developed.

As for an out-of-the-box solution for RNA-seq, I'm not sure how much you'll be able to find.

written 7.0 years ago by Dstan (140)

hadn't heard of GNUMAP, checking it out now. i'm not expecting an out-of-the-box solution, just trying to make use of existing knowledge.

written 7.0 years ago by brentp (22k)
Wjeck (480), Chapel Hill, NC, wrote 7.0 years ago:

No idea where these steps exist as a well-documented whole, but I can pass on our experience. We're doing a pretty massive amount of RNA-seq at our institution as part of The Cancer Genome Atlas, and our methods are along the lines you describe.

Bowtie/TopHat has been our best bet for spliced sequence alignment. I know the group working on this also tried mapping onto a reference "transcriptome", which has some advantages in terms of mapping but can be harder to deconvolute in cases where transcripts overlap.

written 7.0 years ago by Wjeck (480)

thanks, at least it's good to know you decided on a similar overall pipeline after looking around.

written 7.0 years ago by brentp (22k)
Michael Dondrup (41k), Bergen, Norway, wrote 6.9 years ago:

I think one important step that is missing here could be

  1. remove/condense (100%?) identical reads into one read

in the filtering step. A large number of reads could be artifacts, e.g. from a PCR step in the wet-lab pipeline. Collapsing can be done with, for example, the FASTA/Q Collapser (fastx_collapser) from the FASTX-Toolkit. For a quantitative approach I would prefer this, but I guess it's controversial. Any experiences with that?
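To make the collapsing idea concrete, here is a minimal Python sketch that merges 100% identical read sequences and keeps the multiplicity of each. It is an illustration of the technique only, not a reimplementation of any particular tool; the function names and the `>rank-count` header style are assumptions of this sketch.

```python
from collections import Counter


def collapse_identical(sequences):
    """Collapse exact duplicate sequences.

    Returns (sequence, count) pairs, most abundant first; sequences with
    equal counts keep the order in which they were first seen.
    """
    return Counter(sequences).most_common()


def to_fasta(collapsed):
    """Format collapsed reads as FASTA, storing the count in the header."""
    lines = []
    for rank, (seq, count) in enumerate(collapsed, start=1):
        lines.append(f">{rank}-{count}")  # hypothetical header: rank-count
        lines.append(seq)
    return "\n".join(lines)


if __name__ == "__main__":
    reads = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]
    print(to_fasta(collapse_identical(reads)))
```

Keeping the count in the header means the original multiplicity is not lost, so the decision to count a collapsed read once or by its multiplicity can be deferred to the quantification step.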

Another filtering step is to clip the reads, removing low-quality regions, rather than only discarding whole reads.
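A minimal Python sketch of such clipping, assuming FASTQ-style quality strings: it trims bases from the 3' end while their Phred score is below a cutoff. The cutoff of 20, the minimum length of 20 and the default Phred+33 offset are illustrative assumptions (older Illumina data may need `offset=64`), and the function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def clip_3prime(seq, quality, min_q=20, offset=33):
    """Trim bases from the 3' end while their quality is below min_q."""
    scores = phred_scores(quality, offset)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


def keep_read(seq, min_len=20):
    """Decide whether a clipped read is still long enough to map."""
    return len(seq) >= min_len


if __name__ == "__main__":
    # 'I' encodes Q40 and '#' encodes Q2 under the Phred+33 assumption.
    seq, qual = clip_3prime("ACGTACGT", "IIIII###")
    print(seq, qual, keep_read(seq, min_len=4))
```

This only removes a low-quality tail; a low-quality base in the middle of an otherwise good read is left alone, which is a deliberate simplification.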

modified 6.9 years ago • written 6.9 years ago by Michael Dondrup (41k)

My understanding is that removing identical reads is a step that is typical for DNA analysis, but more controversial when it comes to RNA-Seq because the rationale for it is less clear here (are we only removing PCR artifacts, or also introducing a quantitative bias?).

written 4.0 years ago by jobinv (1.1k)

Note that I wrote this almost 3 years ago. Now I wouldn't do it anymore for a differential analysis, on the argument that, on average, PCR artifacts should affect both conditions equally. That's possibly still controversial.

written 4.0 years ago by Michael Dondrup (41k)

I'll admit that I didn't see the date of the original answer :)

written 4.0 years ago by jobinv (1.1k)
Malachi Griffith (15k), Washington University School of Medicine, St. Louis, USA, wrote 20 months ago:

We make available open-access RNA-seq tutorials that cover cloud computing, tool installation, relevant file formats, reference genomes, transcriptome annotations, quality-control strategies, expression, differential expression, and alternative splicing analysis methods. These tutorials and additional training resources are accompanied by complete analysis pipelines and test datasets made available without encumbrance at

This material was released alongside this publication:

Malachi Griffith*, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith*. 2015. Informatics for RNA-seq: A web resource for analysis on the cloud. 11(8):e1004393.

The Supplementary Information for this publication includes an extensive review of RNA-seq wet lab and analysis concepts, existing tools, common questions, etc.

All materials associated with this publication, including high resolution and original figure files, supplementary tables, etc. are available here:

This publication was inspired by workshops that we have taught at CBW, CSHL, and NYGC over the last few years.  These workshops are ongoing and we hope to maintain and expand the content in the coming years.

written 20 months ago by Malachi Griffith (15k)
wadunn83 (90) wrote 3.9 years ago:

For anyone still interested in this type of thing:

If using TopHat/Cufflinks, the authors generally do not recommend removing poor-quality reads, since their process will simply down-weight the alignments of poor-quality reads, and sometimes those reads can actually help.

As for 3-5:

I have recently written a pipeline called Blacktie to do just this, plus do some automated analysis with cummeRbund.

The project repo is at: The documentation: Bug Tracking and feature requests:

installation via pip:

[sudo] pip install -U blacktie
written 3.9 years ago by wadunn83 (90)

Could you give a source for the top statement about pre-filtering reads for tophat? I've been trying to learn about this topic and haven't found a whole lot honestly.

written 3.5 years ago by kipp40
Biojl (1.4k) wrote 4.0 years ago:

You may want to take a look at The Simple Fool’s Guide to Population Genomics via RNA-Seq, developed at the Palumbi lab. It's a functional, fully documented pipeline that starts from scratch.

Edit, PS: OK, yes, I didn't see that this post was from 3 years ago.

modified 4.0 years ago • written 4.0 years ago by Biojl (1.4k)
xiangwulu (0) wrote 3.9 years ago:
  1. FastQC can be used for quality control
  2. adaptors may need to be removed before alignment, in case a long adaptor affects the alignment result
  3. & 4. other aligners (BWA, Bowtie, BFAST) may be worth a look, depending on the length of the reads
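To illustrate what adaptor removal in point 2 does, here is a minimal Python sketch that cuts a 3' adaptor from a read, including the case where only the start of the adaptor made it into the read. It matches exactly, with no tolerance for mismatches or indels, so it is a teaching sketch and not a substitute for dedicated trimming software; the function name, the `min_overlap` parameter and the adaptor sequence in the usage lines are assumptions.

```python
def trim_adaptor(read, adaptor, min_overlap=3):
    """Remove a 3' adaptor from a read using exact matching only.

    If the full adaptor occurs in the read, everything from its first
    occurrence onwards is dropped. Otherwise the longest adaptor prefix
    (at least min_overlap bases) that ends the read is dropped.
    """
    pos = read.find(adaptor)
    if pos != -1:
        return read[:pos]

    # Partial adaptor at the 3' end: try the longest prefix first.
    longest = min(len(adaptor) - 1, len(read))
    shortest = max(min_overlap, 1)
    for k in range(longest, shortest - 1, -1):
        if read.endswith(adaptor[:k]):
            return read[:-k]
    return read


if __name__ == "__main__":
    adaptor = "AGATCGGAAGAGC"  # example sequence; use the one from your library prep
    print(trim_adaptor("ACGTACGTAGATCGGAAGAGCTT", adaptor))  # full adaptor inside the read
    print(trim_adaptor("ACGTACGTAGATC", adaptor))            # partial adaptor at the 3' end
    print(trim_adaptor("ACGTACGT", adaptor))                 # no adaptor present
```

The `min_overlap` guard is a design choice: without it, a read that happens to end in the first base or two of the adaptor would be shortened by chance.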

list of alignment software:

list of adaptor removal software:

written 3.9 years ago by xiangwulu (0)
Czh3 (140) wrote 2.2 years ago:

This pipeline can help you do quality control, adapter trimming, mapping, transcript assembly, differential expression detection...

try this:

modified 2.2 years ago • written 2.2 years ago by Czh3 (140)