Question: RNA-Seq Pipeline
brentp (23k), Salt Lake City, UT, wrote 9.3 years ago:

So, there are papers on designing an RNA-seq experiment and on normalizing the data (Bullard et al. and the recent Genetics paper are good reads), but what do folks do for the actual pipeline?

I'm looking at

  1. filter on quality. (what are your quality/parameter cutoffs?)
  2. any other pre-processing?
  3. tophat
  4. cufflinks
  5. repeat 1-4 for a different set of reads and find differentially expressed genes (cuffdiff)

First, any steps I should add?

Second, there doesn't seem to be much about how to do this. I mean, I can read the manuals and execute the commands (steps 3 and 4 seem no problem), but I'm looking for any pointers to either:

  1. fully documented pipelines with an explanation of the processing at each step
  2. shell script(s) for going from reads to differentially expressed genes.
  3. pubs where this is documented.

I realize each set of data will be different, but it'd be nice to base it on something.
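To make steps 3-5 above concrete, here is a minimal sketch that only builds the command lines for TopHat, Cufflinks and Cuffdiff without running anything. The function names, paths, sample labels and directory layout are hypothetical. It assumes a TopHat version that writes accepted_hits.bam into its output directory, and it takes a shortcut by quantifying against a reference GTF rather than merging per-sample assemblies first.

```python
import shlex
from pathlib import Path


def build_pipeline(index, gtf, samples, out_root="rnaseq_out", threads=4):
    """Build (but do not run) TopHat -> Cufflinks -> Cuffdiff commands.

    index   : Bowtie index basename (hypothetical path)
    gtf     : reference annotation in GTF format (hypothetical path)
    samples : dict mapping a condition label to a list of FASTQ files
    """
    commands = []
    bams = {}
    for label, fastqs in samples.items():
        bams[label] = []
        for fq in fastqs:
            name = Path(fq).name.split(".")[0]
            tophat_dir = f"{out_root}/{label}/{name}/tophat"
            cufflinks_dir = f"{out_root}/{label}/{name}/cufflinks"
            # Step 3: spliced alignment of one read file against the index.
            commands.append(["tophat", "-p", str(threads), "-G", gtf,
                             "-o", tophat_dir, index, fq])
            # Assumption: this TopHat version writes accepted_hits.bam here.
            bam = f"{tophat_dir}/accepted_hits.bam"
            # Step 4: quantify transcripts from the alignments.
            commands.append(["cufflinks", "-p", str(threads), "-G", gtf,
                             "-o", cufflinks_dir, bam])
            bams[label].append(bam)
    # Step 5: one cuffdiff call; replicates of a condition are comma-joined.
    labels = list(samples)
    cuffdiff = ["cuffdiff", "-p", str(threads), "-o", f"{out_root}/cuffdiff",
                "-L", ",".join(labels), gtf]
    cuffdiff += [",".join(bams[label]) for label in labels]
    commands.append(cuffdiff)
    return commands


def as_shell(commands):
    """Render the command lists as a shell script body, one command per line."""
    return "\n".join(" ".join(shlex.quote(arg) for arg in cmd) for cmd in commands)
```

Printing as_shell(build_pipeline(...)) gives something you can paste into a shell script and adjust; check every flag against the manual for the versions you actually have installed.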

Modified 4.0 years ago by Malachi Griffith (17k); written 9.3 years ago by brentp (23k).
Dstan (160), Provo, Utah, USA, wrote 9.3 years ago:

We're getting ready to publish a study in which we use RNA-seq, and we used a piece of software called GNUMAP. We did not apply any filtering on read quality, as we found that lower-quality reads simply didn't map as well. As for the post-mapping analysis, we're still waiting to hear back from our statistics colleagues on the model they've developed.

As for an out-of-the-box solution for RNA-seq, I'm not sure how much you'll be able to find.


hadn't heard of GNUMAP, checking it out now. i'm not expecting an out-of-the-box solution, just trying to make use of existing knowledge.

Reply by brentp (23k), 9.3 years ago.
Wjeck (480), Chapel Hill, NC, wrote 9.3 years ago:

No idea where these steps exist as a well-documented whole, but I can pass on our experience. We're doing a pretty massive amount of RNA-seq at our institution as part of The Cancer Genome Atlas, and our methods are along the lines you describe.

Bowtie/TopHat has been our best bet for spliced sequence alignment. I know the group working on this also tried mapping onto a reference "transcriptome", which has some advantages for mapping but can be harder to deconvolute in cases where transcripts overlap.


thanks, at least it's good to know you decided on a similar overall pipeline after looking around.

Reply by brentp (23k), 9.3 years ago.
Michael Dondrup (46k), Bergen, Norway, wrote 9.2 years ago:

I think one important step that is missing here could be:

  1. remove/condense (100%?) identical reads into one read

in the filtering step. A large number of reads could be, e.g., artifacts from a PCR step in the wet-lab pipeline. This can be done, e.g., with the FASTA/Q Collapser (fastx_collapser) from the FASTX-Toolkit. For a quantitative approach I would prefer this, but I guess it's controversial. Any experiences with that?
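For illustration, collapsing 100% identical reads boils down to counting each distinct sequence. This is a rough pure-Python sketch of the idea, not the FASTX implementation. The function names and the "rank-count" header layout are my own choices for the example, and it assumes plain 4-line FASTQ records.

```python
from collections import Counter


def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def collapse_identical(sequences):
    """Return (sequence, count) pairs, most abundant first."""
    return Counter(sequences).most_common()


def to_fasta(collapsed):
    """Write collapsed reads as FASTA, with rank and count in the header."""
    out = []
    for rank, (seq, count) in enumerate(collapsed, start=1):
        out.append(f">{rank}-{count}")
        out.append(seq)
    return "\n".join(out)
```

Keeping the count in the header means the multiplicity is not lost, so you can still decide later whether to use it for quantification.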

Another filtering step can be to clip the reads, removing low-quality regions instead of removing only whole reads.
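A minimal sketch of that kind of clipping, trimming bases off the 3' end until one meets a quality threshold. The function name, the threshold of 20, the minimum length of 25 and the Phred+33 encoding are all assumptions for the example, not recommended cutoffs.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Clip low-quality bases from the 3' end of a read.

    seq    : read sequence
    qual   : quality string of the same length (Phred+33 assumed)
    min_q  : illustrative threshold, not a recommendation
    """
    end = len(seq)
    # Walk back from the 3' end while the last kept base is below threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, min_len=25):
    """Decide whether a clipped read is still long enough to map (assumed cutoff)."""
    return len(seq) >= min_len
```

Real trimmers usually use a sliding window or a running-sum criterion rather than this simple stop-at-first-good-base rule, so treat it as the simplest possible version of the idea.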

Modified 9.2 years ago; written 9.2 years ago by Michael Dondrup (46k).

My understanding is that removing identical reads is a step that is typical for DNA analysis, but more controversial when it comes to RNA-Seq because the rationale for it is less clear here (are we only removing PCR artifacts, or also introducing a quantitative bias?).

Reply by jobinv (1.1k), 6.3 years ago.

Note: I wrote this almost 3 years ago. Now I wouldn't do it anymore for a differential analysis, on the argument that PCR artifacts should, on average, affect both conditions equally. That's possibly still controversial.

Reply by Michael Dondrup (46k), 6.3 years ago.

I'll admit that I didn't see the date of the original answer :)

Reply by jobinv (1.1k), 6.3 years ago.
Malachi Griffith (17k), Washington University School of Medicine, St. Louis, USA, wrote 4.0 years ago:

We make available open-access RNA-seq tutorials that cover cloud computing, tool installation, relevant file formats, reference genomes, transcriptome annotations, quality-control strategies, expression, differential expression, and alternative splicing analysis methods. These tutorials and additional training resources are accompanied by complete analysis pipelines and test datasets made available without encumbrance at

This material was released alongside this publication:

Malachi Griffith*, Jason R. Walker, Nicholas C. Spies, Benjamin J. Ainscough, Obi L. Griffith*. 2015. Informatics for RNA-seq: A web resource for analysis on the cloud. 11(8):e1004393.

The Supplementary Information for this publication includes an extensive review of RNA-seq wet lab and analysis concepts, existing tools, common questions, etc.

All materials associated with this publication, including high resolution and original figure files, supplementary tables, etc. are available here:

This publication was inspired by workshops that we have taught at CBW, CSHL, and NYGC over the last few years.  These workshops are ongoing and we hope to maintain and expand the content in the coming years.

wadunn83 (90) wrote 6.3 years ago:

For anyone still interested in this type of thing:

If using TopHat/Cufflinks, the authors generally do not recommend removing poor-quality reads, since their process will simply down-weight the alignments of poor-quality reads, and sometimes those reads can actually help.

As for steps 3-5:

I have recently written a pipeline called Blacktie to do just this, plus do some automated analysis with cummeRbund.

Installation via pip:

[sudo] pip install -U blacktie
Modified 12 months ago by RamRS (24k); written 6.3 years ago by wadunn83 (90).

Could you give a source for the top statement about pre-filtering reads for TopHat? I've been trying to learn about this topic and haven't found a whole lot, honestly.

Reply by kipp (40), 5.8 years ago.
Biojl (1.7k) wrote 6.3 years ago:

You may want to take a look at The Simple Fool’s Guide to Population Genomics via RNA-Seq, from the Palumbi lab. It's a functional, fully documented pipeline from scratch.

Edit, PS: OK, yes, I didn't see that this post was from 3 years ago.

Modified 6.3 years ago; written 6.3 years ago by Biojl (1.7k).
xiangwulu (60) wrote 6.3 years ago:

  1. FastQC could be used for quality control
  2. adaptors may need to be removed before alignment, in case a long adaptor affects the alignment result
  3. & 4. other aligners may be worth a look, depending on the length of the reads (BWA, Bowtie, BFAST)
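As a rough idea of what adaptor removal in point 2 does, here is a toy exact-match 3' adaptor clipper. The function name, the min_overlap default and any adaptor sequence you pass in are assumptions for the example. Real adaptor trimmers also tolerate mismatches, which this sketch does not.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove a 3' adaptor from a read using exact matching only.

    Handles a full adaptor anywhere in the read, or a partial adaptor
    (a prefix of at least min_overlap bases) at the very end of the read.
    """
    # Case 1: the whole adaptor is present; drop it and everything after it.
    idx = read.find(adapter)
    if idx != -1:
        return read[:idx]
    # Case 2: the read runs out partway through the adaptor.
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    # No adaptor found; return the read unchanged.
    return read
```

The min_overlap setting trades sensitivity against losing real bases: with a very small value, reads that just happen to end in the first couple of adaptor bases get clipped too.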

list of alignment software:

list of adaptor removal software:

Czh3 (190) wrote 4.5 years ago:

This pipeline can help you with quality control, adapter trimming, mapping, transcript assembly, and detection of differentially expressed genes.

Try this:

Modified 4.5 years ago; written 4.5 years ago by Czh3 (190).