Question: Is combining fastq files before running tophat the same as combining bam file output from tophat?
colin.kern (United States) wrote, 5.1 years ago:

If we have two RNA-Seq libraries and run tophat on each of them, then combine the resulting bam files and run cufflinks on that, will that produce the exact same result as combining the fastq files before running tophat? I know that it wouldn't be the same to combine the results after cufflinks, since a transcript may not be able to be built from reads in a single library, but combining reads from different libraries would allow it to be assembled. I'm wondering if there is something similar with tophat.

rna-seq tophat • 2.6k views
Sam (New York) wrote, 5.1 years ago:

It depends on what you mean by combining. 

If, for example, you are combining different lanes or runs of the same sample, then the result should be the same whether you merge the bam files or the fastq files. However, you may need to make sure the read group settings identify them as the same sample.

However, if you are trying to combine different samples, then you should not combine them before running tophat unless your main goal is the detection of novel transcripts. The problem is that if you merge the fastq files before the alignment, you will lose the information of origin, e.g. you won't know whether read A is from sample 1 or sample 2.
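One way to keep the origin when fastq files do have to be merged is to write the sample name into each read header first. The sketch below is only an illustration of that idea: the function names and the `sample:read` header convention are assumptions made up for this example, not something tophat requires or defines.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# A FASTQ record as (header, sequence, plus line, qualities).
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records from an iterable of text lines."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(stripped), 4):
        header, seq, plus, qual = stripped[i:i + 4]
        yield header, seq, plus, qual


def tag_records(records: Iterable[FastqRecord], sample: str) -> Iterator[FastqRecord]:
    """Prefix each read name with its sample, e.g. '@readA' -> '@sample1:readA'."""
    for header, seq, plus, qual in records:
        yield "@" + sample + ":" + header[1:], seq, plus, qual


def merge_tagged(samples: Dict[str, Iterable[str]]) -> List[FastqRecord]:
    """Merge reads from several samples while keeping the origin in the read name."""
    merged: List[FastqRecord] = []
    for sample, lines in samples.items():
        merged.extend(tag_records(parse_fastq(lines), sample))
    return merged


# Two hypothetical samples that happen to contain a read with the same name.
sample1 = ["@readA\n", "ACGT\n", "+\n", "IIII\n"]
sample2 = ["@readA\n", "TTGA\n", "+\n", "IIII\n"]
for record in merge_tagged({"sample1": sample1, "sample2": sample2}):
    print(record[0])
```

Without the prefix, the two `readA` records would be indistinguishable after merging, which is exactly the loss of origin described above.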

Now, if you are trying to detect novel transcripts, Trinity does suggest merging the fastq files before the de novo assembly, for the reason you mentioned: a novel transcript might be only partially captured in a single library. So in my experience (which is from 2 years ago and might have changed by now), to detect novel transcripts you merge the fastq files and construct the transcripts from the merged reads, then align the reads of each individual sample back to the novel transcript list to get per-sample alignment information.


We have 8 tissue types from 2 replicates, so a total of 16 samples. We've run Tophat/cufflinks on these samples and are getting ~30 million reads aligned and expression of ~15,000 annotated genes. What we're trying to determine now is if it will be worth getting more reads from these samples, so our idea is to combine the reads of the same tissue types, collapsing it down to 8 "samples", and then redoing the analysis to see if that increases the number of expressed genes we detect. Since tophat takes a while to run, I was wondering if I could use the bam files I've already generated and just combine them, or whether it should be done before. So we are not concerned with losing the information of the origin since we're essentially combining reads from two replicates to create a virtual single sample.
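As a rough sketch of the "combine the existing bam files" route, the snippet below groups per-replicate bam files by tissue and builds one `samtools merge` command per tissue. The file naming scheme (`<tissue>_rep<N>.bam`), the output names and both helper functions are assumptions for illustration only. The commands are built but not executed here.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_tissue(bam_paths: List[str]) -> Dict[str, List[str]]:
    """Group paths named '<tissue>_rep<N>.bam' (hypothetical naming) by tissue."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in sorted(bam_paths):
        filename = path.rsplit("/", 1)[-1]
        tissue = filename.split("_rep", 1)[0]
        groups[tissue].append(path)
    return dict(groups)


def merge_commands(groups: Dict[str, List[str]]) -> List[List[str]]:
    """Build 'samtools merge <out.bam> <in1.bam> <in2.bam> ...' for each tissue."""
    return [
        ["samtools", "merge", f"{tissue}_merged.bam", *paths]
        for tissue, paths in sorted(groups.items())
    ]


# Hypothetical tophat outputs for two of the eight tissues.
bams = ["liver_rep1.bam", "liver_rep2.bam", "lung_rep1.bam", "lung_rep2.bam"]
for command in merge_commands(group_by_tissue(bams)):
    print(" ".join(command))
```

Each command list could be passed to `subprocess.run(command, check=True)` on a machine where samtools is installed.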

written 5.1 years ago by colin.kern

My recommendation would be something simpler. When running cufflinks, you can state the status of each sample, e.g. Case / Control. Instead of giving each tissue + replicate combination its own label, you can simply give all the samples from the same tissue or the same replicates the same label. The reason is that by combining the samples into one dataset, the statistical analysis loses power because you have fewer samples. By giving the samples the same label instead, the statistical tools can take the variation between samples into account and therefore give a better estimate.

As mentioned before, unless you want to detect novel transcripts or transcripts with extremely low expression values, you wouldn't need to worry too much about the read length.
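A toy illustration of the point about statistical power, using made-up expression values for one gene in one tissue: with the two replicates kept separate a between-replicate variance can be estimated, whereas a pooled "virtual sample" leaves a single value and no variance estimate. The `summarize` helper is hypothetical and is not part of cufflinks.

```python
from statistics import mean, variance
from typing import Dict, List, Optional


def summarize(values: List[float]) -> Dict[str, Optional[float]]:
    """Return sample count, mean, and sample variance (None if it cannot be estimated)."""
    return {
        "n": len(values),
        "mean": mean(values),
        # Sample variance needs at least two observations.
        "variance": variance(values) if len(values) > 1 else None,
    }


# Made-up expression values for one gene, two replicates of one tissue.
replicates = [10.0, 14.0]
print(summarize(replicates))

# Pooling the replicates first leaves one value per tissue (here taken as
# the mean, purely for illustration), so no variance can be estimated.
pooled = [mean(replicates)]
print(summarize(pooled))
```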

written 5.1 years ago by Sam