Question: I have demultiplexed files for a single biological replicate. When should I combine them in the pipeline?
Kristin Muench (450), United States, wrote 4.7 years ago:

I have a dataset of 10 distinct biological samples. Each of these samples was sequenced (RNA-Seq) using paired-end reads and 6 barcodes. Thus, for each of the ten biological samples, I have 12 .fastq files, with names like *ATACTC_1, *ATACTC_2, *GTGCTC_1, *GTGCTC_2, and so on.

I would like to follow this pipeline:

  1. Analyze data quality with FastQC
  2. Trim data with Trim Galore!
  3. Align with TopHat2
  4. ??? generate counts, differential expression analysis, etc.

Here is my question: at what point in this pipeline can I (or should I) combine all of the .fastq data? If each sample has 12 files associated with it, at what point do I collapse the 12 files into a single file representing the RNA-Seq data for one biological sample, which I can then analyze for counts in step #4?

My guess is that I combine all of the .fastq files up front (re-multiplex?) with something like cat file1 ... file12. Alternatively, I could run steps #1-2 or steps #1-3 on each file separately and then combine the output of step #3.

Thank you for any help you can provide! This board has already been tremendously helpful to me.

fastqc rna-seq • 3.2k views
modified 4.7 years ago by matted (7.1k) • written 4.7 years ago by Kristin Muench (450)
matted (7.1k), Boston, United States, wrote 4.7 years ago:

I think it's most typical to combine raw reads if they correspond to the same library and original sample (that is, merging technical replicates with respect to the sequencer, but not biological replicates). This paper was one of the early ones to test and validate this assumption.

So for your outlined process, that could be anywhere in steps 1 to 3.  Personally, I would make count tables for all the 6*10 runs separately and then combine the count tables at the very end, before any clustering or differential analysis (so in the middle of your step 4).  This is because all the earlier steps can be performed in parallel, and you might save some time by processing many chunks at once.
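As a rough illustration of combining count tables at the end, here is a minimal sketch that sums per-run counts into one table per biological sample. The run IDs, gene names, and the function name sum_counts_by_sample are all made up for the example; real count tables would come from whichever counting tool is used in step 4.

```python
from collections import defaultdict


def sum_counts_by_sample(run_counts, run_to_sample):
    """Add up gene counts from all runs that belong to the same sample.

    run_counts:    {run_id: {gene: count}}  (one entry per barcoded run)
    run_to_sample: {run_id: sample_id}      (hypothetical mapping)
    """
    totals = defaultdict(lambda: defaultdict(int))
    for run_id, gene_counts in run_counts.items():
        sample = run_to_sample[run_id]
        for gene, count in gene_counts.items():
            totals[sample][gene] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {sample: dict(genes) for sample, genes in totals.items()}


# Toy data: two barcoded runs for sample S1, one run for sample S2.
run_counts = {
    "S1_ATACTC": {"geneA": 10, "geneB": 0},
    "S1_GTGCTC": {"geneA": 12, "geneB": 3},
    "S2_ATACTC": {"geneA": 7, "geneB": 5},
}
run_to_sample = {"S1_ATACTC": "S1", "S1_GTGCTC": "S1", "S2_ATACTC": "S2"}

merged = sum_counts_by_sample(run_counts, run_to_sample)
print(merged)
```

Summing like this only makes sense for raw counts; normalized values would need to be recomputed from the summed raw counts instead.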

And just for completeness, you'll need to pair up the two matching paired-end fastq files (e.g. X_1 and X_2) and give them to the aligner together, rather than concatenating them into one file. You might also need to analyze them together for the adapter trimming, or possibly trim adapters for each read end separately.
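If the raw reads are merged up front instead, the _1 and _2 files have to be concatenated separately and in the same barcode order, so that read pairs stay in sync. Below is a minimal sketch of that bookkeeping. It assumes file names of the form <sample>_<BARCODE>_<1|2>.fastq, which is a guess based on the names in the question, and the helper names plan_merges and concatenate are hypothetical.

```python
import re
import shutil
from collections import defaultdict

# Assumed naming scheme: <sample>_<BARCODE>_<1 or 2>.fastq
NAME = re.compile(r"^(?P<sample>.+)_(?P<barcode>[ACGT]+)_(?P<mate>[12])\.fastq$")


def plan_merges(filenames):
    """Group files by (sample, mate) and sort each group by barcode.

    Sorting both mates by the same key keeps read pairs aligned
    between the merged _1 file and the merged _2 file.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = NAME.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        key = (match["sample"], match["mate"])
        groups[key].append((match["barcode"], name))
    return {key: [name for _, name in sorted(files)] for key, files in groups.items()}


def concatenate(inputs, output):
    """Byte-for-byte concatenation of uncompressed fastq files, like `cat`."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


plan = plan_merges([
    "S1_GTGCTC_1.fastq",
    "S1_ATACTC_2.fastq",
    "S1_ATACTC_1.fastq",
    "S1_GTGCTC_2.fastq",
])
for (sample, mate), files in sorted(plan.items()):
    print(sample, mate, files)
    # To actually write the merged file, one could call:
    # concatenate(files, f"{sample}_merged_{mate}.fastq")
```

The two merged files would then be passed to the aligner as a pair, the same way a single X_1 / X_2 pair would be.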


Ah, that's so helpful! Thank you very much.

written 4.7 years ago by Kristin Muench (450)

Actually, a question for clarification: I suspect that the 6 barcoded runs for each sample are not technical replicates, but that each is actually 1/6 of the volume of the total sample library (so the libraries are not the same, although they all come from the same sample). In that case, do I need to analyze each file separately and then treat 'sample' as a cofactor in any differential expression analysis, or is there still a way to combine the data?

 

EDIT: Oops - it occurs to me that this still could fit the definition of a technical replicate, in which case it would be fine to combine just as you suggested.

modified 4.7 years ago • written 4.7 years ago by Kristin Muench (450)

That's a good question about the separate library replicates. To be honest, it's a somewhat unusual design (to me), so I'm not positive what the best thing to do is. If I wanted to be completely thorough, I'd first do an analysis where I kept the 6 technical replicates separate and performed a 6 vs. 6 (and then 6 vs. 6 vs. 6 vs. ...) analysis. I'd also do some checks to make sure the 6 library replicates are always similar.
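One simple way to run such a similarity check is to compute pairwise correlations of log-transformed counts between the library replicates of a sample. The sketch below is only an illustration: the barcodes and counts are toy values, the function names are made up, and Pearson correlation on log2(count + 1) is just one possible choice of similarity measure.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def replicate_correlations(counts):
    """counts: {replicate_id: [count per gene]}, same gene order in every list.

    Returns {(rep_a, rep_b): correlation of log2(count + 1)} for every pair.
    """
    logged = {rep: [math.log2(c + 1) for c in values] for rep, values in counts.items()}
    return {
        (a, b): pearson(logged[a], logged[b])
        for a, b in combinations(sorted(logged), 2)
    }


# Toy counts for four genes in three library replicates of one sample.
counts = {
    "ATACTC": [10, 200, 0, 55],
    "GTGCTC": [12, 180, 1, 60],
    "TTAGGC": [9, 210, 0, 50],
}
correlations = replicate_correlations(counts)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

A replicate that correlates noticeably worse with the others than they do with each other would be worth a closer look before adding counts together.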

If the sequencing coverages are similar and there isn't batch-to-batch variation in the library preparation, my intuition says that just adding counts should be fine.  You could maybe start to justify that by observing that the sum of negative binomials is still a negative binomial in certain circumstances, and your experimental setup satisfies those assumptions.
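To spell out the "certain circumstances": the standard result is that independent negative binomials with a common success probability p sum to a negative binomial. The notation below (X_i for the count of a gene in library replicate i, r_i for the size parameter, G for the probability generating function) is introduced here for illustration, and whether real replicates share a common p is an assumption that would need checking.

```latex
% X_i = number of failures before the r_i-th success, success probability p
X_i \sim \mathrm{NB}(r_i, p), \quad i = 1, \dots, k, \quad \text{independent}

% Probability generating function of each X_i
G_{X_i}(z) = \left( \frac{p}{1 - (1 - p)\,z} \right)^{r_i}

% Independence makes the PGF of the sum the product of the PGFs
G_{\sum_i X_i}(z) = \prod_{i=1}^{k} G_{X_i}(z)
                  = \left( \frac{p}{1 - (1 - p)\,z} \right)^{\sum_i r_i}

% Hence
\sum_{i=1}^{k} X_i \sim \mathrm{NB}\!\left( \sum_{i=1}^{k} r_i,\; p \right)
```

If the replicates have different values of p, the product no longer collapses to a single power, and the sum is in general not negative binomial.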

written 4.7 years ago by matted (7.1k)