Question

What is the best pipeline step to merge replicates?

1

5.9 years ago

phosphodiester_bond ▴ 40

Based on your experience, at which point would you recommend merging the reads from multiple ATAC-seq replicates (biological replicates, for the most part)? And most importantly, why?

I saw that ENCODE first processes the samples independently up to the BAM files and then merges all BAM files using samtools. What would you lose or gain by merging reads at the FASTQ level? Or would you rather ignore all of this and merge the results at the very end of the pipeline (say, after you've called peaks on each of these samples independently)? My intuition is to merge these files right after trimming the reads, but I'm curious about the reasoning behind your choice of merge step.

Thanks for any feedback!

pipeline sequencing • 4.7k views

updated 5.9 years ago by Friederike 8.9k • written 5.9 years ago by phosphodiester_bond ▴ 40

0

It is not clear what kind of experiment you are talking about: RNA-seq? ChIP-seq? ATAC-seq? And what kind of replicates: biological or technical? What is the research question?

5.9 years ago by Benn 8.3k

0

Sorry about the extreme vagueness, I just edited my original post. This is ATAC-seq data that I'm pulling from GEO accessions, all uploaded by other labs. Most of these are biological replicates.

5.9 years ago by phosphodiester_bond ▴ 40

score 3 · Answer 1 · 2018-06-12

3

5.9 years ago

Friederike 8.9k

at which point would you recommend merging the reads from multiple replicates

When I'm sure that the replicates are sufficiently similar and I have decided that I don't need the information that might be gleaned from treating the replicates independently.

As b.nota's comment illustrates: there is not going to be a single answer to this. It depends on why you did the replicates in the first place, how much the replicates truly mimic each other, and what downstream analyses you're going to do.
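One way to make "sufficiently similar" concrete is to compare read counts over a shared set of regions before deciding to merge. The following is only a minimal sketch: the count vectors are made-up numbers, `replicate_concordance` is a hypothetical helper, and any cutoff you apply to the correlation is your own judgment call rather than an established standard.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(vx * vy)


def replicate_concordance(counts_a, counts_b, pseudocount=1.0):
    """Correlate log2(count + pseudocount) over the same regions in two replicates."""
    log_a = [math.log2(c + pseudocount) for c in counts_a]
    log_b = [math.log2(c + pseudocount) for c in counts_b]
    return pearson(log_a, log_b)


# Hypothetical read counts over the same six regions (illustration only).
rep1 = [120, 15, 300, 0, 48, 75]
rep2 = [110, 20, 280, 2, 52, 70]
rep3 = [10, 200, 5, 150, 300, 1]  # a replicate that disagrees with the others

r12 = replicate_concordance(rep1, rep2)
r13 = replicate_concordance(rep1, rep3)
print(f"rep1 vs rep2: {r12:.2f}")
print(f"rep1 vs rep3: {r13:.2f}")
```

In this toy case rep1 and rep2 agree closely while rep3 does not, so rep3 would be the one to look at before any merging. Real data would need far more regions and probably a rank-based measure as a cross-check.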

0

Thank you, I see now how this is probably not something with a generalizable solution (which is why I kept my original post intentionally vague).

5.9 years ago by phosphodiester_bond ▴ 40

score 1 · Answer 2 · 2018-06-12

I saw that ENCODE first processes the samples independently up to the BAM files and then merges all BAM files using samtools. What would you lose or gain by merging reads at the FASTQ level?

Maybe it's because you still want to know which sample failed sequencing (if one or several did). That is easier to detect if you align the samples separately, since the alignment rate is a good proxy for how well the sequencing went. Plus, the additional ATAC-seq QC metrics such as mitochondrial contamination, fragment size distribution, and genome coverage are easier to compute at the BAM level.
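To illustrate the per-sample QC idea, here is a rough sketch that reads the text output of `samtools idxstats` (tab-separated: reference name, length, mapped reads, unmapped reads) for one replicate and reports the mapped fraction and the mitochondrial fraction. The function names, the example numbers, and the mitochondrial contig names (`chrM`, `MT`) are assumptions for illustration; adapt them to your reference genome.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` text into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def qc_summary(text, mito_names=("chrM", "MT")):
    """Summarise one replicate: fraction of reads mapped, fraction of mapped reads on chrM."""
    rows = parse_idxstats(text)
    mapped = sum(r[2] for r in rows)
    unmapped = sum(r[3] for r in rows)
    mito = sum(r[2] for r in rows if r[0] in mito_names)
    total = mapped + unmapped
    return {
        "mapped_fraction": mapped / total if total else 0.0,
        "mito_fraction": mito / mapped if mapped else 0.0,
    }


# Made-up idxstats output for a single replicate (illustration only).
example = "\n".join([
    "chr1\t248956422\t700000\t1000",
    "chr2\t242193529\t200000\t500",
    "chrM\t16569\t100000\t0",
    "*\t0\t0\t48500",
])

summary = qc_summary(example)
print(summary)
```

Running this once per replicate BAM, before any merging, makes it easy to spot the one sample with a poor alignment rate or an unusually high mitochondrial fraction. After merging at the FASTQ level that information is no longer separable.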

Or would you rather ignore all of this and merge the results at the very end of the pipeline?

If these are biological replicates (and from different labs! That's a guaranteed batch effect right there!), why would you want to merge them at all? I would probably keep the samples separate for the most part. If you are going for differential accessibility analysis, csaw and DiffBind are both tools that can use replicates to gauge the biological (and technical!) variability.
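As a sketch of what keeping replicates separate can look like in practice, the snippet below writes a sample sheet with one row per replicate, using column names that DiffBind sample sheets commonly use (`SampleID`, `Condition`, `Replicate`, `bamReads`, `Peaks`, `PeakCaller`). The sample names, directory layout, and file names are hypothetical, and you should check the column names and the `PeakCaller` value against the DiffBind vignette for your version.

```python
import csv
import io

FIELDS = ["SampleID", "Condition", "Replicate", "bamReads", "Peaks", "PeakCaller"]

# Hypothetical design: two conditions, two biological replicates each.
samples = [("ctrl", 1), ("ctrl", 2), ("treated", 1), ("treated", 2)]


def build_sample_sheet(samples, bam_dir="bam", peak_dir="peaks"):
    """Return one row per replicate; nothing is merged across replicates."""
    rows = []
    for condition, rep in samples:
        sample_id = f"{condition}_rep{rep}"
        rows.append({
            "SampleID": sample_id,
            "Condition": condition,
            "Replicate": rep,
            "bamReads": f"{bam_dir}/{sample_id}.bam",  # assumed path layout
            "Peaks": f"{peak_dir}/{sample_id}_peaks.narrowPeak",  # assumed path layout
            "PeakCaller": "narrow",  # assumed value for narrowPeak files
        })
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


sheet = to_csv(build_sample_sheet(samples))
print(sheet)
```

The point of the layout is that each replicate keeps its own BAM and peak file, so the downstream tool can estimate variability within each condition instead of seeing one pooled sample.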