Question: Merging replicates from Encode/Roadmap project
curious wrote, 4.5 years ago:


I'm processing data from the Encode project to look at enhancer-promoter interactions. I would like to merge the replicates (technical/biological) for a given mark and cell type.

I'm not sure how to go about merging the replicates. [1] states that 'Filtered datasets were then merged appropriately (technical/biological replicates) to obtain a single consolidated sample for every histone mark or DNase-seq in each standardized epigenome.' The paper that explains [1] is this one, but it doesn't explain how the merging is done either.

Should technical replicates be merged together, and biological replicates be merged together, but never the two with each other?

The pipeline I created is: sra -> fastq -> fastq_trimmed -> sam -> bam -> bam_sorted -> counts. I'm trimming the reads so the data from the samples are uniform (36 bp), and I derive region counts using bedtools' genomecov option.
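For concreteness, the pipeline above could be sketched as shell commands. This is only one way to realize it; the tool choices (fastq-dump, trim_galore, bowtie2) and all file/index names here are assumptions, not necessarily what was used:

```shell
# Hedged sketch of the sra -> counts pipeline; SRRXXXXXXX is a placeholder
# accession and hg19_index a placeholder bowtie2 index.
SRR_ID="SRRXXXXXXX"

fastq-dump --gzip "${SRR_ID}"                              # sra -> fastq
trim_galore "${SRR_ID}.fastq.gz"                           # fastq -> fastq_trimmed
bowtie2 -x hg19_index -U "${SRR_ID}_trimmed.fq.gz" \
    -S "${SRR_ID}.sam"                                     # fastq_trimmed -> sam
samtools view -b "${SRR_ID}.sam" > "${SRR_ID}.bam"         # sam -> bam
samtools sort -o "${SRR_ID}.sorted.bam" "${SRR_ID}.bam"    # bam -> bam_sorted

# bam_sorted -> counts: coverage intervals in bedGraph form
bedtools genomecov -ibam "${SRR_ID}.sorted.bam" -bg > "${SRR_ID}.bedgraph"
```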

Thanks for reading.

chip-seq encode • 1.6k views
modified 4.5 years ago by John • written 4.5 years ago by curious

This doesn't answer the question, but I had a very similar question recently and asked a post-doc about the difference between biological and technical replicates.

Essentially, biological replicates have much larger variance than technical replicates, so it does not make much sense to merge biological replicates together. Instead, this post talks about needing biological replicates to estimate the variance and dispersion of the data.

For technical replicates I usually merge their fastq files together using cat prior to trimming and alignment, though I remember reading somewhere on Biostars that it might be better to run trimming and alignment through to a sorted BAM first and merge the replicates at that point. Someone else can clarify this, I'm sure.
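As a minimal sketch of that cat-based merge (the file names are hypothetical): gzip streams can be concatenated directly, so this also works on compressed fastq files.

```shell
# Concatenate two technical replicates of the same library into one
# fastq.gz; the result is a valid gzip stream of all reads from both.
cat rep1.fastq.gz rep2.fastq.gz > merged.fastq.gz
```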

written 4.5 years ago by Sinji

Thanks for the insight on merging technical replicates, Sinji. One paper discusses the difference between technical and biological replicates.

This actually reminds me of another question that I forgot to include earlier: how do I differentiate between a technical and a biological replicate? I have an automated pipeline to process about 3000 files and would like to identify technical/biological replicates automatically. I used the R packages GEOquery and GEOmetadb, but as far as I can tell they don't quite give out replicate information.

It would be great to know if there are others.
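One hedged option, assuming the GEOmetadb SQLite file has already been downloaded: query its gsm table directly and parse replicate labels (e.g. "rep1", "rep2") out of the free-text title/characteristics fields yourself, since replicate structure is rarely encoded as a dedicated field. The column names below follow GEOmetadb's gsm table schema; the series accession is a placeholder.

```shell
# Placeholder accession; replace GSEnnnnn with a real series of interest.
# Replicate info usually hides in the free-text title/characteristics.
sqlite3 GEOmetadb.sqlite \
  "SELECT gsm, title, characteristics_ch1 FROM gsm WHERE series_id = 'GSEnnnnn';"
```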

written 4.5 years ago by curious
John wrote, 4.5 years ago:

I am by no means the expert here, but if the question is simply "what did Encode do?" and "what should I do?", I can probably take a shot at it :)

When Encode say they merged replicates to get a consolidated sample, they mean they merged the BAM files with samtools merge or similar. For the data they produced, this is most likely a fine thing to do: read depth wasn't super high back then, and variance between individual ChIP/DNase sequencing runs is significantly lower than for RNA, particularly at the read numbers they were mapping at. For RNA-seq, however, there is essentially no good reason I can think of to merge anything.
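A minimal sketch of that kind of merge, assuming coordinate-sorted, filtered BAMs per replicate (the file names are placeholders):

```shell
# Merge the per-replicate BAMs into one consolidated sample, then index it.
samtools merge consolidated.bam rep1.sorted.bam rep2.sorted.bam
samtools index consolidated.bam
```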

Regarding "what should I do", that's a much more interesting question :-) The reason is that there aren't many tools (that I know of) that make use of ChIP/DNase replicate information. Most of the time we end up merging everything together and treating all the reads the same. Of course we do read QC individually, looking at tracks in IGV or producing heat maps per replicate, and in some scenarios we'll do input/GC-bias correction at the run level and then merge at a higher level (signal bin counts), but only because we can't merge at the read level in those situations.

However, you never know what breakthrough is around the corner, so it would be very foolish of me to suggest replicate information isn't important. Fortunately, you can still have your cake and eat it too: merge the reads into one BAM file, but tag the reads with an RGID (read-group ID) specific to their biological/technical replicate group. How useful this is depends on whether software can understand the RG field and do something useful with it. The only software I know of that does is GATK, which uses the technical-replicate information to model sequencing quality when calling SNPs. Other than GATK, though, I don't know of any software that does anything useful with replicate information, but maybe others can chime in with examples :)
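A sketch of that read-group tagging using samtools addreplacerg; the ID/SM values and file names are made up for illustration:

```shell
# Tag each replicate's reads with its own read group before merging;
# repeated -r options each add one field to the @RG header line.
samtools addreplacerg -r ID:biorep1_techrep1 -r SM:H1_H3K4me3 \
    -o rep1.rg.bam rep1.sorted.bam
samtools addreplacerg -r ID:biorep1_techrep2 -r SM:H1_H3K4me3 \
    -o rep2.rg.bam rep2.sorted.bam

# samtools merge keeps the distinct @RG lines, so each read's replicate
# of origin remains recoverable from its RG tag in the merged file.
samtools merge merged.rg.bam rep1.rg.bam rep2.rg.bam
```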

written 4.5 years ago by John

Thanks for the detailed reply, John.

"What did Encode do?" and "what should I do?": you're spot on.

Regarding RNA-seq, I posted a question in the past on processing data from ABI SOLiD technology; the answers suggest inaccuracy in the way reads are converted from colorspace to basespace. Thanks for the pointers on IGV. I'm looking at processing ~3000 such files, so ideally I would want to automate the quality check. Currently I'm using ChIPUtils, an R package that computes the PCR bottleneck coefficient and the normalized strand cross-correlation coefficient. Tagging the reads with an RGID is an interesting approach, and I'll look further into GATK.

written 4.5 years ago by curious
Powered by Biostar version 2.3.0