Advice on Handling Batch Effect in ChIP-seq Experiment
1
1
Entering edit mode
19 months ago
dmj6ab ▴ 10

I am currently working on a ChIP-seq experiment with a collaborator and I'm running into problems that may be due to the overall experimental design.

So for an overall background, I was brought on as a bioinformatics analyst after the experiment was ordered and sequenced. From what I know so far there were four experimental groups and two replicates for each experimental group. The lab did not do any input controls of any type and the experiment was done in two batches, one replicate for each experimental group in each batch.

I've run into batch effect issues before but they have all been RNA-seq related, as I do more of that type of analysis. This is my first ChIP-seq experiment and I really like the more technical aspects of the workflow, but there are so many options at each stage of pre-processing and downstream analysis that having input and opinions from people that have done this before is always great. So here is what I am doing to each sample: (these are Illumina platform paired-end reads of 150 bp I believe, they ordered the sequencing from a company that is outside our university so I don't have access to that info but I can get it)

Filtering Based on quality scores:

cmd = "zcat -c {infile} | grep -A 3 '^@.*[^:]*:N:[^:]*:' | grep -v '^\-\-\$' |gzip -c > {outfile}"


Paired-End Trimmomatic for Illumina type reads:

trimmomatic PE -phred33 {input[0]} {input[1]} {output[0]} {output[1]} {output[2]} {output[3]} ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36


Bowtie2 Paired-End Alignment using snakemake wrapper:

params:
index="mm9/mm9",  # prefix of reference genome index (built with bowtie2-build)
extra="-q --very-fast"  # optional parameters
wrapper:
"0.36.0/bio/bowtie2/align"


The command would look like this:

bowtie2 --threads 4 -q --very-fast -x mm9/mm9 {input} | samtools view -Sbh -o {output) {log}


Sort with Samtools: (a BAM File)

samtools sort {input} -o {output}


MACS2 Peak Calling:

macs2 callpeak -t {input} -n {output} -f BAM -g mm -B -q 0.01


Everything ran successfully and the MACS2-provided peak model and cross-correlation plots look clean, but when I ran the samples through phantompeakqualtools, the NSC values were around 1.04 and RSC were around 0.19 to 0.20, so terrible quality metrics. I realize that I didn't filter by reads that map uniquely to the genome, but when I did, I got only marginally better results. I also filtered by Bowtie2's MAPQ scores of 5, 10, and even 20 as per this guide's suggestions https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part0_quality_control.md, but I got similar results. Bowtie2 was able to get over 60% of the reads to map to one point on the genome in all samples so I'm not too worried about the performance of the alignment (to mm9 genome) step.

The MACS2 results were... interesting... All samples had a grand total of 100-150 called peaks. My first hunch is that I might be dealing with a protein that has very few binding sites on the genome and that partially explains the low QC scores. If anyone has worked with it before, the name of the protein being immunoprecipitated is Rev-erb-alpha. I found similar ChIP-seq experiments that reported high variability in the number of called peaks for a relative of Rev-erb-alpha, Rev-erb-beta (from 1000 to 20000+ http://genesdev.cshlp.org/content/26/7/657.full.pdf). We are using CNS tissue, but I wouldn't be surprised if similar things are happening with Rev-erb-alpha in the brain. I can try to get more info on antibodies and sample prep/IP as well.

When I did downstream analysis, the samples were largely segregated by which batch they came from. A DiffBind analysis showed no differential binding among the MACS2 called peaks. I know that DiffBind uses DESeq2 to perform differential binding so I used limma's removebatcheffect function as a first stab at the problem to no avail.

I don't have any questions on any specific steps per-se, so I didn't attach data or code. I'm interested in how other people commonly deal with batch effect in low-rep ChIP-seq experiments. I also have a few other general questions:

1. Since I have paired-end reads, should I do a paired-end alignment with Bowtie2 or should I do a single end alignment for each end and then feed them into MACS2 to perform paired-end peak-calling? Would this improve quality metrics?

2. At what stage is a good stage to deal with batch effect and does anyone have software recommendations? I'm still getting familiar with ChIP-seq in general so I'd like to learn about the space and best practices in the field.

3. Are there any glaring issues with the parameters I provided in my pre-processing workflow? I'm currently in the middle of reviewing them and I'll be tweaking them as I refine the pipeline.

4. If I can't eliminate batch effect, are there analyses that give similar insights to diff binding analysis that I could still do with single replicates?

If you need any additional info I'd be happy to provide it, thanks! I did look all over google but I haven't found too many posts or papers directly dealing with batch effect WITHIN a single ChIP-seq experiment, but some dealing with meta-analyses with higher replicates.

2
Entering edit mode
19 months ago
colin.kern ▴ 970

Even if your protein has few binding sites, you should still get good scores for NSC and RSC. Unfortunately, given everything you've provided I think the ChIP assay just didn't work very well. The protocol is difficult and can require a lot of troubleshooting and repeating of various steps to avoid producing very noisy data. One additional thing you can do to confirm if data quality is the issue would be to calculate the fraction of reads in peaks (FRiP). This is essentially telling you what proportion of your reads are coming from an actual binding event (signal) instead of noise. Another possibility is that there is a very high amount of PCR duplication. I notice you didn't mark and remove duplicates in your pipeline, which is a fairly standard procedure for ChIP-seq. If you do that and most of your reads are removed, that's a sign of over-amplification. I also notice that you don't seem to have any input/control libraries. Did the lab that generated the data create these? It's generally recommended for ChIP-seq.

0
Entering edit mode

Thanks for the insight. FRiP was one of the next few things that I was going to look at. You're right about PCR duplication, there were an obscenely high (Given that this is TF CHiP-seq) number of reads in each sample (~30 million), so PCR duplication is almost a guarantee to some extent at least. When I was testing phantompeakqualtools on some of my samples and I did do filtering for high MAPQ scores (5, 10, 20) the number of reads retained dropped to ~24-22 million and NSC and RSC didn't improve at all really. Do you have a specific strategy for removing PCR duplicates? I'm using samtools currently.

And yes! There should have been at least one input control library but my collaborators didn't prepare one. Also, have you ever worked with a CHiP-seq experiment like mine in which both replicates for all experimental groups are split between batches? This was another frustrating aspect of the experimental design for me.

1
Entering edit mode

You can use the samtools markdup command to remove duplicates. This is different from filtering by MAPQ scores, which is filtering multi-mapped reads and other low quality alignments. Another option is Picard Tools MarkDuplicates, which is equivalent to samtools unless you have multiple read groups in your bam files, in which case Picard will only remove duplicates within the same read group, whereas samtools removes all duplicates in the bam file, so ignores read group tags.

I have worked with a lot of ChIP-seq data that was generated by various people from different labs, but I unfortunately don't have a good way of removing any specific batch effects.

0
Entering edit mode

Thanks Colin for all the help. I checked with my collaborator recently and they shared the troubleshooting data with me. It seems that the antibody they were using for rev-erba had poor performance (low affinity) and the fragment lengths were anything but uniform (anywhere from 100 bases up to 1kb), this is reflected in the reports from spp/phantompeakqualtools.

For batch effect control, I ended up using the csaw R package to segment the genome into windows and get the number of overlapping reads for each window. I then used the resulting count matrix with their workflow which incorporates edgeR. I also used DESeq2 on the same counts matrix. Both edgeR and DESeq models included a batch effect parameter. I didn't get great results and the p-values were high, but it was nice to see that that the significant results obtained with edgeR partially reflected those obtained using DESeq2.

I suggested that if they attempted to repeat the experiment, then it would be a good idea to include at least one input control sample. I also made it clear that library prep and sequencing should be done in one batch whenever possible.