I am currently working on a ChIP-seq experiment with a collaborator and I'm running into problems that may be due to the overall experimental design.
So for an overall background, I was brought on as a bioinformatics analyst after the experiment was ordered and sequenced. From what I know so far there were four experimental groups and two replicates for each experimental group. The lab did not do any input controls of any type and the experiment was done in two batches, one replicate for each experimental group in each batch.
I've run into batch effect issues before but they have all been RNA-seq related, as I do more of that type of analysis. This is my first ChIP-seq experiment and I really like the more technical aspects of the workflow, but there are so many options at each stage of pre-processing and downstream analysis that having input and opinions from people that have done this before is always great. So here is what I am doing to each sample: (these are Illumina platform paired-end reads of 150 bp I believe, they ordered the sequencing from a company that is outside our university so I don't have access to that info but I can get it)
Filtering Based on quality scores:
cmd = "zcat -c {infile} | grep -A 3 '^@.*[^:]*:N:[^:]*:' | grep -v '^\-\-$' |gzip -c > {outfile}"
Paired-End Trimmomatic for Illumina type reads:
trimmomatic PE -phred33 {input[0]} {input[1]} {output[0]} {output[1]} {output[2]} {output[3]} ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Bowtie2 Paired-End Alignment using snakemake wrapper:
params:
index="mm9/mm9", # prefix of reference genome index (built with bowtie2-build)
extra="-q --very-fast" # optional parameters
threads: 4
wrapper:
"0.36.0/bio/bowtie2/align"
The command would look like this:
bowtie2 --threads 4 -q --very-fast -x mm9/mm9 {input} | samtools view -Sbh -o {output) {log}
Sort with Samtools: (a BAM File)
samtools sort {input} -o {output}
MACS2 Peak Calling:
macs2 callpeak -t {input} -n {output} -f BAM -g mm -B -q 0.01
Everything ran successfully and the MACS2-provided peak model and cross-correlation plots look clean, but when I ran the samples through phantompeakqualtools, the NSC values were around 1.04 and RSC were around 0.19 to 0.20, so terrible quality metrics. I realize that I didn't filter by reads that map uniquely to the genome, but when I did, I got only marginally better results. I also filtered by Bowtie2's MAPQ scores of 5, 10, and even 20 as per this guide's suggestions https://github.com/crazyhottommy/ChIP-seq-analysis/blob/master/part0_quality_control.md, but I got similar results. Bowtie2 was able to get over 60% of the reads to map to one point on the genome in all samples so I'm not too worried about the performance of the alignment (to mm9 genome) step.
The MACS2 results were... interesting... All samples had a grand total of 100-150 called peaks. My first hunch is that I might be dealing with a protein that has very few binding sites on the genome and that partially explains the low QC scores. If anyone has worked with it before, the name of the protein being immunoprecipitated is Rev-erb-alpha. I found similar ChIP-seq experiments that reported high variability in the number of called peaks for a relative of Rev-erb-alpha, Rev-erb-beta (from 1000 to 20000+ http://genesdev.cshlp.org/content/26/7/657.full.pdf). We are using CNS tissue, but I wouldn't be surprised if similar things are happening with Rev-erb-alpha in the brain. I can try to get more info on antibodies and sample prep/IP as well.
When I did downstream analysis, the samples were largely segregated by which batch they came from. A DiffBind analysis showed no differential binding among the MACS2 called peaks. I know that DiffBind uses DESeq2 to perform differential binding so I used limma's removebatcheffect function as a first stab at the problem to no avail.
I don't have any questions on any specific steps per-se, so I didn't attach data or code. I'm interested in how other people commonly deal with batch effect in low-rep ChIP-seq experiments. I also have a few other general questions:
Since I have paired-end reads, should I do a paired-end alignment with Bowtie2 or should I do a single end alignment for each end and then feed them into MACS2 to perform paired-end peak-calling? Would this improve quality metrics?
At what stage is a good stage to deal with batch effect and does anyone have software recommendations? I'm still getting familiar with ChIP-seq in general so I'd like to learn about the space and best practices in the field.
Are there any glaring issues with the parameters I provided in my pre-processing workflow? I'm currently in the middle of reviewing them and I'll be tweaking them as I refine the pipeline.
If I can't eliminate batch effect, are there analyses that give similar insights to diff binding analysis that I could still do with single replicates?
If you need any additional info I'd be happy to provide it, thanks! I did look all over google but I haven't found too many posts or papers directly dealing with batch effect WITHIN a single ChIP-seq experiment, but some dealing with meta-analyses with higher replicates.