Comparing two ChIP-seq libraries
3
4
Entering edit mode
9.4 years ago
ritarebollo ▴ 70

Hello,

I am new to ChIP-seq data mining (by myself) and I am having a hard time comparing two ChIP-seq data sets. Could you tell me if my pipeline is correct and if I should add a normalization step somewhere? I am simply going through this with each one of my IP-Input files and looking at the genome browser. The problem is the IP of sample B is sequenced at a bigger depth than sample A and it seems the peaks are not all being called. When I compare to previously published data it seems I am missing some peaks on sample B. I thought about changing some thresholds on sample B but then the pipeline would be different from sample A... So, what should I do to ensure all my peaks are present in sample B but the data is still comparable to sample A? Thanks!!!

Samples: H3K4me3 and Input data on sample A and H3K4me3 and Input data on sample B.

Original file: BAM (from the sequences).

My pipeline:

  • Sort pair-ended BAM files by read name: samtools sort -n c3h.input.bam c3h.input.sorted
  • Convert BAM files to FASTQ format: bam2fastx -q -Q -A -o c3h.input.fastq -P -N c3h.input.sorted.bam
  • Map reads to the genome: bowtie2 -p 8 -x ~/Data/Genomes/mm10/bt2idx/bt2idx -1 c3h.input.1.fastq -2 c3h.input.2.fastq -S c3h.input.sam&

    Bowtie2 options:

    End-to-end alignment (default) vs. local alignment (--local)

    Paired-end alignment: -1 mate1.fq -2 mate2.fq

    Concordant + discordant alignment (default) vs. concordant only (--no-discordant)

    Mixed mode: paired where possible; otherwise unpaired (default) vs. paired-only mode (--no-mixed)

    Overlapping with the other mate is concordant (default) vs. (--no-overlap)

    Containing the other mate is concordant (default) vs. (--no-contain)

    Dovetailing the other mate is discordant (default) vs. (--dovetail)

    Bowtie2 output:

    Mapping quality: MAPQ

  • Peak analysis using MACS2: macs2 callpeak -t ~/Data/BCCRC/chipseq/mapping/tt2.k4m3.sam -c ~/Data/BCCRC/chipseq/mapping/tt2.input.sam -f SAM -g mm -n tt2.k4m3 -B -q 0.01 --outdir ~/Data/BCCRC/chipseq/tt2.k4m3.out

    MACS2 options:

    -g (genome size, mm=1.87e9)

    -B/-bdg (bedGraph format for pile-up files)

    -q (The qvalue (minimum FDR) cutoff to call significant regions)

ChIP-Seq normalization samples • 11k views
ADD COMMENT
5
Entering edit mode
9.4 years ago
Ying W ★ 4.2k

I think it would be helpful if you tell us what is the biological question you are trying to answer

I am guessing what you are really trying to do is identify differential H3K4me3 occupancy between TT2 and C3H mice cell lines.

The problem is the IP of sample B is sequenced at a bigger depth than sample A and it seems the peaks are not all being called.

Did you mean IP-input of sample B is greater or less sequencing depth for IP of A? You mention that peaks in B are not being called when compared to previous results but you must keep in mind there is quite a bit of variability with chip-seq due to the number of factors that need to be controlled for.

Could you list # of raw reads and map % for your IP and IP-inputs for both samples?

Anyways, instead of trying to 'fix' the problem by mapping differently, normalization is typically done at the analysis stage. If you are interested in looking for differential histone occupancy, you will want to be using the diffpeak commands in macs2 or a more specialized tool such as RSEG

If you are interested in how normalization is done for ChIP-Seq, there are a couple papers out (and hopefully mine will be published eventually) such as NCIS and MAnorm

Here is some useful background reading:

A quick search also turned up this recent paper (which I haven't read) that might be of interest to you

Impact of sequencing depth in ChIP-seq experiments

ADD COMMENT
2
Entering edit mode
9.4 years ago
Ian 6.0k

I run bowtie2 as you have and then filter the resulting SAM files:

samtools view -f 2 -q 30 -bS in.sam > out.bam

-f 2 retains only concordant read pairs, -q 30 retains reads with mapping score of at least Q30. The latter means you can can have reads that map to multiple locations, but the mapping quality is very significant.

I recommend using the --SPMR flag in MACS2 which normalises the coverage plot values by millions of reads. This allows you to compare visually between tracks if the samples are very different in terms of total reads.

ADD COMMENT
0
Entering edit mode

Thanks for the reply!

After trying what you suggested, MACS2 failed with an error message "Too few paired peaks (0) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, we suggest to use --nomodel and --extsize 147 or other fixed number instead." I checked the default MFOLD range and it is 5 - 50, which should be good enough. Any suggestions?

ADD REPLY
0
Entering edit mode

The mfold range still relies upon the ChIP sample being adequately enriched, if not using --nomodel --extsize N will allow you to proceed. I was unclear whether you had paired reads or not. If you do then mfold is not required. If you are working with single-end reads cross-correlation can fail (be mislead) if too many reads fall into blacklist regions (https://sites.google.com/site/anshulkundaje/projects/blacklists). So removing such reads prior to peak calling can help.

ADD REPLY
1
Entering edit mode
9.4 years ago
pawelsm ▴ 10

Hi ritarebollo,

1)

Could you tell me if my pipeline is correct and if I should add a normalization step somewhere?

On bowtie2 step do you want to analyze any mapping reads or only uniquely mapping reads?

Check this site: http://biofinysics.blogspot.de/2014/05/how-does-bowtie2-assign-mapq-scores.html

or just remove reads with "XS:i:" annotation by for example

grep -v XS:i: file.sam

Normalizing or scaling is not very safe as read coverage is not really linear. You may consider generic quantile normalization.

2)

You may also use some software which allows for integration of biological replicates. One of such packages is jamm.

Cheers,
Pawel

ADD COMMENT
0
Entering edit mode

When I first started mapping PE reads I wanted to use uniquely mapped reads and used the "XS:i" method. However after comparing this method and the mapping quality method (that I mention) the quality of the former was not as good (i.e. less binding region with good fold enrichment). Also, grep'ing XS:i will leave you with a SAM file where there are orphaned reads of a pair (in my experience).

ADD REPLY
0
Entering edit mode

Yes, no biological replicates here (fail... next time though). For the uniquely mapping reads to be fair, they are not a must for me now. I'd rather use the entire data and check later for the uniquely mapping reads to verify the peaks are still there. Some people also suggested to simply to absolute value comparisons between my two samples as I am simply look for presence or absence of peaks. But since the noise/background are so different between the samples, I am a bit lost. I'll keep trying though. Thanks!

ADD REPLY

Login before adding your answer.

Traffic: 3252 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6