Question: Comparing two ChIP-seq libraries
3
gravatar for ritarebollo
3.1 years ago by
ritarebollo50
Canada
ritarebollo50 wrote:

Hello,

I am new to ChIP-seq data mining (by myself) and I am having a hard time comparing two ChIP-seq data sets. Could you tell me if my pipeline is correct and if I should add a normalization step somewhere? I am simply going through this with each one of my IP-Input files and looking at the genome browser. The problem is the IP of sample B is sequenced at a bigger depth than sample A and it seems the peaks are not all being called. When I compare to previously published data it seems I am missing some peaks on sample B. I thought about changing some thresholds on sample B but then the pipeline would be different from sample A... So, what should I do to ensure all my peaks are present in sample B but the data is still comparable to sample A? Thanks!!!

 

Samples : H3K4me3 and Input data on sample A and H3K4me3 and Input data on sample B.

Original file : BAM (from the sequences).

My pipeline :

  • Sort pair-ended BAM files by read name

samtools sort –n c3h.input.bam c3h.input.sorted

  • Convert BAM files to FASTQ format

bam2fastx -q -Q -A -o c3h.input.fastq  -P -N c3h.input.sorted.bam

  • Map reads to the genome

bowtie2 –p 8 -x ~/Data/Genomes/mm10/bt2idx/bt2idx -1 c3h.input.1.fastq -2 c3h.input.2.fastq -S c3h.input.sam&

Bowtie2 options:

End-to-end alignment (default) vs. local alignment (--local)

            Paired-end alignment: -1 mate1.fq -2 mate2.fq

            Concordant + discordant alignment (default) vs. concordant only (--no-discordant)

            Mixed mode: paired where possible; otherwise unpaired (default) vs. paired-only mode (--no-mixed)

Overlapping with the other mate is concordant (default) vs. (--no-overlap)

Containing the other mate is concordant (default) vs. (--no-contain)

Dovetailing the other mate is discordant (default) vs. (--dovetail)

 

Bowtie2 output:

            Mapping quality: MAPQ

  • Peak analysis using MACS2

macs2 callpeak -t ~/Data/BCCRC/chipseq/mapping/tt2.k4m3.sam -c ~/Data/BCCRC/chipseq/mapping/tt2.input.sam -f SAM -g mm -n tt2.k4m3 -B -q 0.01 --outdir ~/Data/BCCRC/chipseq/tt2.k4m3.out

 

MACS2 options:

            -g (genome size, mm=1.87e9)

            -B/-bdg (bedGraph format for pile-up files)

            -q (The qvalue (minimum FDR) cutoff to call significant regions)

 

chip-seq samples normalization • 5.1k views
ADD COMMENTlink modified 3.1 years ago by Ying W3.7k • written 3.1 years ago by ritarebollo50
5
gravatar for Ying W
3.1 years ago by
Ying W3.7k
South San Francisco, CA
Ying W3.7k wrote:

I think it would be helpful if you tell us what is the biological question you are trying to answer

I am guessing what you are really trying to do is identify differential H3K4me3 occupancy between TT2 and C3H mice cell lines.

" The problem is the IP of sample B is sequenced at a bigger depth than sample A and it seems the peaks are not all being called. "

Did you mean IP-input of sample B is greater or less sequencing depth for IP of A? You mention that peaks in B are not being called when compared to previous results but you must keep in mind there is quite a bit of variability with chip-seq due to the number of factors that need to be controlled for.

Could you list # of raw reads and map % for your IP and IP-inputs for both samples?

 

Anyways, instead of trying to 'fix' the problem by mapping differently, normalization is typically done at the analysis stage. If you are interested in looking for differential histone occupancy, you will want to be using the diffpeak commands in macs2 or a more specialized tool such as RSEG

If you are interested in how normalization is done for ChIP-Seq, there are a couple papers out (and hopefully mine will be published eventually) such as NCIS and MAnorm

Here is some useful background reading:

http://www.ncbi.nlm.nih.gov/pubmed/22130887 "Analyzing ChIP-seq data: preprocessing, normalization, differential identification, and binding pattern characterization."

http://www.ncbi.nlm.nih.gov/pubmed/22499706 "Normalization, bias correction, and peak calling for ChIP-seq." (stat heavy)

http://www.ncbi.nlm.nih.gov/pubmed/24244136 "Practical guidelines for the comprehensive analysis of ChIP-seq data."

http://www.ncbi.nlm.nih.gov/pubmed/25223782 "Identifying and mitigating bias in next-generation sequencing methods for chromatin biology."

A quick search also turned up this recent paper (which I haven't read) that might be of interest to you

http://www.ncbi.nlm.nih.gov/pubmed/24598259 "Impact of sequencing depth in ChIP-seq experiments."

 

ADD COMMENTlink modified 3.1 years ago • written 3.1 years ago by Ying W3.7k
1
gravatar for Ian
3.1 years ago by
Ian5.0k
University of Manchester, UK
Ian5.0k wrote:

I run bowtie2 as you have and then filter the resulting SAM files:

samtools view -f 2 -q 30 -bS in.sam > out.bam

-f 2 retains only concordant read pairs, -q 30 retains reads with mapping score of at least Q30.  The latter means you can can have reads that map to multiple locations, but the mapping quality is very significant.

I recommend using the --SPMR flag in MACS2 which normalises the coverage plot values by millions of reads.  This allows you to compare visually between tracks if the samples are very different in terms of total reads.

ADD COMMENTlink written 3.1 years ago by Ian5.0k

Thanks for the reply!

After trying what you suggested, MACS2 failed with an error message "Too few paired peaks (0) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, we suggest to use --nomodel and --extsize 147 or other fixed number instead.” I checked the default MFOLD range and it is 5 - 50, which should be good enough. Any suggestions?

 

ADD REPLYlink written 3.1 years ago by ritarebollo50

The mfold range still relies upon the ChIP sample being adequately enriched, if not using --nomodel --extsize N will allow you to proceed.  I was unclear whether you had paired reads or not.  If you do then mfold is not required.  If you are working with single-end reads cross-correlation can fail (be mislead) if too many reads fall into blacklist regions (https://sites.google.com/site/anshulkundaje/projects/blacklists).  So removing such reads prior to peak calling can help.

ADD REPLYlink written 3.0 years ago by Ian5.0k
1
gravatar for pawelsm
3.1 years ago by
pawelsm10
Germany
pawelsm10 wrote:

Hi ritarebollo

1)

>Could you tell me if my pipeline is correct and if I should add a normalization step somewhere?

On bowtie2 step do you want to analyze any mapping reads or only uniquely mapping reads?

check this site - http://biofinysics.blogspot.de/2014/05/how-does-bowtie2-assign-mapq-scores.html

or just remove reads with "XS:i:" annotation by for example

grep -v XS:i: file.sam

Normalizing or scaling is not very safe as read coverage is not really linear.  You may consider generic quantile normalization.

 

2)

You may also use some software which allows for integration of biological replicates.  On of such packages is jamm (http://www.pubfacts.com/detail/25223640/JAMM:-A-Peak-Finder-for-Joint-Analysis-of-NGS-Replicates.).

cheers,

Pawel

 

ADD COMMENTlink written 3.1 years ago by pawelsm10

When i first started mapping PE reads I wanted to use uniquely mapped reads and used the "XS:i" method.  However after comparing this method and the mapping quality method (that i mention) the quality of the former was not as good (i.e. less binding region with good fold enrichment).  Also, grep'ing XS:i will leave you with a SAM file where there are orphaned reads of a pair (in my experience).

ADD REPLYlink written 3.1 years ago by Ian5.0k

Yes, no biological replicates here (fail... next time though). For the uniquely mapping reads to be fair, they are not a must for me now. I'd rather use the entire data and check later for the uniquely mapping reads to verify the peaks are still there. Some people also suggested to simply to absolute value comparisons between my two samples as I am simply look for presence or absence of peaks. But since the noise/background are so different between the samples, I am a bit lost. I'll keep trying though. Thanks!

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by ritarebollo50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1020 users visited in the last hour