Question: Chip-Seq Remove Centromeres
gravatar for Ying W
8.7 years ago by
Ying W4.0k
South San Francisco, CA
Ying W4.0k wrote:

This might have more to do with assembly parameters but by visual inspection of my chip-seq data it seems like a lot reads mapping near centromere and telomeric regions. I was wondering if there was a way to remove these reads before calling peaks because I am getting a lot of false peaks at these regions.

I am using bwa to align (default settings)


Here is a picture of what it looks like: mypicture full image The top two tracks are input, The next 3 tracks are chip-seq results, red is microsattlite, next 3 are MACS bed output

assembly chip-seq • 5.1k views
ADD COMMENTlink modified 8.7 years ago by Ido Tamir5.0k • written 8.7 years ago by Ying W4.0k

How do you know the peaks are false? Have you tried playing with the -n parameter of bwa samse? (i.e. choose -n 1)

ADD REPLYlink written 8.7 years ago by seidel7.0k

Are you using a input DNA (or other) control set of reads in your peak calling tool? Quite often such regions appear in both samples, so negating each other.

ADD REPLYlink written 8.7 years ago by Ian5.6k

I am using input DNA for peak calling but they still call the peaks, some peaks with very very high input are ignored so it seems to be working but other peaks w/high input still are called.

ADD REPLYlink written 8.6 years ago by Ying W4.0k
gravatar for Casey Bergman
8.7 years ago by
Casey Bergman18k
Athens, GA, USA
Casey Bergman18k wrote:

First, the centromeres are not included in the reference human genome assembly. Te are unclonable and unsequencable by traditional genome sequencing methods. This is why you have a blank patch of no reads mapping in the center chromosome 7 (a metacentric -- "middle center" -- chromosome). The centromere is a physical gap in the assembly and is represented as a long stretch of NNNN's.

What you are actually seeing is reads preferentially mapping to the peri-centormeric region. In many species, the pericentromeric region is enriched in transposable elements and simple sequence repeats, because these regions are gene-poor and/or have low recombination rates. Therefore, what is probably going on here is that you have reads mapping to repetitive sequences in the peri-centromeric region. Why this would happening with BWA on defaults, which should map repetitive reads randomly to one of the repeats, is a bit curious. Neverlethess, to see if this solves your problem, try filtering to remove all repetitive reads on the XT tag of the BWA output.

ADD COMMENTlink written 8.7 years ago by Casey Bergman18k

I added a grep -v "XT:A:R" to my pipe and it took out a bit of the noise around pericentromeric regions. removing these reads seems to do better than samtools rmdup. Using samtools -q 30 since it is low coverage i pretty much lose all my peaks but the pericentromeric region is also all gone. Hope this info helps someone too

ADD REPLYlink written 8.6 years ago by Ying W4.0k

thanks for the clarification, I will try this and post results. do you think rmdup in picard would help also?

ADD REPLYlink written 8.6 years ago by Ying W4.0k
gravatar for Ido Tamir
8.7 years ago by
Ido Tamir5.0k
Ido Tamir5.0k wrote:

I am expanding a little bit Caseys explanation. You see the enrichment there because the official assembly contains them just once - where they are still divergent enough - but in the actual sequenced genome its present multiple times. Then all of these get mapped to the same location uniquely (or just once randomly - but better check your settings again), and you see this accumulation.

Since its just 23 chromosomes I would create a catalogue of regions where you see extreme enrichment in the input manually (or with IGB you can create a track above a certain treshhold) and filter the reads first by these regions before I put them into a peak finder (e.g. bedtools: intersectBed -a reads.bed -b regions.bed -v). This will ensure that reads from these regions don't screw up the statistics (e.g. background distribution, total number of reads, "highest signals").

What is strange is that your input tracks seem to have higher enrichment than your IPs. Something is wrong there (experimentally).

ADD COMMENTlink written 8.7 years ago by Ido Tamir5.0k

I did not run input on my samples, I am using the input from ENCODE (same cell line)

ADD REPLYlink written 8.6 years ago by Ying W4.0k

This is not enough. I think you have multiple technical issues there: a) The ENCODE input looks a bit strange. But this can be just visual - from the aggregation method and display or from the mapping (repeats on/off). One would have to zoom in in some regions to check if you have amplification artifacts there. There are some enrichments that seem to get higher with Dex - which should not happen with an "input sample". But chromosome scale is difficult to interpret.

ADD REPLYlink written 8.6 years ago by Ido Tamir5.0k

b) Did you check a positive control in your data yet? Do you see an enrichment in a region you know there should be something? Is this TF or histone? I don't really see enrichment appropriate for most TFs that I know off at this scale.

ADD REPLYlink written 8.6 years ago by Ido Tamir5.0k

Could you elaborate more on why the ENCODE looks strange? The ENCODE track is A439 cell line w/DEX and ETOH. Since this is a low coverage preliminary run, none of the positive controls look like they have peaks but we are in the process of validating some of the peaks right now. We ran chip-seq on a non-DNA binding TF. Removing the reads with XT:A:R tag removes some of regions with high input so it is possible that the ENCODE data did not filter out the high input regions but I have not gone back to check this yet.

ADD REPLYlink written 8.6 years ago by Ying W4.0k

I am repeating myself: as pointed out, one would have to zoom in to really check how these enrichments look like, but it seems as if some of them are responding to DEX, which an input track should not. An input track can have enrichments at repetitive elements and some other regions, but here it looks like the enrichment correlates with gene dense regions. Your own IP track looks like the ideal input track, rather flat with some occasional peaks (on this scale). I think you should invest in your own input track (it does not have to be +/- DEX).

ADD REPLYlink written 8.6 years ago by Ido Tamir5.0k

I don't know your current depht, but the first Chip-Seq papers were published with 4-5 million reads, far from saturation as we now know, but still good enough to detect many peaks. You should a) check your samples with direct PCR if you see an enrichment b) think about wheter you really expect to see an enrichment if you sequence the same samples deeper or if you rather prepare them again.

ADD REPLYlink written 8.6 years ago by Ido Tamir5.0k

i didnt see that you replied, there are enrichment differences in sample between DEX and non-DEX treatment from encode and that is a bit troubling but I dont know what to make of it (there should be no differences) the chip-seq data i am showing is preliminary just to see if the antibody worked so low # of reads (mid-3M reads off the sequencer and low-2M map after QC and processing) we are in the process of checking some of the reads w/chip-qpcr and pending the sucess of that we plan on running more reads (30-40M range)

ADD REPLYlink written 8.5 years ago by Ying W4.0k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1168 users visited in the last hour