Question: Removal of repeat and blacklist regions from ChIP-seq data
James Ashmore (UK/Edinburgh/MRC Centre for Regenerative Medicine) wrote 2.3 years ago:

When I work with mouse ChIP-seq data, I normally remove mapped reads that overlap the ENCODE blacklist regions. Previously there was no blacklist for the mm10 assembly, so I would instead lift over the coordinates from the mm9 assembly (as described in the F1000 csaw article), which gives me 3,010 regions. Recently, however, ENCODE released a blacklist for the mm10 assembly, but it contains only 164 regions. I contacted ENCODE to ask why, and this was their response:

"LiftOver is not a good strategy for transferring blacklists across assemblies. Note that the blacklists are regions that show artifacts due to deficiencies in the genome assembly (e.g. unannotated repeats). So with a better assembly, a region that was previously blacklisted won't be one any more. GRCh38 and mm10 have fewer detectable artifacts compared to GRCh37 and mm9 respectively because they are better and more complete assemblies, e.g. repeats near centromeres and telomeres are better annotated. Hence the fewer regions. This blacklist release is also a first pass for mm10 and GRCh38 with minimal manual curation. We will be releasing additional refined versions in the future that may capture additional regions."

This made me doubt the advice given in the csaw paper, and my usual processing steps. I've always followed Heng Li's advice to map to an unmasked genome, and I then usually remove reads in the lifted-over ENCODE blacklist regions. Should I instead use ENCODE's official mm10 blacklist, and then also remove predicted repeat regions from the UCSC genome annotation?

chip-seq encode blacklist • 2.1k views
ADD COMMENTlink modified 2.3 years ago by Devon Ryan90k • written 2.3 years ago by James Ashmore2.6k
Devon Ryan (Freiburg, Germany) wrote 2.3 years ago:

We've usually taken the approach of just removing peaks near blacklisted regions (we've been using a list from the DKFZ with ~400 regions) when doing peak calling and otherwise just instructing tools to ignore those regions when that's possible (e.g., in deepTools).
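The peak-level filtering described above could be sketched roughly as follows. This is only an illustration, not the pipeline used by anyone in this thread: the function names, the `slop` parameter (how many bases count as "near"), and the tuple layout `(chrom, start, end)` with BED-style half-open coordinates are all assumptions made for the example.

```python
from collections import defaultdict


def build_index(blacklist):
    """Group blacklist intervals (chrom, start, end) by chromosome."""
    index = defaultdict(list)
    for chrom, start, end in blacklist:
        index[chrom].append((start, end))
    return index


def near_blacklist(peak, index, slop=0):
    """Return True if the peak lies within `slop` bp of any blacklisted region.

    Coordinates are assumed to be BED-style (0-based, half-open).
    """
    chrom, start, end = peak
    return any(
        start < b_end + slop and end > b_start - slop
        for b_start, b_end in index.get(chrom, ())
    )


def filter_peaks(peaks, blacklist, slop=0):
    """Keep only the peaks that are not near a blacklisted region."""
    index = build_index(blacklist)
    return [peak for peak in peaks if not near_blacklist(peak, index, slop)]


# Hypothetical toy data, not taken from any real blacklist.
blacklist = [("chr1", 1000, 2000)]
peaks = [
    ("chr1", 500, 900),
    ("chr1", 1900, 2100),
    ("chr1", 2050, 2200),
    ("chr2", 1000, 2000),
]
kept_strict = filter_peaks(peaks, blacklist)
kept_padded = filter_peaks(peaks, blacklist, slop=100)
```

A linear scan per chromosome is used because blacklists are small (a few hundred to a few thousand regions); for much larger interval sets a sorted index or an interval tree would be the more natural choice.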

As an aside, I agree with the ENCODE response regarding the problematic nature of lifting over blacklisted regions. As an example, only 10% of the problematic regions in hg18 were still problematic in hg19. Somewhere Heng Li has a presentation showing further improvements in GRCh38. I would presume this is the same for the mm8 -> mm9 -> mm10 progression of releases.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Devon Ryan90k

Is there any reason you keep the mapped reads instead of filtering them out as early as possible? According to this article by Carroll et al., removing blacklisted regions can help improve cross-correlation analysis, etc.

ADD REPLYlink written 2.3 years ago by James Ashmore2.6k

It's faster to remove the peaks than to remove the reads. Our QC programs are blacklist-aware, so we also don't need to actually bother removing the reads for that either. As an aside, we find strand cross-correlation largely worthless.

ADD REPLYlink written 2.3 years ago by Devon Ryan90k

Thank you for your replies. To your point, I usually use cross-correlation to infer the fragment size. I use the calculated value for peak calling and for the deepTools analyses that accept a fragment size setting. In your opinion, is there a more accurate method to infer fragment size, or does it have so little effect on the results that it isn't that important to get it exactly right?
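For readers unfamiliar with the idea, strand cross-correlation for fragment size estimation can be sketched as below. This is a deliberately simplified version: it counts coincidences between forward-strand 5' positions and shifted reverse-strand 5' positions on a single chromosome, rather than computing a Pearson correlation of coverage vectors as dedicated tools do. All function and parameter names are made up for the example.

```python
from collections import Counter


def cross_correlation(fwd_starts, rev_ends, max_shift=500):
    """Score each shift by how many forward 5' ends line up with reverse 5' ends.

    fwd_starts: 5' positions of forward-strand reads on one chromosome.
    rev_ends:   5' positions (rightmost base) of reverse-strand reads.
    Returns a dict mapping shift -> coincidence count.
    """
    fwd = Counter(fwd_starts)
    rev = Counter(rev_ends)
    scores = {}
    for shift in range(max_shift + 1):
        scores[shift] = sum(
            n * rev.get(pos + shift, 0) for pos, n in fwd.items()
        )
    return scores


def estimate_fragment_size(fwd_starts, rev_ends, min_shift=0, max_shift=500):
    """Return the shift with the highest score, used as a fragment size estimate.

    `min_shift` lets the caller skip small shifts, for example to avoid
    picking a peak at the read length instead of the fragment length.
    """
    scores = cross_correlation(fwd_starts, rev_ends, max_shift)
    candidates = {s: v for s, v in scores.items() if s >= min_shift}
    return max(candidates, key=candidates.get)


# Hypothetical toy data: every fragment is exactly 200 bp long.
fwd_starts = [100, 250, 400, 900]
rev_ends = [pos + 200 for pos in fwd_starts]
scores = cross_correlation(fwd_starts, rev_ends)
estimate = estimate_fragment_size(fwd_starts, rev_ends)
```

On real data the curve is noisy and reads in artifact regions can dominate the counts, which is one reason blacklist filtering is sometimes applied before this step.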

ADD REPLYlink written 2.2 years ago by James Ashmore2.6k

I wouldn't worry about it being exactly correct; you could just use your Bioanalyzer output.

ADD REPLYlink written 2.2 years ago by Devon Ryan90k

For data generated by collaborators that's my preferred method. Unfortunately the bioanalyzer output isn't always given with the public data / article.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by James Ashmore2.6k

Can you share the DKFZ list you mentioned? I've looked at the initial mm10 release and the mm9 release (lifted over to mm10), and there are quite a lot of high-signal regions not covered by the initial mm10 release. Right now I'm just using both the mm9 and mm10 releases, but this blacklists ~1000 regions.
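Combining two blacklists like this amounts to taking the union of two interval sets and merging anything that overlaps. A minimal sketch, assuming each list holds `(chrom, start, end)` tuples in BED-style half-open coordinates (the function name and the toy regions are hypothetical):

```python
def merge_blacklists(*blacklists):
    """Union several blacklists, merging overlapping or adjacent regions."""
    intervals = sorted(
        tuple(interval) for blacklist in blacklists for interval in blacklist
    )
    merged = []
    for chrom, start, end in intervals:
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous region: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


# Hypothetical toy regions standing in for the lifted-over and official lists.
lifted_over = [("chr1", 100, 200), ("chr1", 500, 600)]
official = [("chr1", 150, 300), ("chr2", 10, 20)]
combined = merge_blacklists(lifted_over, official)
```

Merging first keeps the region count meaningful, since regions present in both lists are counted once rather than twice.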

ADD REPLYlink written 2.2 years ago by James Ashmore2.6k

Let me first double check that that's what people are still using internally. If so, I guess I'll email it to you since I haven't a clue if they want it spread around or not.

ADD REPLYlink written 2.2 years ago by Devon Ryan90k

I can email it to you from the office tomorrow (not sure whether the DKFZ folks want it distributed widely or not). What's a good email address to use for you (I assume firstname.lastname@ed.ac.uk would work)?

ADD REPLYlink written 2.2 years ago by Devon Ryan90k

Great, many thanks (actually s1437643@sms.ed.ac.uk would be better)

ADD REPLYlink written 2.2 years ago by James Ashmore2.6k

Would you mind sharing the DKFZ blacklist with me as well? Many thanks. myusername[at]ymail.com

ADD REPLYlink written 8 months ago by Firas0

Public versions now exist

ADD REPLYlink written 8 months ago by Devon Ryan90k