Question

Removal of repeat and blacklist regions from ChIP-seq data

5

Entering edit mode

7.2 years ago

James Ashmore ★ 3.4k

When I'm working with Mouse ChIP-seq data, I normally remove mapped reads which overlap the ENCODE blacklist regions. Previously there was no data for the mm10 assembly, so instead I would lift over the coordinates from the mm9 assembly (as described in the F1000 csaw article). When I do this I get 3,010 regions. However, recently ENCODE created a dataset for the mm10 assembly, but it only contains 164 regions. I contacted ENCODE to ask why and this was their response:

"LiftOver is not a good strategy for transferring blacklists across assemblies. Note that the blacklists are regions that show artifacts due to deficiencies in the genome assembly (e.g. unannotated repeats). So with a better assembly a region that was previously a blacklist wont be one any more. GRCh38 and mm10 have fewer detectable artifacts compared to GRCh37 and mm9 respectively because they are better and more complete assemblies e.g. repeats near centromeres and telomeres are better annotated. Hence the fewer regions. This blacklist release is also a first pass for mm10 and GRCh38 with minimal manual curation. We will be releasing additional refined versions in the future that may capture additional regions."

This made me doubt the advice given in the csaw paper, and my usual processing stages. I've always followed the advice from Heng Li that you should map to an un-masked genome. Then I usually remove the liftOver ENCODE blacklist regions. Instead should I use ENCODE's official mm10 blacklist, and then also remove predicted repeat regions from the UCSC genome annotation?

ChIP-Seq Blacklist ENCODE • 6.1k views

ADD COMMENT • link updated 3.2 years ago by Biostar 20 • written 7.2 years ago by James Ashmore ★ 3.4k

score 8 · Accepted Answer · 2017-02-22

8

Entering edit mode

7.2 years ago

Devon Ryan 104k

We've usually taken the approach of just removing peaks near blacklisted regions (we've been using a list from the DKFZ with ~400 regions) when doing peak calling and otherwise just instructing tools to ignore those regions when that's possible (e.g., in deepTools).

As an aside, I agree with the ENCODE response regarding the problematic nature of lifting over blacklisted regions. As an example, only 10% of the problematic regions in hg18 were still problematic in hg19. Somewhere Heng Li has a presentation showing further improvements in GRCh38. I would presume this is the same for the mm8 -> mm9 -> mm10 progression of releases.

ADD COMMENT • link 7.2 years ago by Devon Ryan 104k

1

Entering edit mode

Is there any reason you keep the mapped reads instead of filtering them out at the earliest convenience? According to this article by Carroll et al, removal of blacklist regions can help improve cross-correlation analysis e.t.c

ADD REPLY • link 7.2 years ago by James Ashmore ★ 3.4k

1

Entering edit mode

It's faster to remove the peaks than to remove the reads. Our QC programs are blacklist-aware, so we also don't need to actually bother removing the reads for that either. As an aside, we find strand cross-correlation largely worthless.

ADD REPLY • link 7.2 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you for your replies. To your point, I usually use cross-correlation to infer the fragment size. I use the calculated value for peak-calling and analyses within deeptools which allow fragment size settings. In your opinion is there a more accurate method to infer fragment size, or it has so little effect on the results that it's not actually that important to get perfectly correct?

ADD REPLY • link 7.2 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

I wouldn't worry about it being exactly correct, you could just use your bioanalyzer output.

ADD REPLY • link 7.2 years ago by Devon Ryan 104k

0

Entering edit mode

For data generated by collaborators that's my preferred method. Unfortunately the bioanalyzer output isn't always given with the public data / article.

ADD REPLY • link 7.2 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

Can you share the DKFZ list you mentioned? I've looked at the initial mm10 release and the mm9 release (lifted over to mm10), and there a quite a lot of high signal regions not covered by the initial mm10 release. Right now I'm just using both the mm9/mm10 release but this blacklists ~1000 regions.

ADD REPLY • link 7.1 years ago by James Ashmore ★ 3.4k

1

Entering edit mode

Let me first double check that that's what people are still using internally. If so, I guess I'll email it to you since I haven't a clue if they want it spread around or not.

ADD REPLY • link 7.1 years ago by Devon Ryan 104k

1

Entering edit mode

I can email it to you from the office tomorrow (not sure whether the DKFZ folks want it distributed widely or not). What's a good email address to use for you (I assume firstname.lastname@ed.ac.uk would work)?

ADD REPLY • link 7.1 years ago by Devon Ryan 104k

0

Entering edit mode

Great, many thanks (actually s1437643@sms.ed.ac.uk would be better)

ADD REPLY • link 7.1 years ago by James Ashmore ★ 3.4k

0

Entering edit mode

Would you mind sharing the DKFZ blacklist with me as well? Many thanks. myusername[at]ymail.com