When I'm working with mouse ChIP-seq data, I normally remove mapped reads that overlap the ENCODE blacklist regions. Previously there was no blacklist for the mm10 assembly, so I would instead lift over the coordinates from the mm9 assembly (as described in the F1000 csaw article). When I do this I get 3,010 regions. Recently, however, ENCODE released a blacklist for the mm10 assembly, but it contains only 164 regions. I contacted ENCODE to ask why, and this was their response:
"LiftOver is not a good strategy for transferring blacklists across assemblies. Note that the blacklists are regions that show artifacts due to deficiencies in the genome assembly (e.g. unannotated repeats). So with a better assembly a region that was previously a blacklist won't be one any more. GRCh38 and mm10 have fewer detectable artifacts compared to GRCh37 and mm9 respectively because they are better and more complete assemblies e.g. repeats near centromeres and telomeres are better annotated. Hence the fewer regions. This blacklist release is also a first pass for mm10 and GRCh38 with minimal manual curation. We will be releasing additional refined versions in the future that may capture additional regions."
This made me doubt both the advice given in the csaw paper and my usual processing steps. I've always followed Heng Li's advice that you should map to an unmasked genome, and I then remove the lifted-over ENCODE blacklist regions. Should I instead use ENCODE's official mm10 blacklist, and then also remove predicted repeat regions from the UCSC genome annotation?
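
For concreteness, the filtering step I mean is roughly the following. This is only a minimal pure-Python sketch, not my actual pipeline: it assumes reads and blacklist regions are given as BED-style, 0-based, half-open `(chrom, start, end)` tuples, and the function names (`build_index`, `overlaps`, `filter_reads`) and the example coordinates are made up for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(regions):
    """Group (chrom, start, end) regions by chromosome, then sort and merge them.

    Returns {chrom: (starts, ends)} where the merged regions are
    non-overlapping and sorted, so both lists are increasing.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))

    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                # Overlapping or touching: extend the previous region.
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def overlaps(index, chrom, start, end):
    """Return True if the half-open interval [start, end) hits any indexed region."""
    if chrom not in index:
        return False
    starts, ends = index[chrom]
    # Last merged region whose start lies before the end of the query.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def filter_reads(reads, blacklist):
    """Keep only the reads that do not overlap any blacklist region."""
    index = build_index(blacklist)
    return [r for r in reads if not overlaps(index, *r)]


# Hypothetical coordinates, purely for illustration.
blacklist = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 60)]
reads = [
    ("chr1", 90, 100),   # ends exactly where the region starts: kept
    ("chr1", 95, 101),   # overlaps by one base: removed
    ("chr1", 299, 350),  # overlaps the last base of the merged region: removed
    ("chr1", 300, 350),  # starts exactly where the region ends: kept
    ("chr3", 0, 10),     # chromosome has no blacklist entries: kept
    ("chr2", 55, 56),    # inside a region: removed
]
kept = filter_reads(reads, blacklist)
```

On real BAM/BED files the same operation is normally done with an existing tool, for example `bedtools intersect` with `-v`. The question here is only which set of regions to pass in as the blacklist.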