Tool:Masc - A Tool For Calculating Mean Fragment Size For Chip-Seq Peak Calling Analysis
1
6
Entering edit mode
8.2 years ago
Ian 5.7k

MaSC: Mappability-Sensitive Cross-Correlation for Estimating Fragment Length

Published article describing tool:

http://bioinformatics.oxfordjournals.org/content/29/4/444

Authors' website for obtaining code:

http://www.perkinslab.ca/Software.html

The problem that this tool is addressing:

In ChIP-seq analysis the crucial step is 'peak calling' that identifies areas of the genome where sequence reads from the chromatin-immunoprecipitation (ChIP) pulldown are enriched compared to a control (typically genomic DNA that is not enriched by an antibody). The problem is that sequencers typically only read the start or end of input DNA fragments, which means the true extent of DNA bound by the protein of interest is not known. This is important when trying to determine regions of the genome that are bound in the experiment.

The majority of peak callers (and there are many, see Micsinai *et al.* 2012 for a recent review) either try to estimate the average length of fragments submitted to the sequencer, or require the user to directly input the value. The peak-caller then extends the reads (by various methods) to simulate the DNA fragments enriched by the ChIP process. The difference in predicted fragment length can produce different and poorly overlapping sets of peaks for the same sample. MACS 1.4.2 (as an example) sometimes predicts the mean fragment length to be only slightly longer than the read length, or it gives up and sets the length as 200bp.

With MaSC Ramachandran et al. (Perkins lab, OHRI) have identified "that the mappability of different parts of the genome can introduce an artificial bias" when calculating fragment length.

A useful description of "mappability" can be found [here].

Results:

MaSC produces believable results in my hands and i shall certainly be running it in the future to compare its prediction of fragment length to that from a peak caller. The major caveat is that mappability data is required for this tool to work.

Notes on running MaSC:

MaSC is a simple to use Perl script, so should be easy to run on most systems. I personally had to download the Perl module 'GDGraph', which appeared to be distinct from 'GD' that i already had installed.

The usefulness of MaSC really depends on whether mappability data is available for your genome. The easiest way to find out is to go to the UCSC Genome Browser and find the relevant genome. Mappability (spelt mapability by the UCSC browser team) can be found in the 'Mapping and Sequencing Tracks' section of the browser track settings. If you click on the heading you will be taken to the 'Mapability Track settings', you can then select 'Downloads' and select the mappability data that matches the length of sequence reads used in your experiment, e.g. 36bp.

After downloading the bigWig file you will need to convert it in to WIG format, moreover as one file per chromosome. Thankfully you can download a script from UCSC called 'bigWigToWig'. The command line looks something like this:

bigWigToWig -chrom=chr1 wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr1.map
bigWigToWig -chrom=chr2 wgEncodeCrgMapabilityAlign36mer_mm9.bigWig mm9_36mer_chr2.map


EDIT: MaSC will now automatically convert the values in the WIG files obtained using the previous step!!! I have left the text for interest sake. However, the wig values are continuous and MaSC requires a binary 0 or 1 value, so any value less than 1 need to be set to 0. But here is a handy AWK command to do this:

awk 'OFS="\t" { temp = $4 }; { if ( temp <1 )$4=0 }; { print }' < mm9_36mer_chr1.map > mm9_36mer_binval_chr1.map
awk 'OFS="\t" { temp = $4 }; { if ( temp <1 )$4=0 }; { print }' < mm9_36mer_chr2.map > mm9_36mer_binval_chr2.map


All these files can be deposited in a directory that MaSC can read.

MaSC also requires the length of all the chromosomes, but again there is a handy tool from UCSC to do this 'fetchChromSizes.sh'.

The tool takes sequence reads in a variety of formats.

chip-seq Tool • 7.0k views
0
Entering edit mode

What did you mean by

"However, the wig values are continuous and MaSC requires a binary 0 or 1 value, so any value less than 1 need to be set to 0."

0
Entering edit mode
4.6 years ago
ATCG ▴ 300

I have ChIPseq data, single end 75bp

Based on gel analysis I estimated my fragment length to be 200bp( I know this is not accurate ) , and do not get peaks for most of my samples,

based on preliminary QC I did using deep tools I can see that I don't have enrichment ( plotFingerprint)

1.Based on my results from plotFingerprint, would having an accurate fragment length make any difference in the number of peaks that I obtain? 2.Would removing blacklisted regions of the genome make any difference?

3.Anythere suggestions ? Thanks!!

0
Entering edit mode

You should really ask this as a new question with a proper topic, this is not a typical forum-style thread but more akin to stackoverflow. See: https://www.biostars.org/info/about/