Question

Filter for regions containing large stretches of missing data

0

Entering edit mode

6.9 years ago

mmmmcandrew ▴ 190

Hey all-

I have bigwig files containing scores for each base pair of the genome where I have sequencing coverage. I have pretty good coverage, but I have recently made some heatmaps aligning this data around TSS (+/- 3kb) and set the missing data color to yellow. For the vast majority of these 6 kb regions, I have very little or no yellow (missing) data. However, there are a few cases where I have large stretches of missing data within these 6 kb regions.

Is there a quick and easy way to apply a threshold for the tolerance of missing data? For instance, I would like to say "If there is a 500 bp stretch of missing data, remove this region from consideration." Any ideas?

Thanks!

deeptools computematrix plotheatmap missing data • 1.8k views

ADD COMMENT • link updated 6.9 years ago by Alex Reynolds 35k • written 6.9 years ago by mmmmcandrew ▴ 190

score 2 · Accepted Answer · 2017-06-09

2

Entering edit mode

6.9 years ago

Devon Ryan 104k

We don't have a method to do that built into deepTools. One would need to program another script to filter this, which wouldn't be trivial without a for loop.

ADD COMMENT • link 6.9 years ago by Devon Ryan 104k

score 2 · Accepted Answer · 2017-06-09

Merge the per-base signal (scores):

$ bedops --merge per-base-signal.bed > merged-pbs.bed

Invert the merged regions to get the "mirror" image. This "mirror" is the set of gaps in between merged stretches of signal.

Assuming hg38, say:

$ fetchChromSizes hg38 | awk '{ print $1"\t0\t"$2; }' > hg38.bed
$ bedops --difference hg38.bed merged-pbs.bed > gaps-pbs.bed

This file gaps-pbs.bed contains the gaps between merged stretches of signal, as measured over the bounds of reference genome hg38.

Now map your TSS windows to those gaps with bedmap, using the --echo-map-size option to print out all the sizes of mapped elements (gaps). Pass this result to awk to test if any of the gap sizes are larger than 500 bases.

If any such gap qualifies, just skip the current line and go to the next line. If no such gap is found, then the TSS window is "okay" and gets written to standard output:

$ bedmap --echo --echo-map-size --fraction-map 1 tss-windows.bed gaps-pbs.bed | awk -F'|' '{ flag=0; n=split($2, a, ";"); for(i=1; i<=n; i++) { if (a[i] >= 500) { flag=1; break; } } if (flag==0) { print $1; } }' > filtered-tss-windows.bed

Using --fraction-map 1 ensures that any overlapping gap will be entirely contained within the TSS window. Gaps which fall off the edge of the window are not considered. This setting can be adjusted to be more lenient, if desired.

The file filtered-tss-windows.bed should now contain all TSS windows that do not entirely contain a gap greater than or equal to 500 bases in length.