Question: Filter for regions containing large stretches of missing data
0
gravatar for mmmmcandrew
3.6 years ago by
mmmmcandrew100
mmmmcandrew100 wrote:

Hey all-

I have bigwig files containing scores for each base pair of the genome where I have sequencing coverage. I have pretty good coverage, but I have recently made some heatmaps aligning this data around TSS (+/- 3kb) and set the missing data color to yellow. For the vast majority of these 6 kb regions, I have very little or no yellow (missing) data. However, there are a few cases where I have large stretches of missing data within these 6 kb regions.

Is there a quick and easy way to apply a threshold for the tolerance of missing data? For instance, I would like to say "If there is a 500 bp stretch of missing data, remove this region from consideration." Any ideas?

Thanks!

ADD COMMENTlink modified 3.6 years ago by Alex Reynolds31k • written 3.6 years ago by mmmmcandrew100
2
gravatar for Devon Ryan
3.6 years ago by
Devon Ryan98k
Freiburg, Germany
Devon Ryan98k wrote:

We don't have a method to do that built into deepTools. One would need to program another script to filter this, which wouldn't be trivial without a for loop.

ADD COMMENTlink written 3.6 years ago by Devon Ryan98k
2
gravatar for Alex Reynolds
3.6 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

Merge the per-base signal (scores):

$ bedops --merge per-base-signal.bed > merged-pbs.bed

Invert the merged regions to get the "mirror" image. This "mirror" is the set of gaps in between merged stretches of signal.

Assuming hg38, say:

$ fetchChromSizes hg38 | awk '{ print $1"\t0\t"$2; }' > hg38.bed
$ bedops --difference hg38.bed merged-pbs.bed > gaps-pbs.bed

This file gaps-pbs.bed contains the gaps between merged stretches of signal, as measured over the bounds of reference genome hg38.

Now map your TSS windows to those gaps with bedmap, using the --echo-map-size option to print out all the sizes of mapped elements (gaps). Pass this result to awk to test if any of the gap sizes are larger than 500 bases.

If any such gap qualifies, just skip the current line and go to the next line. If no such gap is found, then the TSS window is "okay" and gets written to standard output:

$ bedmap --echo --echo-map-size --fraction-map 1 tss-windows.bed gaps-pbs.bed | awk -F'|' '{ flag=0; n=split($2, a, ";"); for(i=1; i<=n; i++) { if (a[i] >= 500) { flag=1; break; } } if (flag==0) { print $1; } }' > filtered-tss-windows.bed

Using --fraction-map 1 ensures that any overlapping gap will be entirely contained within the TSS window. Gaps which fall off the edge of the window are not considered. This setting can be adjusted to be more lenient, if desired.

The file filtered-tss-windows.bed should now contain all TSS windows that do not entirely contain a gap greater than or equal to 500 bases in length.

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Alex Reynolds31k

The input file is a gzipped matrix with a json header, I don't think bedops will work for that.

ADD REPLYlink written 3.6 years ago by Devon Ryan98k

I'm just explaining one way to do things, if your inputs can be set up correctly. You may need to convert files to sorted BED with bigWigToWig, wig2bed, jq, sort-bed etc. to do set operations.

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by Alex Reynolds31k

Thanks very much guys!

ADD REPLYlink written 3.6 years ago by mmmmcandrew100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2520 users visited in the last hour
_