how to window non-continuous intervals
1
0
Entering edit mode
9.7 years ago
R ▴ 10

I want to bin 550bp intervals into 11 50bp-windows to count tags over each 50bp window. Some of the intervals are continuous, therefore binning is easy like:

chr1    3660929 3661479  550  A

but most of them are not continuous but 550bp in total like

chr1    5074482 5074644    162  B
chr1    5079090 5079192    102  B
chr1    5083444 5083533    89   B
chr1    5085696 5085809    113  B
chr1    5088110 5088194    84   B

In this case what would be the best way to make 11 50bp-windows out of these non-continuous intervals? Any suggestion?

ChIP-Seq • 1.9k views
ADD COMMENT
2
Entering edit mode
9.7 years ago

One simplistic approach could be to bedmap per-base signal across the interval, binning it every 50 bp regardless of whether the bin came from one or more subintervals. This treats your interval as contiguous, even if it isn't.

First, "melt" your 550bp interval into per-base elements:

$ awk '{ \
    chr = $1; \
    start = $2; \
    stop = $3; \
    for (binStart = start; binStart < stop; binStart++) { \
        print chr"\t"binStart"\t"(binStart + 1)"\tidx-"NR; \
    } \
}' interval.bed > per_base.bed

Then map per-base tag counts (signal) to bases (this assumes you have a per-base tag "pile-up" file called tags.bed):

$ bedmap --echo --echo-map-score --delim '\t' per_base.bed tags.bed > per_base_tags.bed

This gives you a mapping of a base to the number of tags at that base. If there are no tags, there is no score value (assume a score of 0 tags at that base, or post-process with awk to put in a zero).

Now sum the score column from this file, 50 contiguous rows at a time, to get the number of tags over each of your 50 bp bins that span the interval.

This demo only works with one interval. In the real world, you will need to generalize this process to deal with multiple intervals that may overlap.

One way to do this is to add a unique per-interval ID to each subinterval in interval.bed and modify the awk script to print this ID to the per-base output. You can then use UNIX sort on the ID column to group your per-base bedmap results by the original interval they came from. When you loop over the ID-sorted results, you can grab 50 contiguous rows at a time and sum up the tags to get your 11 bins.

ADD COMMENT
0
Entering edit mode

Thank you very much for your useful reply. I tried to make per-base tag "pile-up" as follow, but I think it is not correct, would you please have a look at my command:

samtools mpileup -l interval.bed sorted.bam | cut -f1,2 | intervalBed > pileup.bed

But for the first interval it returns

chr1    5074482 5074483 idx-1   5079090.000000
chr1    5074483 5074484 idx-1   5079090.000000
chr1    5074484 5074485 idx-1   5079090.000000
chr1    5074485 5074486 idx-1   5079090.000000
chr1    5074486 5074487 idx-1   5079090.000000
chr1    5074487 5074488 idx-1   5079090.000000
chr1    5074488 5074489 idx-1   5079090.000000
chr1    5074489 5074490 idx-1   5079090.000000
chr1    5074490 5074491 idx-1   5079090.000000
...
chr1    5074490 5074491 idx-1   5079090.000000
ADD REPLY
0
Entering edit mode

Honestly, I'm not too familiar with using samtools to calculate pile-up tracks. Definitely seems like a good question to ask the site, though!

ADD REPLY
0
Entering edit mode
cat Test | \
  awk '
    BEGIN {OFS="\t"}
    {
      if ($6 == "+") {
        print $1,$2-10,$2+10,$4,$5,$6
      }
      else if ($6 == "-") {
        print $1,$3-10,$3+10,$4,$5,$6
      }
    }' | \
  awk '
    BEGIN {OFS="\t"}
    {
      chr=$1;
      start=$2;
      end=$3;
      wind=0;
      for (binStart= start; binStart<end; binStart++) {
        win++;
        print chr,binStart, binStart+2,$4"_window"win,$5,$6
      }
      win=0;
    }' | \
  bedtools getfasta \
  -fi ~/Data/ucsc/goldenpath/danRer7/danRer7.fa \
  -bed stdin \
  -s \
  -name \
  -tab \
  -fo testOutput

Do you an easy way how I could make a tabular matrix (with 1 column and 20 rows) using bashSripts, instead of perl/R that i use to make it now.

more testOutput
rna_window1     CC
rna_window2     CG
rna_window3     GC
rna_window4     CC
rna_window5     CC
rna_window6     CT
rna_window7     TT
rna_window8     TC
rna_window9     CA
rna_window10    AT
rna_window11    TT
rna_window12    TC
rna_window13    CA
rna_window14    AC
rna_window15    CC
rna_window16    CC
rna_window17    CA
rna_window18    AC
rna_window19    CA
rna_window20    AT
rna_window1     CA
rna_window2     AC

RequiredOutputFormat

rna_window1 CC CG CC CC CT.................upto20column
ADD REPLY

Login before adding your answer.

Traffic: 2353 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6