Question: Change list of consecutive genomic coordinates to intervals?
1
gravatar for patrick.turko
10 months ago by
patrick.turko10 wrote:

I have a long list of genomic coordinates in the format chromosome:position. Most or all of these could be expressed as intervals, ie chr:start-end, because the bases are consecutive. I can think of a bunch of approaches in R, but my list is 50 million lines long and I have a lot of them. Is there a fast way to do this?

If the input data looks like this:

1:501
1:502
1:503
1:634
1:635
1:636
8:9982
8:9983
8:9984
8:9985
etc

I would like the output to look like this:

1:501-503
1:634-636
8:9982-9985
etc

The input data is in order, and each line is unique. Any ideas? I'm open to R/data.table/bioconductor, command line tools like BEDtools etc, unix utilities like awk or whatever. I would prefer to avoid python as it's not present anywhere else in this workflow.

sequence intervals R • 420 views
ADD COMMENTlink modified 10 months ago by Charles Plessy2.5k • written 10 months ago by patrick.turko10
3
gravatar for Pierre Lindenbaum
10 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum112k wrote:

using bedtools merge http://bedtools.readthedocs.io/en/latest/content/tools/merge.html

awk -F ':' '{printf("%s\t%d\t%d\n",$1,$2,1+$2);}'  input.txt|\
sort -t $'\t' -k1,1 -k2,2n |\
bedtools merge -i - |\
awk '{printf("%s:%d-%d\n",$1,$2,$3 -1);}'


1:501-503
1:634-636
8:9982-9985
ADD COMMENTlink modified 10 months ago • written 10 months ago by Pierre Lindenbaum112k

Thank you, this is very clean.

ADD REPLYlink written 10 months ago by patrick.turko10
1
gravatar for Kevin Blighe
10 months ago by
Kevin Blighe28k
USA / Europe / Brazil
Kevin Blighe28k wrote:

Hi Patrick, assuming your data is in a file called MyData.intervals you can try these commands (mixture of sed, paste, and awk; saves results into a new file called MyNewData.intervals):

sed 's/:/\t/g' MyData.intervals | awk '{chr[NR]+=$1; bp[NR]+=$2} END {for (i=1; i<=NR; i++) if(chr[i]==chr[i+1] && ((bp[i]-1)!=bp[i-1])) print chr[i]":"bp[i]; else if (chr[i]==chr[i-1] && ((bp[i]+1)!=bp[i+1])) {print bp[i]}}' | paste -s -d'-\n' > MyNewData.intervals
cat MyNewData.intervals
1:501-503
1:634-636
8:9982-9985

Kevin

ADD COMMENTlink modified 10 months ago • written 10 months ago by Kevin Blighe28k
1
gravatar for Charles Plessy
10 months ago by
Charles Plessy2.5k
Japan
Charles Plessy2.5k wrote:

With Bioconductor:

> suppressPackageStartupMessages(library(GenomicRanges))
> reduce(
    GRanges(
      seqnames = c(1,1,1,1,1,1,8,8,8,8),
      ranges   = IRanges(
                   start = c(501,502,503,634,635,636,9982,9983,9984,9985),
                   width = 1)))
GRanges object with 3 ranges and 0 metadata columns:
      seqnames       ranges strand
         <Rle>    <IRanges>  <Rle>
  [1]        1 [ 501,  503]      *
  [2]        1 [ 634,  636]      *
  [3]        8 [9982, 9985]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
ADD COMMENTlink modified 10 months ago • written 10 months ago by Charles Plessy2.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1666 users visited in the last hour