Question: Change list of consecutive genomic coordinates to intervals?
1
gravatar for patrick.turko
12 months ago by
patrick.turko10 wrote:

I have a long list of genomic coordinates in the format chromosome:position. Most or all of these could be expressed as intervals, ie chr:start-end, because the bases are consecutive. I can think of a bunch of approaches in R, but my list is 50 million lines long and I have a lot of them. Is there a fast way to do this?

If the input data looks like this:

1:501
1:502
1:503
1:634
1:635
1:636
8:9982
8:9983
8:9984
8:9985
etc

I would like the output to look like this:

1:501-503
1:634-636
8:9982-9985
etc

The input data is in order, and each line is unique. Any ideas? I'm open to R/data.table/bioconductor, command line tools like BEDtools etc, unix utilities like awk or whatever. I would prefer to avoid python as it's not present anywhere else in this workflow.

sequence intervals R • 453 views
ADD COMMENTlink modified 12 months ago by Charles Plessy2.6k • written 12 months ago by patrick.turko10
3
gravatar for Pierre Lindenbaum
12 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum114k wrote:

using bedtools merge http://bedtools.readthedocs.io/en/latest/content/tools/merge.html

awk -F ':' '{printf("%s\t%d\t%d\n",$1,$2,1+$2);}'  input.txt|\
sort -t $'\t' -k1,1 -k2,2n |\
bedtools merge -i - |\
awk '{printf("%s:%d-%d\n",$1,$2,$3 -1);}'


1:501-503
1:634-636
8:9982-9985
ADD COMMENTlink modified 12 months ago • written 12 months ago by Pierre Lindenbaum114k

Thank you, this is very clean.

ADD REPLYlink written 12 months ago by patrick.turko10
1
gravatar for Kevin Blighe
12 months ago by
Kevin Blighe31k
Republic of Ireland
Kevin Blighe31k wrote:

Hi Patrick, assuming your data is in a file called MyData.intervals you can try these commands (mixture of sed, paste, and awk; saves results into a new file called MyNewData.intervals):

sed 's/:/\t/g' MyData.intervals | awk '{chr[NR]+=$1; bp[NR]+=$2} END {for (i=1; i<=NR; i++) if(chr[i]==chr[i+1] && ((bp[i]-1)!=bp[i-1])) print chr[i]":"bp[i]; else if (chr[i]==chr[i-1] && ((bp[i]+1)!=bp[i+1])) {print bp[i]}}' | paste -s -d'-\n' > MyNewData.intervals
cat MyNewData.intervals
1:501-503
1:634-636
8:9982-9985

Kevin

ADD COMMENTlink modified 12 months ago • written 12 months ago by Kevin Blighe31k
1
gravatar for Charles Plessy
12 months ago by
Charles Plessy2.6k
Japan
Charles Plessy2.6k wrote:

With Bioconductor:

> suppressPackageStartupMessages(library(GenomicRanges))
> reduce(
    GRanges(
      seqnames = c(1,1,1,1,1,1,8,8,8,8),
      ranges   = IRanges(
                   start = c(501,502,503,634,635,636,9982,9983,9984,9985),
                   width = 1)))
GRanges object with 3 ranges and 0 metadata columns:
      seqnames       ranges strand
         <Rle>    <IRanges>  <Rle>
  [1]        1 [ 501,  503]      *
  [2]        1 [ 634,  636]      *
  [3]        8 [9982, 9985]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
ADD COMMENTlink modified 12 months ago • written 12 months ago by Charles Plessy2.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2399 users visited in the last hour