Question: Change list of consecutive genomic coordinates to intervals?
1
gravatar for patrick.turko
28 days ago by
patrick.turko10 wrote:

I have a long list of genomic coordinates in the format chromosome:position. Most or all of these could be expressed as intervals, ie chr:start-end, because the bases are consecutive. I can think of a bunch of approaches in R, but my list is 50 million lines long and I have a lot of them. Is there a fast way to do this?

If the input data looks like this:

1:501
1:502
1:503
1:634
1:635
1:636
8:9982
8:9983
8:9984
8:9985
etc

I would like the output to look like this:

1:501-503
1:634-636
8:9982-9985
etc

The input data is in order, and each line is unique. Any ideas? I'm open to R/data.table/bioconductor, command line tools like BEDtools etc, unix utilities like awk or whatever. I would prefer to avoid python as it's not present anywhere else in this workflow.

sequence intervals R • 215 views
ADD COMMENTlink modified 28 days ago by Charles Plessy2.3k • written 28 days ago by patrick.turko10
3
gravatar for Pierre Lindenbaum
28 days ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum101k wrote:

using bedtools merge http://bedtools.readthedocs.io/en/latest/content/tools/merge.html

awk -F ':' '{printf("%s\t%d\t%d\n",$1,$2,1+$2);}'  input.txt|\
sort -t $'\t' -k1,1 -k2,2n |\
bedtools merge -i - |\
awk '{printf("%s:%d-%d\n",$1,$2,$3 -1);}'


1:501-503
1:634-636
8:9982-9985
ADD COMMENTlink modified 28 days ago • written 28 days ago by Pierre Lindenbaum101k

Thank you, this is very clean.

ADD REPLYlink written 27 days ago by patrick.turko10
1
gravatar for Kevin Blighe
28 days ago by
Kevin Blighe7.2k
Republic of Ireland (Éire)
Kevin Blighe7.2k wrote:

Hi Patrick, assuming your data is in a file called MyData.intervals you can try these commands (mixture of sed, paste, and awk; saves results into a new file called MyNewData.intervals):

sed 's/:/\t/g' MyData.intervals | awk '{chr[NR]+=$1; bp[NR]+=$2} END {for (i=1; i<=NR; i++) if(chr[i]==chr[i+1] && ((bp[i]-1)!=bp[i-1])) print chr[i]":"bp[i]; else if (chr[i]==chr[i-1] && ((bp[i]+1)!=bp[i+1])) {print bp[i]}}' | paste -s -d'-\n' > MyNewData.intervals
cat MyNewData.intervals
1:501-503
1:634-636
8:9982-9985

Kevin

ADD COMMENTlink modified 28 days ago • written 28 days ago by Kevin Blighe7.2k
1
gravatar for Charles Plessy
28 days ago by
Charles Plessy2.3k
Japan
Charles Plessy2.3k wrote:

With Bioconductor:

> suppressPackageStartupMessages(library(GenomicRanges))
> reduce(
    GRanges(
      seqnames = c(1,1,1,1,1,1,8,8,8,8),
      ranges   = IRanges(
                   start = c(501,502,503,634,635,636,9982,9983,9984,9985),
                   width = 1)))
GRanges object with 3 ranges and 0 metadata columns:
      seqnames       ranges strand
         <Rle>    <IRanges>  <Rle>
  [1]        1 [ 501,  503]      *
  [2]        1 [ 634,  636]      *
  [3]        8 [9982, 9985]      *
  -------
  seqinfo: 2 sequences from an unspecified genome; no seqlengths
ADD COMMENTlink modified 28 days ago • written 28 days ago by Charles Plessy2.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1364 users visited in the last hour