I have a long list of genomic coordinates in the format chromosome:position. Most or all of these could be expressed as intervals, i.e. chr:start-end, because the bases are consecutive. I can think of a bunch of approaches in R, but each list is 50 million lines long and I have a lot of such lists. Is there a fast way to do this?
If the input data looks like this:
1:501
1:502
1:503
1:634
1:635
1:636
8:9982
8:9983
8:9984
8:9985
etc.
I would like the output to look like this:
1:501-503
1:634-636
8:9982-9985
etc.
The input data is in order, and each line is unique. Any ideas? I'm open to R/data.table/Bioconductor, command line tools like BEDtools etc., or Unix utilities like awk or whatever. I would prefer to avoid Python as it's not present anywhere else in this workflow.
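To make the desired transformation concrete, here is a minimal sketch of it as a single streaming pass in awk. It assumes one chr:pos entry per line, already sorted and unique as described above. The function name collapse_runs and the file names in the usage line are placeholders, and it has not been benchmarked on a 50-million-line input.

```shell
# collapse_runs (hypothetical name): reads sorted, unique chr:pos lines on stdin
# and prints one chr:start-end line per run of consecutive positions.
collapse_runs() {
  awk -F: '
    # A new run starts on the first line, on a chromosome change,
    # or when the position is not exactly one past the previous position.
    NR == 1 || $1 != chr || $2 != end + 1 {
      if (NR > 1) print chr ":" start "-" end   # flush the previous run
      chr = $1; start = $2
    }
    # Every line extends the current run to its own position.
    { end = $2 }
    # Flush the final run, unless the input was empty.
    END { if (NR > 0) print chr ":" start "-" end }
  '
}
```

It would be used as collapse_runs < positions.txt > intervals.txt. Only the current run is held in memory, so memory use should not grow with file size. Note that an isolated position comes out as a one-base interval such as 1:700-700 rather than 1:700.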