How to merge overlapping ranges by R/Python?
2
1
Entering edit mode
5.3 years ago

I'm just a beginner in this area so I apologize first for wasting your time with this stupid question:

I recently downloaded some data from Uniprot and want to do some domain coverage analysis, but just realized that there are some overlapping ranges that needs to be merged.

For example, my data looks like this:

no. pfamseq_acc seq_version crc64   md5 pfamA_acc   seq_start   seq_end
1857719 I1HYZ2  1   E9A7BABADB78EF7A    217559545415239f73152e983a1e4226    PF00083 11  244
1857720 I1HYZ2  1   E9A7BABADB78EF7A    217559545415239f73152e983a1e4226    PF00083 482 737
1857722 A8I2W9  1   125E4A38BE3390D8    3be8943f5916ee0fd2c80adba0af3377    PF00083 83  299
1857723 A8I2W9  1   125E4A38BE3390D8    3be8943f5916ee0fd2c80adba0af3377    PF00083 286 500
1857724 A0A0L0MB57  1   277707E4DDF29006    cf70dbe1ad2c63df7e4b6d419c5dff05    PF00083 10  222
1857725 A0A0L0MB57  1   277707E4DDF29006    cf70dbe1ad2c63df7e4b6d419c5dff05    PF00083 223 423

And what I want is:

no. pfamseq_acc seq_version crc64   md5 pfamA_acc   seq_start   seq_end
1857719 I1HYZ2  1   E9A7BABADB78EF7A    217559545415239f73152e983a1e4226    PF00083 11  244
1857720 I1HYZ2  1   E9A7BABADB78EF7A    217559545415239f73152e983a1e4226    PF00083 482 737
1857722 A8I2W9  1   125E4A38BE3390D8    3be8943f5916ee0fd2c80adba0af3377    PF00083 83  500
1857724 A0A0L0MB57  1   277707E4DDF29006    cf70dbe1ad2c63df7e4b6d419c5dff05    PF00083 10  423

I have tried to write it in R which looks like:

library(dplyr)
setwd("C:/Users/hy/Documents/Desktop/test2")
d<- read.csv("test2.tsv", header = T, sep = "\t", dec = ".")

d %>%
group_by(md5) %>%
  arrange(md5,seq_start) %>%
  mutate(indx = c(0, cumsum(as.numeric(lead(seq_start)) >
                              cummax(as.numeric(seq_end)))[-n()])) %>%
  group_by(md5,pfamseq_acc,pfamA_acc)
summairse(start = min(seq_start), end = max(seq_end))

The "summarise" could give me the correct result, but I failed to write it in a new file.

So I would appreciate it if anyone could give some ideas about how to achieve that (preferably by R or Python), any other methods is also welcomed.

Thanks.

sequence merge overlap R python • 3.4k views
ADD COMMENT
3
Entering edit mode
5.3 years ago

You could use bedops --merge to merge intervals in a sorted BED file:

$ bedops --merge in.bed > out.bed

A BED file is just a tab-delimited text file, minimally three columns, which specify chromosome, start, and stop attributes of each interval, respectively. You can export such a file from R via write.table or the like.

Using a toolkit like this will likely be faster than using a library in R or Python.

ADD COMMENT
0
Entering edit mode

Thanks for your reply. However my data is in a tsv file. I have tried to do that with "merge" function in BedTools, but it would give error if I convert my file to a BED one.

ADD REPLY
1
Entering edit mode

Just fyi and for the future, "Give error" is not a helpful error message. Please be more precise so that people can reproduce the problem.

ADD REPLY
0
Entering edit mode

Try BEDOPS, maybe, with the example given. If you need to, use awk to reorder columns into a minimally three-column file, where the first three columns are: chromosome, start, and stop position.

ADD REPLY
3
Entering edit mode
5.3 years ago
ATpoint 82k

Using Unix tools and BEDtools, reading a file in.tsv:

## Write coordinates as BED with the remaining information as one string in column 4
awk 'OFS="\t" {if ($1 == "no.") next} {print $2, $7, $8, $1"--"$6"--"$3"--"$4"--"$5}' in.tsv | \

  ## sort
  sort -k1,1 -k2,2n -k3,3n | \

  ## save intermediate file for later intersection
  tee tmp_sorted.tsv | \

  ## merge overlapping intervals
  bedtools merge -i - | \

  ## intersect the merged interval with the intermediate file to restore additional information
  bedtools intersect -a - -b tmp_sorted.tsv -wa -wb | \

  ## cut out what is relevant
  cut -f1-3,7 | \

  ## get unique list
  sort -k1,1 -k2,2n -u | \

  ## write back to original format
  awk 'OFS="\t" {gsub("--","\t");print $4, $5, $6, $7, $8, $1, $2, $3}' | \

  ## add header
  cat <(echo 'no. pfamseq_acc seq_version crc64 md5 pfamA_acc seq_start seq_end' | tr " " "\t") /dev/stdin > out.tsv && rm tmp_sorted.tsv
ADD COMMENT
0
Entering edit mode

sorry, I tried your code just now, ans it gives error as follows:

Error: unable to open file or unable to determine types for file -

-Please ensure that your file hais TAB delimited (e.g., cat -t FILE) -Also ensure that your file has integer chromosome coordinates in the expected columns (e.g., cols 2 and 3 for BED)

I have tried to use cat -t instead of cat, but it gives the same result, and I think I do have a tab delimited file cause it's TSV.

ADD REPLY
0
Entering edit mode

Your files does not seem to be properly tab-separated. Can you upload it somewhere or share a link to the source?

ADD REPLY
0
Entering edit mode

OK I have a long story to tell you but I would say "Happy New Year!" and "appreciate it for your help!" at first.

I'm sorry that I forgot to tell you my file has about 160000 lines (although I don't think it's too big to read).

Today I changed a computer and succeeded to read the file. When I tried to run your script on a test file (head 100 of the big file), everything was OK and exactly I got what I need. However, when I tried to apply it to the big file, I got some random errors:

Why I say "random" is that, sometimes it doesn't give error but the output file only has 2500-3000 lines, and sometimes it says "ERROR: 4 fields were expected at line XXX but only 1/2/3 recognized". Sorry I couldn't remember the exact words but just something like that. Anyway, I failed to run it.

Then I turned to split the files and wrote like this:

split -l 10000 ./in.tsv -d -a 4 in_
for i in *; do mv $i $i".tsv"; done
for i in *; do awk 'OFS="\t" {if ($1 == "no.") next} {print $2, $7, $8, $1"--"$6"--"$3"--"$4"--"$5}' $i | sort -k1,1 -k2,2n -k3,3n | tee tmp_sorted_$i.tsv | bedtools merge -d 1 -i - | bedtools intersect -a - -b tmp_sorted_$i.tsv -wa -wb | cut -f1-3,7 | sort -k1,1 -k2,2n -u | awk 'OFS="\t" {gsub("--","\t");print $4, $5, $6, $7, $8, $1, $2, $3}' | cat <(tr " " "\t") /dev/stdin >> out.tsv && rm tmp_sorted_$i.tsv; done #modified on your code, just a loop added.

But every time I run the code above, the out.tsv would have different number of lines, which made me confusing, I thought maybe the split file was still too big so I tried 1000, and things are good.

Then I tried for another time, split by 500, and found it a little bit different between the two results. The difference arises because the overlapped domains may be split into different slices and failed to be merged together, I think. So I turned to "diff" function and corrected the results. The results were furtherly confirmed by splits of 1500, 2500, 1181 and 1669. Then I think I get what I need.

I don't know why the problem happened, maybe it's because of my computer, but I am working on Ubuntu 18.04, and my RAM is 32GB, I will check later.

Anyway, my work is majorly based on your code, thanks again for your kind help!

ADD REPLY

Login before adding your answer.

Traffic: 2491 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6