Question

Intermediate range calculation from files

7.7 years ago

User 6777 ▴ 20

Hi all,

I have started to learn Perl and Python, but I am completely stuck on this problem, so I am asking for your help.

I have seven files, each listing number ranges per ID. I want to compare the ranges and find those common to all files. Below is an example with three files (file1.txt, file2.txt and file3.txt). The files look like this:

file1.txt:

68476204: 9-50, 55-75, 80-132
NC_23987: 2-22, 1001-1085
68473073: 1-8
68485121: 1-10, 20-55

file2.txt:

68485121: 15-45
45905121: 2-98, 201-255
68476204: 8-30, 57-77, 88-180
NC_23987: 1-18, 1021-1055
68473073: 14-44

file3.txt:

68485121: 16-42
68476204: 8-22, 55-76, 81-118

From these, I want to generate two outputs. The first contains the ranges common to all three files, after matching the ID values in the left column. For the above input, output1.txt will be:

68485121: 20-42
68476204: 9-22, 57-75, 88-118

The second output (output2.txt) contains only those common ranges whose length is >= 15. Here, output2.txt will be:

68485121: 20-42
68476204: 57-75, 88-118

Any type of suggestion is appreciated.

Thanks

perl python • 1.7k views

updated 13 months ago by Ram 43k • written 7.7 years ago by User 6777 ▴ 20

How did you calculate output1.txt? Could you explain?

7.7 years ago by second_exon ▴ 210

score 0 · Answer 1 · 2016-08-31

Convert your text files to BED files, sort them with BEDOPS sort-bed and run BEDOPS bedops --intersect on them to get the intervals common to them.

For example, file1.txt:

68476204: 9-50, 55-75, 80-132
NC_23987: 2-22, 1001-1085
68473073: 1-8
68485121: 1-10, 20-55

becomes file1.bed (when sorted):

68473073   1    8
68476204   9    50
68476204   55   75
68476204   80   132
68485121   1    10
68485121   20   55
NC_23987   2    22
NC_23987   1001 1085

And so on.

To convert from text to BED, you could use a Python script:

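#!/usr/bin/env python

import sys

# Read "id: a-b, c-d" lines from stdin; write one tab-separated BED line per range
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    (chrom, intervals_str) = line.split(':', 1)
    for interval in intervals_str.replace(' ', '').split(','):
        (start, stop) = interval.split('-')
        sys.stdout.write('%s\t%s\t%s\n' % (chrom.strip(), start, stop))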

Then:

$ convert.py < file1.txt | sort-bed - > file1.bed

Etc.

Once you have BED files, you can do set operations on the sorted BED files:

$ bedops --intersect file1.bed file2.bed file3.bed > answer.bed
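
If BEDOPS is not available, the same intersection can be sketched in pure Python, working directly on the original "id: a-b, c-d" lines. This is only a rough sketch, not what bedops does internally: the helper names (parse_ranges, intersect_two, intersect_all) are made up for illustration, and it assumes that the ranges listed for one ID within a single file do not overlap each other, and that an overlap needs start < stop to count.

```python
from functools import reduce

def parse_ranges(lines):
    """Parse 'id: a-b, c-d' lines into {id: [(a, b), ...]} with sorted ranges."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, ranges = line.split(':', 1)
        pairs = []
        for item in ranges.split(','):
            start, stop = item.strip().split('-')
            pairs.append((int(start), int(stop)))
        table[key.strip()] = sorted(pairs)
    return table

def intersect_two(a, b):
    """Intersect two sorted, non-overlapping range lists with a two-pointer sweep."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        stop = min(a[i][1], b[j][1])
        if start < stop:  # assumption: zero-length overlaps are dropped
            out.append((start, stop))
        # advance whichever range ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def intersect_all(tables):
    """Keep only IDs present in every table and intersect their ranges."""
    common_ids = set(tables[0]).intersection(*tables[1:])
    result = {}
    for key in common_ids:
        ranges = reduce(intersect_two, (t[key] for t in tables))
        if ranges:
            result[key] = ranges
    return result
```

To use it on the real files, something like intersect_all([parse_ranges(open(name)) for name in filenames]) should work, where filenames is the list of your seven input files.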

Once you have the intersection of intervals, you can filter that result with awk based on interval length:

$ awk '($3-$2 >= 15)' answer.bed > filtered_answer.bed
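
The same length filter can also be sketched in Python, together with writing the result back in the "id: a-b, c-d" layout the question asks for. As with the awk line, length is taken here as stop minus start, which is an assumption about how the >= 15 cutoff is meant. The function names are hypothetical, and the input is assumed to be a dict of {id: [(start, stop), ...]}.

```python
def filter_by_length(table, min_length=15):
    """Keep ranges with stop - start >= min_length; drop IDs left with no ranges."""
    kept = {}
    for key, ranges in table.items():
        long_enough = [(s, e) for s, e in ranges if e - s >= min_length]
        if long_enough:
            kept[key] = long_enough
    return kept

def format_table(table):
    """Render {id: [(start, stop), ...]} as 'id: a-b, c-d' lines."""
    lines = []
    for key, ranges in table.items():
        joined = ', '.join('%d-%d' % (s, e) for s, e in ranges)
        lines.append('%s: %s' % (key, joined))
    return lines

# Common ranges from the question's output1.txt
common = {
    '68485121': [(20, 42)],
    '68476204': [(9, 22), (57, 75), (88, 118)],
}
for line in format_table(filter_by_length(common)):
    print(line)
```

With the example data, 9-22 is dropped because 22 - 9 = 13, which matches output2.txt in the question.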

Etc.