Question

Overlap between ranges

6.1 years ago

waqaskhokhar999 ▴ 160

I have two tab-separated files. File_1 contains exon information (output by StringTie) and is structured like this:

e_id    chr strand  start   end rcount  ucount  mrcount cov cov_sd  mcov    mcov_sd
1   1   +   3631    3913    46  46  46  10.371  10.5056 10.371  10.5056
2   1   +   3996    4276    83  83  83  22.3559 4.7919  22.3559 4.7919
3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786

File_2 contains splice junction information (output by STAR) and is structured like this:

chr start   end strand 
1   3914    3995    1
1   4277    4485    1   
1   4496    4505    1
1   4716    5075    1

* strand (0: undefined, 1: +, 2: -)

I am interested in a script that first checks the chromosome number and then extracts those lines in which the start and end coordinates ($2 and $3) of file_2 lie within the start and end coordinates ($4 and $5) of file_1, so the expected output will be the overlapping rows from file_1 followed by the matching rows from file_2. For example, the start and end coordinates in the third and fourth rows of file_2 lie within the third and fourth rows of file_1, so the expected output will be:

e_id    chr strand  start   end rcount  ucount  mrcount cov cov_sd  mcov    mcov_sd chr start   end strand

3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294    1 4496    4505    1
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786    1 4716    5075    1

Thanks in advance

RNA-Seq • 1.5k views

ADD COMMENT • link updated 6.1 years ago by AK ★ 2.2k • written 6.1 years ago by waqaskhokhar999 ▴ 160

score 2 · Accepted Answer · 2019-06-05

6.1 years ago

AK ★ 2.2k

One idea is to make use of bedtools intersect (assuming the files are all tab separated):

# Pretend that they're in bed format
cut -f2,4,5 file_1 > file_1.bed
cut -f-3 file_2 > file_2.bed

# Use the starts and ends returned by bedtools intersect to query the original rows
grep -w -f <(bedtools intersect -a file_1.bed -b file_2.bed -wa | cut -f2,3) file_1 > file_1.intersect
grep -w -f <(bedtools intersect -a file_1.bed -b file_2.bed) file_2 > file_2.intersect

# Combine them
paste file_1.intersect file_2.intersect

Which will return:

3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294  1   4496    4505    1
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786  1   4716    5075    1

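Note that the first grep matches rows of file_1 on the start and end values only, not on the chromosome, so you should check whether the same start/end pair occurs on more than one chr in file_1, e.g. with cut -f4,5 file_1 | sort | uniq -c.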
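
For comparison, here is a minimal sketch in plain awk that compares the chromosome and the coordinates directly, so the caveat about repeated start/end pairs does not arise, and each exon row is printed on the same line as the junction it contains. It is only an illustration under several assumptions: both files are tab-separated with a single header line, strand is not compared, and "within" means junction start >= exon start and junction end <= exon end. The function name contained_junctions and the demo directory are hypothetical, and the nested loop is only reasonable for small files.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each exon row followed by every junction row
# that lies fully inside it on the same chromosome.
#   $1: exon table laid out like file_1 (chr=$2, start=$4, end=$5)
#   $2: junction table laid out like file_2 (chr=$1, start=$2, end=$3)
contained_junctions() {
  awk -F'\t' -v OFS='\t' '
    # First input (junctions): remember every data row, skipping the header.
    NR == FNR {
      if (FNR > 1) {
        n++
        jchr[n]   = $1 ""   # force a string so chromosome names compare as text
        jstart[n] = $2 + 0
        jend[n]   = $3 + 0
        jline[n]  = $0
      }
      next
    }
    # Second input (exons): skip the header, then test every stored junction.
    FNR > 1 {
      for (i = 1; i <= n; i++)
        if (jchr[i] == $2 && jstart[i] >= $4 && jend[i] <= $5)
          print $0, jline[i]
    }
  ' "$2" "$1"
}

# Demo data copied from the question; tr turns single spaces into tabs.
demo_dir=$(mktemp -d)
tr ' ' '\t' > "$demo_dir/file_1" <<'EOF'
e_id chr strand start end rcount ucount mrcount cov cov_sd mcov mcov_sd
1 1 + 3631 3913 46 46 46 10.371 10.5056 10.371 10.5056
2 1 + 3996 4276 83 83 83 22.3559 4.7919 22.3559 4.7919
3 1 + 4486 4605 47 47 47 25.2333 7.4294 25.2333 7.4294
4 1 + 4706 5095 120 120 120 23.5718 3.9786 23.5718 3.9786
EOF
tr ' ' '\t' > "$demo_dir/file_2" <<'EOF'
chr start end strand
1 3914 3995 1
1 4277 4485 1
1 4496 4505 1
1 4716 5075 1
EOF

contained_junctions "$demo_dir/file_1" "$demo_dir/file_2"
```

On the example data this prints the rows for e_id 3 and e_id 4, each followed by its junction. Unlike bedtools intersect with default options, which also reports partial overlaps, this sketch only reports junctions that are fully contained in an exon.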

ADD COMMENT • link 6.1 years ago by AK ★ 2.2k

Many thanks for your efforts. Is it possible to place the matching rows from file_2 exactly in front of the matching rows of file_1? Currently it pastes all rows from file_1 and the matched rows from file_2.

ADD REPLY • link 6.0 years ago by waqaskhokhar999 ▴ 160

Try swapping the files when you paste.
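
A small sketch of what the swap looks like, using demo copies of the two intermediate files built from the rows shown in the answer (the temporary directory and its variable name are only for illustration). Keep in mind that paste joins rows purely by line position, so this assumes both files list the matches in the same order and have the same number of lines.

```shell
#!/usr/bin/env bash
# Demo copies of file_1.intersect and file_2.intersect; tr turns spaces into tabs.
swap_dir=$(mktemp -d)
tr ' ' '\t' > "$swap_dir/file_1.intersect" <<'EOF'
3 1 + 4486 4605 47 47 47 25.2333 7.4294 25.2333 7.4294
4 1 + 4706 5095 120 120 120 23.5718 3.9786 23.5718 3.9786
EOF
tr ' ' '\t' > "$swap_dir/file_2.intersect" <<'EOF'
1 4496 4505 1
1 4716 5075 1
EOF

# file_2 columns first, then the file_1 columns of the row at the same position.
paste "$swap_dir/file_2.intersect" "$swap_dir/file_1.intersect"
```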

ADD REPLY • link 6.0 years ago by AK ★ 2.2k