Question: Overlap between ranges
0
20 months ago by
waqaskhokhar999100 wrote:

I have two tab-separated files. File_1 contains exon information (output by StringTie) and is structured like this:

e_id    chr strand  start   end rcount  ucount  mrcount cov cov_sd  mcov    mcov_sd
1   1   +   3631    3913    46  46  46  10.371  10.5056 10.371  10.5056
2   1   +   3996    4276    83  83  83  22.3559 4.7919  22.3559 4.7919
3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786

File_2 contains splice junction information (output by STAR) and is structured like this:

chr start   end strand 
1   3914    3995    1
1   4277    4485    1   
1   4496    4505    1
1   4716    5075    1

* strand (0: undefined, 1: +, 2: -)

I am looking for a script that first checks the chromosome and then extracts the rows in which the start and end coordinates of file_2 ($2 and $3) lie within the start and end coordinates of file_1 ($4 and $5). The expected output is each overlapping row from file_1 followed by the matching row from file_2. For example, the start and end coordinates in the third and fourth rows of file_2 lie within those of the third and fourth rows of file_1, so the expected output is:

e_id    chr strand  start   end rcount  ucount  mrcount cov cov_sd  mcov    mcov_sd chr start   end strand

3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294    1 4496    4505    1
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786    1 4716    5075    1

Thanks in advance

rna-seq • 350 views
2
20 months ago by
AK2.0k
Taipei
AK2.0k wrote:

One idea is to use bedtools intersect (assuming both files are tab-separated):

# Pretend that they're in bed format
cut -f2,4,5 file_1 > file_1.bed
cut -f-3 file_2 > file_2.bed

# Use the starts and ends returned by bedtools intersect to query the original rows
grep -w -f <(bedtools intersect -a file_1.bed -b file_2.bed -wa | cut -f2,3) file_1 > file_1.intersect
grep -w -f <(bedtools intersect -a file_1.bed -b file_2.bed) file_2 > file_2.intersect

# Combine them
paste file_1.intersect file_2.intersect

Which will return:

3   1   +   4486    4605    47  47  47  25.2333 7.4294  25.2333 7.4294  1   4496    4505    1
4   1   +   4706    5095    120 120 120 23.5718 3.9786  23.5718 3.9786  1   4716    5075    1

Note that the first grep matches on start and end only, so you should check whether file_1 contains the same start/end pair on different chromosomes, for example with cut -f4,5 file_1 | sort | uniq -c.
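
If duplicated coordinates across chromosomes are a concern, one way around the grep step is to compare the rows directly. The following is only a minimal awk sketch, not part of the bedtools approach above. It assumes both files are tab-separated with a single header line, treats containment as inclusive (junction start >= exon start and junction end <= exon end), matches on chromosome but ignores strand, and writes to a hypothetical file named overlaps.tsv. The sample rows are the ones from the question.

```shell
#!/usr/bin/env bash
# Sample data copied from the question, written with explicit tabs.
{
  printf 'e_id\tchr\tstrand\tstart\tend\trcount\tucount\tmrcount\tcov\tcov_sd\tmcov\tmcov_sd\n'
  printf '1\t1\t+\t3631\t3913\t46\t46\t46\t10.371\t10.5056\t10.371\t10.5056\n'
  printf '2\t1\t+\t3996\t4276\t83\t83\t83\t22.3559\t4.7919\t22.3559\t4.7919\n'
  printf '3\t1\t+\t4486\t4605\t47\t47\t47\t25.2333\t7.4294\t25.2333\t7.4294\n'
  printf '4\t1\t+\t4706\t5095\t120\t120\t120\t23.5718\t3.9786\t23.5718\t3.9786\n'
} > file_1
{
  printf 'chr\tstart\tend\tstrand\n'
  printf '1\t3914\t3995\t1\n'
  printf '1\t4277\t4485\t1\n'
  printf '1\t4496\t4505\t1\n'
  printf '1\t4716\t5075\t1\n'
} > file_2

# Pass 1 (file_2): store every junction under its chromosome.
# Pass 2 (file_1): for each exon, print the exon row followed by each
# junction on the same chromosome that lies entirely inside the exon.
awk 'BEGIN { FS = OFS = "\t" }
NR == FNR {
    if (FNR > 1) {              # skip the file_2 header
        n[$1]++
        js[$1, n[$1]]  = $2 + 0 # junction start (forced numeric)
        je[$1, n[$1]]  = $3 + 0 # junction end (forced numeric)
        row[$1, n[$1]] = $0     # full file_2 row for output
    }
    next
}
FNR == 1 { next }               # skip the file_1 header
{
    for (i = 1; i <= n[$2]; i++)
        if (js[$2, i] >= $4 + 0 && je[$2, i] <= $5 + 0)
            print $0, row[$2, i]
}' file_2 file_1 > overlaps.tsv

cat overlaps.tsv
```

Because each file_1 row is printed together with its own matching file_2 row, the pairing does not depend on paste or on row order. The loop compares every exon against every junction on the same chromosome, so for large files bedtools is likely to be the faster choice.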


Many thanks for your efforts. Is it possible to place the matching rows from file_2 exactly in front of the matching rows of file_1? Currently it pastes all rows from file_1 together with only the matched rows from file_2.

Comment written 18 months ago by waqaskhokhar999

Try swapping the files when you paste.
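
A rough sketch of what swapping looks like, using shortened, hypothetical stand-ins for file_1.intersect and file_2.intersect (the real files would hold the full rows). Since paste joins line N of one file with line N of the other, the sketch also checks that both files have the same number of rows; if they do not, the pasted rows will not correspond. The output name swapped.tsv is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical, shortened stand-ins for the two intermediate files.
printf '3\t1\t+\t4486\t4605\n4\t1\t+\t4706\t5095\n' > file_1.intersect
printf '1\t4496\t4505\t1\n1\t4716\t5075\t1\n' > file_2.intersect

# paste pairs rows purely by position, so compare the row counts first.
rows_1=$(wc -l < file_1.intersect | tr -d ' ')
rows_2=$(wc -l < file_2.intersect | tr -d ' ')
if [ "$rows_1" -ne "$rows_2" ]; then
    echo "row counts differ ($rows_1 vs $rows_2); pasted rows will not line up" >&2
fi

# Swapped order: file_2 columns first, then file_1 columns.
paste file_2.intersect file_1.intersect > swapped.tsv
cat swapped.tsv
```

If the row counts differ, the grep step has probably matched extra rows in one of the files, and swapping the paste order alone will not fix the pairing.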

Comment written 18 months ago by AK