Question

Extract rows having the 11th column values lies between 2nd and 3nd of a second file if 1st column matches

0

Entering edit mode

5.5 years ago

i.jabre26 ▴ 10

hello,

I have two files :

File1 :

chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255 + -   6610258.00
chr5 26610220 26610295 5 26610221 26610296 K00230:40:HNWJLBBXX:4:1101:1022:24155 255 + -  220311210.00

File 2:

chr5   20311200    20311220   Nucleosome:1    110    5.0    39.9    MainPeak    1.43492858    0.68583064
chr5    801    861    Nucleosome:2    70    1.0    5.4    MainPeak    0.17076187    0.806538035
chr5    1021    1091    Nucleosome:3    80    2.0    14.4    MainPeak    0.42430331    0.481579895
chr5    1181    1251    Nucleosome:4    80    1.0    7.5    MainPeak    0.1362587    0.32626102999999995

I'm interested in printing rows from file 1 using a python code if the values of 11th column falls within the range start and end (2nd and 3rd columns )declared in the seconds file. As the position is only unique within a certain chromosome (chr) first it has to be tested if the chr's are identical... hence my desired output is

chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255 + - 20311210.00

I have tried awk codes.. it works perfectly fine but they are very very slow !

The files I'm testing ( from which I need to print the rows are around 4 GB ).

I would highly appreciate if I can have some python or perl code

Thanks !

python perl • 1.3k views

ADD COMMENT • link 5.5 years ago by i.jabre26 ▴ 10

2

Entering edit mode

reformat File1 to generate a bed with awkand use bedtools intersect ...

ADD REPLY • link 5.5 years ago by Pierre Lindenbaum 161k

1

Entering edit mode

I have tried awk codes

Show them please.

ADD REPLY • link 5.5 years ago by ATpoint 82k

1

Entering edit mode

You are interested in:

I'm interested in printing rows from file 1 using a python code

you tried:

I have tried awk codes..

?

I'm interested in printing rows from file 1 if the values of 11th column falls within the range start and end (2nd and 3rd columns )declared in the seconds file.

values in 11th column of file 1 are 6610258.00 and 220311210.00. start and end coordinates in second file do not overlap at all with 11 column of file 1. Moreover, it is interesting to see that the coordinates in 11th column in file 1 are floats. Output last column value "20311210.00" doesn't appear in both the input files except in output.

All in all, this seems to be xy problem to me.

here is the python code for a logical problem:

Print all lines from file 2, when last (11th) column of file 1 is between start (2nd) and stop (3rd) coordinates in a second file:

file1:

$ cat test1.txt 
chr5    20311169    20311244    5   20311177    20311251    K00230:40:HNWJLBBXX:4:1101:1002:35936   255 +   -   6610258
chr5    26610220    26610295    5   26610221    26610296    K00230:40:HNWJLBBXX:4:1101:1022:24155   255 +   -   20311210

file 2:

$ cat test2.txt 
chr5    20311200    20311220    Nucleosome:1    110 5.0 39.9    MainPeak    1.43492858  0.68583064
chr5    801 861 Nucleosome:2    70  1.0 5.4 MainPeak    0.17076187  0.806538035
chr5    1021    1091    Nucleosome:3    80  2.0 14.4    MainPeak    0.42430331  0.481579895
chr5    1181    1251    Nucleosome:4    80  1.0 7.5 MainPeak    0.1362587   0.32626102999999995

code:

import os
import pandas as pd
from pandasql import sqldf as psql
test1=pd.read_csv("test1.txt", header=None, sep="\t")
test2=pd.read_csv("test2.txt", header=None, sep="\t")
test1.columns=["a","b","c","d",'e',"f","g","h","i","j","k"]
test2.columns=["l","m","n","o",'p',"q",'r','s','t',"u"]
psql ('select test1.* from test1 join test2 on test1.a=test2.l and test1.k between test2.m and test2.n')

output:

a   b   c   d   e   f   g   h   i   j   k

0   chr5    26610220    26610295    5   26610221    26610296    K00230:40:HNWJLBBXX:4:1101:1022:24155   255     +   -   20311210

ADD REPLY • link 5.5 years ago by cpad0112 21k

0

Entering edit mode

Thank you for sharing the code. It is helpful

ADD REPLY • link 5.5 years ago by i.jabre26 ▴ 10

0

Entering edit mode

awk '
NR==FNR{ range[$1,$2,$3]; next }
FNR==1
{
    for(x in range) {
        split(x, check, SUBSEP); 
        if($1==check[1] && $11>=check[2] && $11<=check[3]) print $0
    }
}    
' file2 file1

This is the awk code I used.. It is working perfectly fine but it very slow !!!

ADD REPLY • link updated 5.5 years ago by Ram 43k • written 5.5 years ago by i.jabre26 ▴ 10

1

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.
code_formatting

ADD REPLY • link 5.5 years ago by Ram 43k

0

Entering edit mode

Thank you for the information

ADD REPLY • link 5.5 years ago by i.jabre26 ▴ 10

0

Entering edit mode

Python and perl will not be faster than awk...

ADD REPLY • link 5.5 years ago by Joe 21k