Question: Extract rows whose 11th-column value lies between the 2nd and 3rd columns of a second file when the 1st column matches
9 days ago, i.jabre26 wrote:

Hello,

I have two files:

File1:

chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255 + -   6610258.00
chr5 26610220 26610295 5 26610221 26610296 K00230:40:HNWJLBBXX:4:1101:1022:24155 255 + -  220311210.00

File 2:

chr5   20311200    20311220   Nucleosome:1    110    5.0    39.9    MainPeak    1.43492858    0.68583064
chr5    801    861    Nucleosome:2    70    1.0    5.4    MainPeak    0.17076187    0.806538035
chr5    1021    1091    Nucleosome:3    80    2.0    14.4    MainPeak    0.42430331    0.481579895
chr5    1181    1251    Nucleosome:4    80    1.0    7.5    MainPeak    0.1362587    0.32626102999999995

I'm interested in printing rows from file 1 using Python code if the value in the 11th column falls within the start-end range (2nd and 3rd columns) declared in the second file. Because a position is only unique within a given chromosome (chr), it first has to be tested whether the chromosomes are identical. Hence my desired output is:

chr5 20311169 20311244 5 20311177 20311251 K00230:40:HNWJLBBXX:4:1101:1002:35936 255 + - 20311210.00

I have tried awk codes. They work perfectly fine, but they are very slow!

The files I'm testing (from which I need to print the rows) are around 4 GB.

I would highly appreciate some Python or Perl code.

Thanks !

python perl • 126 views
modified 9 days ago • written 9 days ago by i.jabre26 (10)
2

Reformat File1 with awk to generate a BED file, then use bedtools intersect ...

modified 9 days ago • written 9 days ago by Pierre Lindenbaum (114k)
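
One possible reading of this suggestion, sketched in Python rather than awk: each File1 row becomes a one-base BED interval built from its 11th column, so that an interval tool can test it against File2. The function name `to_bed_line`, the choice to treat the 11th column as a 1-based position, and the idea of packing the original row into the BED name field joined by `|` are all assumptions made for illustration; none of them comes from the thread.

```python
def to_bed_line(row, column=10):
    """Convert one whitespace-separated File1 row into a 4-column BED line.

    Assumption: the value in `column` (0-based index, so 10 is the 11th
    column) is a 1-based position, which maps to the half-open BED
    interval [pos - 1, pos).
    """
    fields = row.split()
    # Values in the post look like "6610258.00", so parse as float first
    pos = int(float(fields[column]))
    # Keep the whole original row in the name field so it can be recovered later
    name = "|".join(fields)
    return "\t".join([fields[0], str(pos - 1), str(pos), name])


row = ("chr5 20311169 20311244 5 20311177 20311251 "
       "K00230:40:HNWJLBBXX:4:1101:1002:35936 255 + - 6610258.00")
print(to_bed_line(row))
```

The converted file could then be passed to something like `bedtools intersect -a converted.bed -b file2.bed -wa`, where both file names are placeholders. Whether File2's start and end are 0-based or 1-based is not stated in the thread, so the boundary handling should be checked before relying on the result.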
1

I have tried awk codes

Show them please.

modified 9 days ago • written 9 days ago by ATpoint (9.3k)
1

You are interested in:

I'm interested in printing rows from file 1 using a python code

you tried:

I have tried awk codes..

?

I'm interested in printing rows from file 1 if the value in the 11th column falls within the start-end range (2nd and 3rd columns) declared in the second file.

The values in the 11th column of file 1 are 6610258.00 and 220311210.00. Neither value falls within any start-end range in the second file. Moreover, it is interesting that the coordinates in the 11th column of file 1 are floats. The last-column value in the desired output, "20311210.00", does not appear in either input file; it appears only in the output.

All in all, this looks like an XY problem to me.

Here is Python code for the logical problem:

Print all lines from file 1 whose last (11th) column lies between the start (2nd) and stop (3rd) coordinates of a row in the second file with the same chromosome:

file1:

$ cat test1.txt 
chr5    20311169    20311244    5   20311177    20311251    K00230:40:HNWJLBBXX:4:1101:1002:35936   255 +   -   6610258
chr5    26610220    26610295    5   26610221    26610296    K00230:40:HNWJLBBXX:4:1101:1022:24155   255 +   -   20311210

file 2:

$ cat test2.txt 
chr5    20311200    20311220    Nucleosome:1    110 5.0 39.9    MainPeak    1.43492858  0.68583064
chr5    801 861 Nucleosome:2    70  1.0 5.4 MainPeak    0.17076187  0.806538035
chr5    1021    1091    Nucleosome:3    80  2.0 14.4    MainPeak    0.42430331  0.481579895
chr5    1181    1251    Nucleosome:4    80  1.0 7.5 MainPeak    0.1362587   0.32626102999999995

code:

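import pandas as pd
from pandasql import sqldf as psql

# Read both tab-separated files; neither has a header row
test1 = pd.read_csv("test1.txt", header=None, sep="\t")
test2 = pd.read_csv("test2.txt", header=None, sep="\t")
# Name the columns so they can be referenced in the SQL query
test1.columns = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"]
test2.columns = ["l", "m", "n", "o", "p", "q", "r", "s", "t", "u"]
# Keep test1 rows whose chromosome (a) matches test2 (l) and whose
# 11th column (k) lies between start (m) and end (n) of that test2 row
psql('select test1.* from test1 join test2 on test1.a=test2.l and test1.k between test2.m and test2.n')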

output:

a   b   c   d   e   f   g   h   i   j   k

0   chr5    26610220    26610295    5   26610221    26610296    K00230:40:HNWJLBBXX:4:1101:1022:24155   255     +   -   20311210
modified 9 days ago • written 9 days ago by cpad0112 (10.0k)

Thank you for sharing the code. It is helpful.

written 8 days ago by i.jabre26 (10)
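awk '
# First file (file2): store each chr/start/end triple as an array key
NR==FNR{ range[$1,$2,$3]; next }
# Second file (file1): this bare pattern prints the first line unconditionally
FNR==1
# For every line of file1, loop over ALL stored ranges (this is the slow part)
{
    for(x in range) {
        split(x, check, SUBSEP);
        if($1==check[1] && $11>=check[2] && $11<=check[3]) print $0
    }
}
' file2 file1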

This is the awk code I used. It works perfectly fine, but it is very slow!

modified 9 days ago by RamRS (19k) • written 9 days ago by i.jabre26 (10)
1

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

written 9 days ago by RamRS (19k)

Thank you for the information

written 8 days ago by i.jabre26 (10)

Python and Perl will not be faster than awk...

written 9 days ago by jrj.healey (8.6k)
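
The choice of language probably matters less here than the algorithm: the awk script posted in this thread compares every line of file1 against every range from file2. A minimal Python sketch of one alternative follows, in which the ranges are sorted once per chromosome and each value is then located by binary search with the standard `bisect` module. The function names `build_index` and `matching_rows` are made up for this illustration, and the sketch assumes whitespace-separated columns with inclusive start and end coordinates, as in the awk script.

```python
import bisect
from collections import defaultdict


def build_index(range_lines):
    """Group (start, end) ranges by chromosome and prepare them for binary search."""
    by_chrom = defaultdict(list)
    for line in range_lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        by_chrom[fields[0]].append((float(fields[1]), float(fields[2])))
    index = {}
    for chrom, ranges in by_chrom.items():
        ranges.sort()
        starts = [start for start, _ in ranges]
        # max_end[i] is the largest end among the first i + 1 sorted ranges,
        # which keeps the lookup correct even when ranges overlap
        max_end = []
        running = float("-inf")
        for _, end in ranges:
            running = max(running, end)
            max_end.append(running)
        index[chrom] = (starts, max_end)
    return index


def matching_rows(data_lines, index, column=10):
    """Yield lines whose value in `column` (0-based) lies inside a range on the same chromosome."""
    for line in data_lines:
        fields = line.split()
        if len(fields) <= column or fields[0] not in index:
            continue
        starts, max_end = index[fields[0]]
        value = float(fields[column])
        # Position of the last range whose start is <= value
        i = bisect.bisect_right(starts, value) - 1
        if i >= 0 and max_end[i] >= value:
            yield line.rstrip("\n")


# Small demo using values from the test files shown earlier in the thread
file2 = ["chr5 20311200 20311220 Nucleosome:1", "chr5 801 861 Nucleosome:2"]
file1 = [
    "chr5 20311169 20311244 5 20311177 20311251 read1 255 + - 6610258",
    "chr5 26610220 26610295 5 26610221 26610296 read2 255 + - 20311210",
]
for row in matching_rows(file1, build_index(file2)):
    print(row)
```

For the real files, open file handles can be passed in place of the lists; file1 is then read line by line, so the 4 GB file is never held in memory. Unlike the awk script, this sketch reports a row once even when it falls inside several ranges.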