Question: extract lines with multiple conditions on multiple fields/column
0
gravatar for Sam
6 months ago by
Sam20
canada
Sam20 wrote:

I have a big file contains a huge amount of SNP data, I like to search multiple columns ( here column 3 to 5) and filter any lines contain values between 0.5 to 1.

    SNP NN1 NN2 NN3
1   AG  0.0 0.0 1.0
2   CC  0.2 0.0 2.0
3   CC  0.3 0.5 0.0
4   AA  0.5 0.5 1.0
5   GG  0.4 0.0 0.0

expected output;

    SNP NN1 NN2 NN3
1   AG  0.0 0.0 1.0
3   CC  0.3 0.5 0.0
4   AA  0.5 0.5 1.0

I could find awk scripts that use a single field but not multiple field/column for query. ex; awk -F "," '$3 >= 0.5 && $3 <= 1' input.csv > out.csv

Any help will be appreciated. Thanks a lot.

awk sed grep • 274 views
ADD COMMENTlink modified 6 months ago by Arup Ghosh2.2k • written 6 months ago by Sam20
1

How to add images to a Biostars post

Edit by ATpoint : You have to use the image button (right of 10101 in the format bar), pasting the full path/link to the picture including the suffix so https://i.ibb.co/92CjrYk/bst.png instead of the link you originally posted.

ADD REPLYlink modified 6 months ago by ATpoint30k • written 6 months ago by WouterDeCoster43k

Thanks, Arup for formatting the data. Also it would be helpful to know how to paste excel data on Biostar effectively.

not sure why the data looks messy though it looks good on the preview

ADD REPLYlink modified 6 months ago • written 6 months ago by Sam20

You can share data snippets in Github and link that to biostars thread.

ADD REPLYlink written 6 months ago by Arup Ghosh2.2k
1

Grouping the conditions and adding the "or" operator || should do the trick

awk -F "," '($3 >= 0.5 && $3 <= 1) || ($4 >= 0.5 && $4 <= 1) || ($5 >= 0.5 && $5 <= 1)' input.csv > out.csv

ADD REPLYlink written 6 months ago by Sus20

Thanks, Sus... But I have > 100 columns, is that possible to search a range of columns as a condition. for example. choose between $3 to $100

something like this ... awk -F "," '($3:$100 >= 0.5 && $3:$100 <= 1) input.csv > out.csv
ADD REPLYlink written 6 months ago by Sam20
1

I suggest you embed this in a for loop within awk with an option to exit the look (= go to next line) if any of the conditions is not fulfilled, basically iterating from the 3rd to the nth column.

ADD REPLYlink written 6 months ago by ATpoint30k

ADD REPLYlink modified 6 months ago • written 6 months ago by Sam20
7
gravatar for Arup Ghosh
6 months ago by
Arup Ghosh2.2k
India
Arup Ghosh2.2k wrote:

The following awk command will loop through all the columns.

awk -F"," 'NR==1; NR > 1{for(i=2; i <= NF; i++) if($i >= 0.5 && $i <=1) {print; next}}' data.csv

Input: data.csv

SNP,NN1,NN2,NN3
AG,0.0,0.0,1.0
CC,0.2,0.0,2.0
CC,0.3,0.5,0.0
AA,0.5,0.5,1.0
GG,0.4,0.0,0.0

Output:

SNP,NN1,NN2,NN3
AG,0.0,0.0,1.0
CC,0.3,0.5,0.0
AA,0.5,0.5,1.0
ADD COMMENTlink written 6 months ago by Arup Ghosh2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1520 users visited in the last hour