Question

Filtering out lines (rows) based on repeated zero values with awk in Linux

5.4 years ago

rgescudero ▴ 30

I want to filter out lines that have zero values in 70% or more of the columns. Imagine I have the following “test_awk.txt” file:

id sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
gene1 1       2       3       4       5       6       0       0       0       0
gene2 0       0       0       0       0       0       0       0       0       0
gene3 0       0       0       0       1       0       0       0       0       0
gene4 0       0       0       10      0       10      0       0       0       0
gene5 0       0       0       0       0       0       0       0       0       0
gene6 10      10      9       9       9       9       9       9       9       9
gene7 8       8       8       8       8       8       8       8       8       8
gene8 0       0       0       0       1       1       1       0       0       0
gene9 0       0       0       0       0       1       1       1       1       1

I would like to remove lines like “gene2”, “gene3”, “gene4”, “gene5”, and “gene8” because they have zero values in 7 or more columns out of 10. My real-life file is too big to process in R, so I’m trying to use "awk", but I’m getting stuck. Any help would be much appreciated.

Ramon

filter out awk zero values • 1.5k views

ADD COMMENT • link updated 5.4 years ago by shenwei356 8.7k • written 5.4 years ago by rgescudero ▴ 30

To make sure that a 0 in the gene name is not counted, I added a gene10 entry by copying the gene9 values and renaming gene9 to gene10.

$ awk 'NR==1 {print}; NR !=1 {print $1, $1="", $0, gsub(/0/, "")}' file.txt | awk -v OFS="\t" 'NR==1 {print};NR!=1 && $NF<7 {NF--;print}'

id  sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
gene1   1   2   3   4   5   6   0   0   0   0
gene6   10  10  9   9   9   9   9   9   9   9
gene7   8   8   8   8   8   8   8   8   8   8
gene9   0   0   0   0   0   1   1   1   1   1
gene10  0   0   0   0   0   1   1   1   1   1

ADD REPLY • link 5.4 years ago by cpad0112 21k
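
One thing that may be worth checking with a `gsub(/0/, "")` count: it tallies every `0` character left on the line, so values such as 10 add to the total as well. The sketch below compares that character count with a per-field count on a made-up row (`geneX` is hypothetical and not part of the original data).

```shell
# Hypothetical row: seven columns hold 10, only three columns are actually zero.
row='geneX 10 10 10 10 10 10 10 0 0 0'

# Character-based count: blank the first field, then count every "0" character.
char_count=$(printf '%s\n' "$row" | awk '{ $1 = ""; print gsub(/0/, "") }')

# Field-based count: compare each value column numerically with zero.
field_count=$(printf '%s\n' "$row" | awk '{ n = 0; for (i = 2; i <= NF; i++) if ($i == 0) n++; print n }')

echo "zero characters: $char_count, zero-valued columns: $field_count"
```

For this row the character count is 10 while only 3 columns are zero, so a `< 7` cutoff on the character count would drop a row that has zeros in just 30% of its columns. On the example data in the question both counts lead to the same set of kept rows.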

score 1 · Accepted Answer · 2020-02-17

5.4 years ago

Pierre Lindenbaum 166k

Not tested.

awk '{N=0.0;for(i=2;i<=NF;i++) if($i=="0") N++; if(N/(NF-1)<0.7) print;}' input.txt

ADD COMMENT • link 5.4 years ago by Pierre Lindenbaum 166k
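
A commented variant of the same idea, written as a sketch under a few assumptions: the first line is a header, the first column is the gene ID, and all other columns are numeric. The function name `filter_zero_rows` and the `pct` variable are illustrative names, not part of the original answer. The comparison uses integers (`zeros * 100 < pct * columns`) so that a row sitting exactly at the cutoff, such as gene8 with 7 zeros out of 10, is handled without relying on floating-point equality.

```shell
# filter_zero_rows FILE [PCT]
# Print the header plus every row whose share of zero-valued columns
# is below PCT percent. (Illustrative helper; PCT defaults to 70.)
filter_zero_rows() {
    awk -v pct="${2:-70}" '
        NR == 1 { print; next }            # pass the header through unchanged
        {
            zeros = 0
            for (i = 2; i <= NF; i++)      # skip column 1 (the gene ID)
                if ($i == 0) zeros++       # numeric test, so 0 and 0.0 both count
            if (zeros * 100 < pct * (NF - 1))
                print
        }
    ' "$1"
}

# Recreate the example file from the question and run the filter on it.
cat > test_awk.txt <<'EOF'
id sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
gene1 1 2 3 4 5 6 0 0 0 0
gene2 0 0 0 0 0 0 0 0 0 0
gene3 0 0 0 0 1 0 0 0 0 0
gene4 0 0 0 10 0 10 0 0 0 0
gene5 0 0 0 0 0 0 0 0 0 0
gene6 10 10 9 9 9 9 9 9 9 9
gene7 8 8 8 8 8 8 8 8 8 8
gene8 0 0 0 0 1 1 1 0 0 0
gene9 0 0 0 0 0 1 1 1 1 1
EOF

filter_zero_rows test_awk.txt
```

On the example file this keeps the header and gene1, gene6, gene7 and gene9. Since awk reads the input one line at a time, memory use should not grow with the size of the file.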

Thanks a lot Pierre Lindenbaum: it is working. Great!!! Many thanks!!!

ADD REPLY • link 5.4 years ago by rgescudero ▴ 30

Please mark the answer as accepted (green tick on the left).

ADD REPLY • link 5.4 years ago by Pierre Lindenbaum 166k