Filtering out lines (rows) based on repeated zero values with awk in Linux
4.3 years ago
rgescudero ▴ 30

I want to filter out lines that have zero values in 70% or more of the columns. Imagine I have the following “test_awk.txt” file:

id sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
gene1 1       2       3       4       5       6       0       0       0       0
gene2 0       0       0       0       0       0       0       0       0       0
gene3 0       0       0       0       1       0       0       0       0       0
gene4 0       0       0       10      0       10      0       0       0       0
gene5 0       0       0       0       0       0       0       0       0       0
gene6 10      10      9       9       9       9       9       9       9       9
gene7 8       8       8       8       8       8       8       8       8       8
gene8 0       0       0       0       1       1       1       0       0       0
gene9 0       0       0       0       0       1       1       1       1       1

I would like to remove lines like “gene2”, “gene3”, “gene4”, “gene5”, and “gene8” because they have zero values in 7 or more columns out of 10. My real-life file is too big to process in R, so I’m trying to use "awk", but I’m getting stuck. Any help would be much appreciated.

Ramon

filter out awk zero values • 1.1k views

To make sure that a 0 in a gene name is not counted, I added a gene10 entry by copying the gene9 values and renaming the gene to gene10.

$ awk 'NR==1 {print}; NR !=1 {print $1, $1="", $0, gsub(/0/, "")}' file.txt | awk -v OFS="\t" 'NR==1 {print};NR!=1 && $NF<7 {NF--;print}'

id  sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
gene1   1   2   3   4   5   6   0   0   0   0
gene6   10  10  9   9   9   9   9   9   9   9
gene7   8   8   8   8   8   8   8   8   8   8
gene9   0   0   0   0   0   1   1   1   1   1
gene10  0   0   0   0   0   1   1   1   1   1
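
One caveat may be worth checking before using the command above on real data: `gsub(/0/, "")` returns the number of `0` characters it replaced, not the number of zero-valued columns, so values such as 10 or 100 also add to the count. The sketch below illustrates the difference on a made-up record (`geneX` and its values are hypothetical, not taken from the question).

```shell
# Hypothetical one-line record: a gene id followed by three sample values.
line='geneX 10 100 5'

# Character count: gsub() returns how many "0" characters it replaced
# after the id column has been blanked out.
char_zeros=$(printf '%s\n' "$line" | awk '{ $1 = ""; print gsub(/0/, "") }')

# Field count: only a sample value that is numerically zero is counted.
field_zeros=$(printf '%s\n' "$line" |
    awk '{ n = 0; for (i = 2; i <= NF; i++) if ($i == 0) n++; print n }')

echo "characters: $char_zeros, fields: $field_zeros"
# prints: characters: 3, fields: 0
```

In the sample table this does not change the result, because the only multi-digit values are in gene6, which stays well below the cutoff either way.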
4.3 years ago

Not tested:

awk '{N=0.0;for(i=2;i<=NF;i++) if($i=="0") N++; if(N/(NF-1)<0.7) print;}' input.txt
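
For anyone who wants to adapt the one-liner, here is a minimal commented sketch of the same idea. The function name `filter_zero_rows` and the `max` variable are hypothetical, introduced only for illustration. It also makes three assumptions that go beyond the original command: the first line is a header and is always kept, zeros are tested numerically (so `0.0` would count as zero too), and lines without any sample columns are skipped to avoid dividing by zero.

```shell
# Hypothetical helper: keep rows whose fraction of zero-valued samples
# is below a cutoff.
#   $1 = input file (whitespace-separated, first column is the gene id)
#   $2 = cutoff fraction, defaults to 0.7
filter_zero_rows() {
    awk -v max="${2:-0.7}" '
        NR == 1 { print; next }          # assumption: line 1 is the header
        NF < 2  { next }                 # no sample columns, nothing to test
        {
            zeros = 0
            for (i = 2; i <= NF; i++)    # column 1 holds the gene id
                if ($i == 0) zeros++     # numeric test, so "10" is not a zero
            if (zeros / (NF - 1) < max)  # keep rows below the cutoff
                print
        }
    ' "$1"
}
```

With the sample table from the question, `filter_zero_rows test_awk.txt > filtered.txt` should keep the header plus gene1, gene6, gene7, and gene9. Because awk reads the file one line at a time, memory use does not grow with file size, which is what makes this approach practical where loading the whole table in R is not.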

Thanks a lot Pierre Lindenbaum: it is working. Great!!! Many thanks!!!


Please mark the answer as accepted (green tick on the left).
