Question: Filtering out lines (rows) based on repeated zero values with awk in Linux
12 months ago, rgescudero (Spain) wrote:

I want to filter out lines that have zero values in 70% or more of the columns. Imagine I have the following “test_awk.txt” file:

id sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
gene1 1       2       3       4       5       6       0       0       0       0
gene2 0       0       0       0       0       0       0       0       0       0
gene3 0       0       0       0       1       0       0       0       0       0
gene4 0       0       0       10      0       10      0       0       0       0
gene5 0       0       0       0       0       0       0       0       0       0
gene6 10      10      9       9       9       9       9       9       9       9
gene7 8       8       8       8       8       8       8       8       8       8
gene8 0       0       0       0       1       1       1       0       0       0
gene9 0       0       0       0       0       1       1       1       1       1

I would like to remove lines like “gene2”, “gene3”, “gene4”, “gene5”, and “gene8” because they have zero values in 7 or more columns out of 10. My real-life file is too big to process in R, so I’m trying to use "awk", but I’m getting stuck. Any help would be much appreciated.

Ramon

zero values awk filter out • 236 views
modified 12 months ago by shenwei356 • written 12 months ago by rgescudero

To make sure that a 0 in a gene name is not counted, I added a gene10 entry by copying the gene9 row and changing the name from gene9 to gene10.

$ awk 'NR==1 {print}; NR !=1 {print $1, $1="", $0, gsub(/0/, "")}' file.txt | awk -v OFS="\t" 'NR==1 {print};NR!=1 && $NF<7 {NF--;print}'

id  sample1 sample2 sample3 sample4 sample5 sample6 sample7 sample8 sample9 sample10
gene1   1   2   3   4   5   6   0   0   0   0
gene6   10  10  9   9   9   9   9   9   9   9
gene7   8   8   8   8   8   8   8   8   8   8
gene9   0   0   0   0   0   1   1   1   1   1
gene10  0   0   0   0   0   1   1   1   1   1
modified 12 months ago • written 12 months ago by cpad0112
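
One caveat with the character-counting approach above: gsub(/0/, "") returns the number of "0" characters it replaced, so zeros inside values such as 10 or 100 are counted as well. With the sample data this does not change the result, but it could with real count data. A minimal sketch of the difference, using a made-up row (geneX and its values are hypothetical, not from the thread):

```shell
# Hypothetical row: no column is zero, but seven values contain the digit 0.
row='geneX 10 20 30 40 50 60 70 8 9 1'

# Character count: gsub returns how many "0" characters it replaced.
char_zeros=$(printf '%s\n' "$row" | awk '{ $1 = ""; print gsub(/0/, "") }')

# Field count: only columns whose whole value is "0" are counted.
field_zeros=$(printf '%s\n' "$row" | awk '{ n = 0; for (i = 2; i <= NF; i++) if ($i == "0") n++; print n }')

echo "characters: $char_zeros, fields: $field_zeros"
```

Here the character count is 7 while the field count is 0, so a test like $NF<7 on the character count would drop a row that has no zero-valued columns at all.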
12 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Not tested.

awk '{N=0.0;for(i=2;i<=NF;i++) if($i=="0") N++; if(N/(NF-1)<0.7) print;}' input.txt
modified 12 months ago • written 12 months ago by Pierre Lindenbaum
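
For anyone who wants to adapt the one-liner above, here is a commented sketch of the same idea. The filter_zero_rows function name, the max_zero_frac variable, and the explicit header and short-line handling are my own additions for illustration, not part of the original answer. It assumes whitespace-separated columns, a header on line 1, and the row ID in column 1.

```shell
# Sketch: keep rows whose fraction of zero-valued columns is below a threshold.
# Assumptions (hypothetical, not from the original answer): the function name,
# the max_zero_frac variable, and the header / short-line handling.
# Reads the table on stdin; the optional first argument is the threshold (default 0.7).
filter_zero_rows() {
    awk -v max_zero_frac="${1:-0.7}" '
        NR == 1 { print; next }        # keep the header line unchanged
        NF < 2  { next }               # skip blank lines and lines with no data columns
        {
            zeros = 0
            for (i = 2; i <= NF; i++)  # column 1 is the row ID, so start at 2
                if ($i == "0") zeros++ # count columns whose whole value is 0
            if (zeros / (NF - 1) < max_zero_frac) print
        }
    '
}
```

For example, filter_zero_rows 0.7 < test_awk.txt > filtered.txt should keep the header plus gene1, gene6, gene7, and gene9 from the sample table. gene8 has exactly 7 zeros out of 10, so its fraction is not below 0.7 and it is dropped, as requested in the question.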

Thanks a lot Pierre Lindenbaum: it is working. Great!!! Many thanks!!!

written 12 months ago by rgescudero

Please mark the answer as accepted (green tick on the left).

written 12 months ago by Pierre Lindenbaum