Question

Count occurrences per name, including names with zero occurrences

1

7.7 years ago

waqasnayab ▴ 250

Hi,

I have a file (file1.txt):

AHR_si  liver
AHR_si  liver
AHR_si  liver
AHR_si  large_intestine
AHR_si  liver
AHR_si  large_intestine
AHR_si  liver
AHR_si  skin
AHR_si  liver
AHR_si  pancreas

then the file continues.......

I want to count how many times cervix appears for each value in column 1 (cervix is not shown above because the listing is only the head -n10 of file1.txt). So I ran:

grep cervix file1.txt | uniq -c

  1 AIRE_f2 cervix
  1 ARI3A_do    cervix
  1 FOSB_f1 cervix
  1 FOXQ1_f1    cervix
  1 HEN1_si cervix
  1 HNF4G_f1    cervix
  1 JUNB_f1 cervix
  1 NFAC1_do    cervix
  1 NR2F6_f1    cervix
  1 PTF1A_f1    cervix
  1 ZN350_f1    cervix

The above is the complete output. As you can see, there is not a single occurrence of AHR_si with cervix. But I still want the output to look like this:

  0 AHR_si   cervix
  1 AIRE_f2 cervix
  1 ARI3A_do    cervix
  1 FOSB_f1 cervix
  1 FOXQ1_f1    cervix
  1 HEN1_si cervix
  1 HNF4G_f1    cervix
  1 JUNB_f1 cervix
  1 NFAC1_do    cervix
  1 NR2F6_f1    cervix
  1 PTF1A_f1    cervix
  1 ZN350_f1    cervix

There are many other column 1 values that never occur with cervix. I want those lines in my output as well, with a count of zero.

Thanks in advance,

Waqas.

genome next-gen sequence • 1.2k views

ADD COMMENT • link 7.7 years ago by waqasnayab ▴ 250

1

OK, so what have you tried other than command line tools, which won't do what you want?

ADD REPLY • link 7.7 years ago by Devon Ryan 104k

0

I tried only the command line.

ADD REPLY • link 7.7 years ago by waqasnayab ▴ 250

0

Use R, in particular the dplyr example that Giovanni posted.

ADD REPLY • link 7.7 years ago by Devon Ryan 104k

1

Never do a uniq without a sort first. grep | sort | uniq

ADD REPLY • link 7.7 years ago by Giovanni M Dall'Olio 28k
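To illustrate the point about sorting: uniq only merges adjacent identical lines, so unsorted input can report the same value more than once. A small sketch with made-up sample data (the three tissue names below are an assumption for illustration, not taken from file1.txt):

```shell
# Made-up sample: "liver" appears twice, but not on adjacent lines.
sample='liver
skin
liver'

# Without sort, uniq -c sees three separate runs.
unsorted=$(printf '%s\n' "$sample" | uniq -c | wc -l | tr -d ' ')

# With sort, the two "liver" lines become adjacent and are merged.
sorted=$(printf '%s\n' "$sample" | sort | uniq -c | wc -l | tr -d ' ')

echo "without sort: $unsorted lines; with sort: $sorted lines"
```

Note that sorting alone still will not produce the zero counts asked for in the question, since grep has already dropped the names that never occur with cervix.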

0

7.7 years ago

waqasnayab ▴ 250

Thanks, that's great, it's more what I wanted!

ADD COMMENT • link 7.7 years ago by waqasnayab ▴ 250

score 5 · Accepted Answer · 2016-08-09

In bash you would have to use gawk, with an array that counts the occurrences of each sample name.
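A minimal sketch of that array approach, assuming file1.txt has two whitespace-separated columns (name, tissue). The function name count_tissue is hypothetical, and the awk program uses only standard awk features, so it should also run under awks other than gawk:

```shell
# Hypothetical helper: count rows of one tissue per column-1 name,
# printing 0 for names that never occur with that tissue.
count_tissue() {
    tissue="$1"
    file="$2"
    awk -v tissue="$tissue" '
        NF < 2       { next }        # skip blank or incomplete lines
        !($1 in n)   { n[$1] = 0 }   # register every name, starting at zero
        $2 == tissue { n[$1]++ }     # exact match on the tissue column
        END {
            for (name in n)
                printf "%d\t%s\t%s\n", n[name], name, tissue
        }
    ' "$file" | LC_ALL=C sort -k2,2  # awk array order is unspecified, so sort by name
}

# Usage: count_tissue cervix file1.txt
```

Unlike grep, this compares the whole second column, so a tissue name that merely contains "cervix" as a substring would not be counted.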

But for complicated tasks I would prefer to use R:

> install.packages("dplyr")
> library(dplyr)
> f = read.table("file1.txt")
> f
         V1              V2
1    AHR_si           liver
2    AHR_si           liver
3    AHR_si           liver
4    AHR_si large_intestine
5    AHR_si           liver
6    AHR_si large_intestine
7    AHR_si           liver
8    AHR_si            skin
9    AHR_si           liver
10   AHR_si        pancreas
11  AIRE_f2          cervix
12 ARI3A_do          cervix
13  FOSB_f1          cervix
14 FOXQ1_f1          cervix
15  HEN1_si          cervix
16 HNF4G_f1          cervix
17  JUNB_f1          cervix
18 NFAC1_do          cervix
19 NR2F6_f1          cervix
20 PTF1A_f1          cervix
21 ZN350_f1          cervix


> f %>% group_by(V1) %>% summarise(n.cervix=sum(grepl('cervix', V2)), n.liver=sum(grepl('liver', V2)))
# A tibble: 12 x 3
         V1 n.cervix n.liver
     <fctr>    <int>   <int>
1    AHR_si        0       6
2   AIRE_f2        1       0
3  ARI3A_do        1       0
4   FOSB_f1        1       0
5  FOXQ1_f1        1       0
6   HEN1_si        1       0
7  HNF4G_f1        1       0
8   JUNB_f1        1       0
9  NFAC1_do        1       0
10 NR2F6_f1        1       0
11 PTF1A_f1        1       0
12 ZN350_f1        1       0