Count occurrences per group, including groups with zero occurrences
7.7 years ago
waqasnayab ▴ 250

Hi,

I have a file (file1.txt):

AHR_si  liver
AHR_si  liver
AHR_si  liver
AHR_si  large_intestine
AHR_si  liver
AHR_si  large_intestine
AHR_si  liver
AHR_si  skin
AHR_si  liver
AHR_si  pancreas

then the file continues.......

I want to count how many times cervix appears for each value in column 1 (cervix is not shown above because that is only the output of head -n10 on file1.txt). So:

grep cervix file1.txt | uniq -c

  1 AIRE_f2 cervix
  1 ARI3A_do    cervix
  1 FOSB_f1 cervix
  1 FOXQ1_f1    cervix
  1 HEN1_si cervix
  1 HNF4G_f1    cervix
  1 JUNB_f1 cervix
  1 NFAC1_do    cervix
  1 NR2F6_f1    cervix
  1 PTF1A_f1    cervix
  1 ZN350_f1    cervix

The above is the complete output. As you can see, there is not a single occurrence of AHR_si with cervix. But I still want output like this:

  0 AHR_si   cervix
  1 AIRE_f2 cervix
  1 ARI3A_do    cervix
  1 FOSB_f1 cervix
  1 FOXQ1_f1    cervix
  1 HEN1_si cervix
  1 HNF4G_f1    cervix
  1 JUNB_f1 cervix
  1 NFAC1_do    cervix
  1 NR2F6_f1    cervix
  1 PTF1A_f1    cervix
  1 ZN350_f1    cervix

There are many other column 1 values for which cervix did not match. So I want those lines in my output as well, with a count of zero.

Thanks in advance,

Waqas.

genome next-gen sequence • 1.2k views

OK, so what have you tried other than command line tools, which won't do what you want?


I tried only the command line.


Use R, in particular the dplyr example that Giovanni posted.


Never do a uniq without a sort first. grep | sort | uniq
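To see why the sort matters: uniq only collapses duplicate lines that are adjacent. A small illustration with made-up rows (not taken from file1.txt):

```shell
# Three hypothetical rows; the two "AHR_si liver" lines are NOT adjacent.
lines=$(printf '%s\n' 'AHR_si liver' 'AHR_si skin' 'AHR_si liver')

# Without sort, uniq sees no adjacent duplicates and keeps all three lines.
unsorted=$(printf '%s\n' "$lines" | uniq -c | wc -l | tr -d ' ')

# With sort, the duplicates become adjacent and are collapsed into one line.
sorted=$(printf '%s\n' "$lines" | sort | uniq -c | wc -l | tr -d ' ')

echo "lines after uniq -c: unsorted=$unsorted sorted=$sorted"
```

With the rows above, the unsorted pipeline reports 3 lines and the sorted one reports 2.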

7.7 years ago

In bash you would have to use gawk, setting up an array to count the occurrences of each sample name.
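The array-counting idea could be sketched roughly as follows. This is an illustration under assumptions, not the answerer's own code: the sample rows and the temporary file are made up, the input is assumed to be two whitespace-separated columns like file1.txt, and only POSIX awk features are used, so gawk should run it unchanged.

```shell
# Hypothetical sample in the same two-column layout as file1.txt.
sample=$(mktemp)
printf '%s\t%s\n' \
  AHR_si liver \
  AHR_si skin \
  AIRE_f2 cervix \
  FOSB_f1 liver \
  FOSB_f1 cervix \
  FOSB_f1 cervix > "$sample"

# Register every column-1 value with a count of 0, add 1 for each row
# whose column 2 is exactly the requested tissue, and print all keys at
# the end. awk's "for (k in n)" order is unspecified, so sort by name.
counts=$(awk -v tissue=cervix '
  { if (!($1 in n)) n[$1] = 0 }
  $2 == tissue { n[$1]++ }
  END { for (k in n) print n[k], k, tissue }
' "$sample" | LC_ALL=C sort -k2,2)

printf '%s\n' "$counts"
rm -f "$sample"
```

On real data you would pass file1.txt in place of "$sample". Note that the comparison is an exact match on column 2, whereas grep cervix would also match any line that merely contains that string.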

But for complicated tasks I would prefer to use R:

> install.packages("dplyr")
> library(dplyr)
> f = read.table("file1.txt")
> f
         V1              V2
1    AHR_si           liver
2    AHR_si           liver
3    AHR_si           liver
4    AHR_si large_intestine
5    AHR_si           liver
6    AHR_si large_intestine
7    AHR_si           liver
8    AHR_si            skin
9    AHR_si           liver
10   AHR_si        pancreas
11  AIRE_f2          cervix
12 ARI3A_do          cervix
13  FOSB_f1          cervix
14 FOXQ1_f1          cervix
15  HEN1_si          cervix
16 HNF4G_f1          cervix
17  JUNB_f1          cervix
18 NFAC1_do          cervix
19 NR2F6_f1          cervix
20 PTF1A_f1          cervix
21 ZN350_f1          cervix


> f %>% group_by(V1) %>% summarise(n.cervix=sum(grepl('cervix', V2)), n.liver=sum(grepl('liver', V2)))
# A tibble: 12 x 3
         V1 n.cervix n.liver
     <fctr>    <int>   <int>
1    AHR_si        0       6
2   AIRE_f2        1       0
3  ARI3A_do        1       0
4   FOSB_f1        1       0
5  FOXQ1_f1        1       0
6   HEN1_si        1       0
7  HNF4G_f1        1       0
8   JUNB_f1        1       0
9  NFAC1_do        1       0
10 NR2F6_f1        1       0
11 PTF1A_f1        1       0
12 ZN350_f1        1       0
7.7 years ago
waqasnayab ▴ 250

Thanks, that's great, it's more what I wanted!
