Question: filter out multiple rows based on values in columns - perl and R attempt
2
gravatar for Illinu
4.6 years ago by
Illinu90
Belgium
Illinu90 wrote:

Similar to this post I want to filter out all the rows that contain zero value at all columns. I have a file with transcript counts for each sample+replicate and it turns out that some transcripts have 0 counts for all samples and replicates, and in other cases only one sample does not have zero counts but all the rest of the samples do, so what I want to do is to filter out:

1) all transcripts where there is zero counts for all samples and replicates

2) all transcripts where there is zero counts for all samples except one (e.g., A and B but not C, A and C but not B, B and C but not A)

For example, input:

  A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
s1 0 6 5 3 0 9
s2 66 0 5 32 8 0
s3 0 0 0 0 0 0
s4 8 22 0 4 5 5

Output of task 1):

  A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
s1 0 6 5 3 0 9
s2 66 0 5 32 8 0
s4 8 22 0 4 5 5

I've been trying in a number of ways to automate the process instead of doing it manually in Excel buy filtering. So my first attempt was in R. For the first task it works well but then when I need to parse the file to process the other tasks it doesn't work.

data=read.table('genes.counts.matrix', header=T)
set1 <- as.matrix(data[,-1])
row.names(set1)<- data[,1]
all <- apply(set1, 1, function(x) all(x[1:16]==0))
newdata <- set1[!all,]
write.table(newdata, "genes.counts.matrix.modified", sep="\t")

#also my problem here is that the output places the headers from column1 but the headers should go on top of the counts and not start at the transcript column. It looks like this

A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2  
s1 0 6 5 3 0 9

Then I tried with a oneliner perl but it is not working

perl -a -nle 'print if "$F[1-16] != 0" ' genes.counts.matrix > genes.counts.matrix.modified

or

perl -a -nle 'print if "$F[1]:$F[16] != 0" ' genes.counts.matrix > genes.counts.matrix.modified

My idea is to filter out first when all rows are equal to zero, next when rows from 1:12 are equal to zero, next when rows 1:4 and 9:16 are equal to zero, next when rows 1:8 and 12:16 are equal to zero and finally when rows 5:16 are equal to zero

This was my attempt in R and it didn't work

> all <- apply(set1, 1, function(x) (all(x[1:16]==0) | all(x[4:16]==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[12:16]==0))))
> newdata <- set1[!all,]

 

Linu

ADD COMMENTlink modified 3.3 years ago by Biostar ♦♦ 20 • written 4.6 years ago by Illinu90
3
gravatar for komal.rathi
4.6 years ago by
komal.rathi3.4k
Children's Hospital of Philadelphia, Philadelphia, PA
komal.rathi3.4k wrote:

I am surprised the last line of code did not work for you, I have removed a pair of braces & followed what you want, apart from that I changed nothing and it is working on my sample data:

"# My idea is to filter out first when all rows are equal to zero, 
# next when rows from 1:12 are equal to zero, 
# next when rows 1:4 and 9:16 are equal to zero, 
# next when rows 1:8 and 12:16 are equal to zero and 
# finally when rows 5:16 are equal to zero"

all <- apply(set1, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[12:16]==0)) | all(x[5:16]==0))
newdata <- set1[!all,]

 

As for the headers, write out the output like this, so that when you open the output in any other application, the headers are not messed up:

newdata = cbind(samplenames = rownames(newdata), newdata)
write.table(newdata,'results.txt',quote = F,sep='\t',row.names = F,col.names = T)

 

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by komal.rathi3.4k

Yes!! your bracket changes worked!! I still get the headers moved to the left but I am more than satisfied with having the function to work. Thanks a million!!

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Illinu90

Illinu I have edited my answer to account for the problem with the header.

ADD REPLYlink written 4.6 years ago by komal.rathi3.4k

This is not working, I have two outcomes:

#This one applies the functions but messes up the headers

> data=read.table('genes.counts.matrix', header=T, row.names=1)
> all <- apply(data, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[13:16]==0)) | all(x[5:16]==0))
> newdata <- data[!all,]
> write.table(newdata, "genes.counts.matrix.modifiedWithR", sep="\t")

 

# This one returns the file with the header in place but it does not apply the function

> data=read.table('genes.counts.matrix', header=T, row.names=1)
> set1 <- as.matrix(data[,-1])
> set1 = cbind(transcripts = rownames(set1), set1)
> all <- apply(set1, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[13:16]==0)) | all(x[5:16]==0))
> newdata <- set1[!all,]
> write.table(newdata, "genes.counts.matrix.modifiedWithR",quote = F,sep='\t',row.names = F,col.names = T)
ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Illinu90

Umm, of course it will not apply the function! the order of r commands is incorrect. Apply the function first, then use cbind() and then write it! My bad, I used 'set1' as the name of file to write out instead of 'newdata'.

I have made that change. Just follow the order like I have shown.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by komal.rathi3.4k

Excellent, that worked!! thanks a million

ADD REPLYlink written 4.5 years ago by Illinu90
1
gravatar for Giovanni M Dall'Olio
4.6 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

A trick in R is to convert any dataframe to a "long" or tidy format:

> library(tidyr)
> library(dplyr)
> biost
  individual A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
1         s1      0      6      5      3      0      9
2         s2     66      0      5     32      8      0
3         s3      0      0      0      0      0      0
4         s4      8     22      0      4      5      5
> biost.tidy = biost %>% 
    gather(rep, value, -individual) %>% 
    separate(rep, into=c('sample', 'replica'), sep='_')

> biost.tidy
   individual sample replica value
1          s1      A    rep1     0
2          s2      A    rep1    66
3          s3      A    rep1     0
4          s4      A    rep1     8
5          s1      A    rep2     6
6          s2      A    rep2     0
7          s3      A    rep2     0
8          s4      A    rep2    22
9          s1      B    rep1     5
10         s2      B    rep1     5
11         s3      B    rep1     0
12         s4      B    rep1     0
13         s1      B    rep2     3
14         s2      B    rep2    32
15         s3      B    rep2     0
16         s4      B    rep2     4
17         s1      C    rep1     0
18         s2      C    rep1     8
19         s3      C    rep1     0
20         s4      C    rep1     5
21         s1      C    rep2     9
22         s2      C    rep2     0
23         s3      C    rep2     0
24         s4      C    rep2     5

 

Here biost.tidy is your data frame in a "tidy" format, in which each combination of individual, sample and replica is on a separate line. This format will make it much easier to plot the data frame with ggplot, or to calculate means by individual or sample. See http://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html for more info.

For example you can calculate the number of not-null samples for each individual:

> biost.count = biost.tidy %>% 
    group_by(individual) %>% 
    summarise(
       tot_notnull = sum(value!=0)
    )

Source: local data frame [4 x 3]

  individual tot_notnull 
1         s1           4
2         s2           4
3         s3           0                                   
4         s4           5

 

This tells you that individual s3 has no value different from 0, while s1 and s2 have 4 not-null values each.

This other call will give you the number of samples in which the individual has at least one non-null score:

biost.tidy %>% group_by(individual, sample) %>% summarise(non_null=sum(value!=0)) %>% group_by(individual) %>% summarise(n=sum(non_null>0))

Source: local data frame [4 x 2]

  individual n
1         s1 3
2         s2 3
3         s3 0
4         s4 3

This tells you that for individuals s1,s2, and s3, all three samples have at least one replica > 0. Modify the last condition (sum(non_null)>1) to get the number of individuals were both replicas are non-null.

You can use either of these datasets to subset your initial data frame:

> biost.tidy %>% subset(individual %in% subset(biost.count, tot_notnull>2)$individual)
   individual sample replica value
1          s1      A    rep1     0
2          s2      A    rep1    66
4          s4      A    rep1     8
5          s1      A    rep2     6
6          s2      A    rep2     0
8          s4      A    rep2    22
9          s1      B    rep1     5
10         s2      B    rep1     5
12         s4      B    rep1     0
13         s1      B    rep2     3
14         s2      B    rep2    32
16         s4      B    rep2     4
17         s1      C    rep1     0
18         s2      C    rep1     8
20         s4      C    rep1     5
21         s1      C    rep2     9
22         s2      C    rep2     0
24         s4      C    rep2     5

Or, if you prefer, you can subset your initial data frame:

> biost %>% subset(individual %in% subset(biost.count, tot_notnull>2)$individual)
  individual A_rep1 A_rep2 B_rep1 B_rep2 C_rep1 C_rep2
1         s1      0      6      5      3      0      9
2         s2     66      0      5     32      8      0
4         s4      8     22      0      4      5      5

For the problem of reading row names, try:

read.table('yourmatrix.csv', header=T, row.names=1)
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Giovanni M Dall'Olio26k
1

I think the OP is facing a problem with the headers not while reading the output in R, but opening the output in some kind of application like Excel. 

ADD REPLYlink written 4.6 years ago by komal.rathi3.4k

This is a great piece of information for my R knowledge. Thank you for the time explaining everything.

The row.names did not fix the displacement of the headers. In summary doing the following gets the correct number of rows filtered as I wanted but I still see the headers starting from column 1 instead of 2.

> data=read.table('genes.counts.matrix', header=T, row.names=1)
> all <- apply(data, 1, function(x) all(x==0) | all(x[1:12]==0) | (all(x[1:4]==0) & all(x[9:16]==0)) | (all(x[1:8]==0) & all(x[13:16]==0)) | all(x[5:16]==0))
> newdata <- data[!all,]
> write.table(newdata, "genes.counts.matrix.modifiedWithR", sep="\t")
ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Illinu90
0
gravatar for Nicolas Rosewick
4.6 years ago by
Belgium, Brussels
Nicolas Rosewick8.1k wrote:

try this ?

Filter 1:
data.filtered <- data[rowSums(data)!=0,]

Filter 2:

data.tmp <- data

data.tmp[data.tmp>0] <- 1

data.filtered2 <- data[rowSums(data.tmp)>1,]
ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by Nicolas Rosewick8.1k

Filter 1 worked but I had to change it to:

data.filtered <- data[rowSums(set1)!=0,]

Filter 2 didn't work well for me. I followed what komal.rathi suggested and it worked.

But thanks a lot for your suggestion

 

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Illinu90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1514 users visited in the last hour