Question: How To Remove Rows In A Matrix Having 0 As Value In Columns In Both Conditions
1
ivivek_ngs4.9k wrote:

I have few queries. I am having a matrix with some counts. there are 2 conditions for each. So I have a matrix with first column as genes and 12 columns with counts. After first gene column next 6 columns are one condition and the second 6 columns are another condition. I want to remove the rows where each condition is having one 0 column. So for a rows if in both the conditions have one single 0 (which means each condition is having a 0 value) then that row is removed. How can we do that. I was doing like this earlier but we are left with some rows.

``````data1<- read.table("~/Desktop/Bonn_New_Pas_algo_data/results_30092013/edgeR_24122013/PGRTvsPDGRT/PGRTvsPDGRT_frags.txt",header=TRUE,stringsAsFactors=FALSE)
mat1<- as.matrix(data1[,-1])
row.names(mat1)<- data1[,1]
test <- apply(mat1, 1,function(x) all(x[1:6]==0) | all(x[7:12]==0) )
test1 <- mat1[!test,]
``````

But this does not seem correct as I am having some rows like :

``````XLOC_004043    251    233    116    61    79    73    163    287    120    103    0    60
XLOC_004046    75    0    1    1    1    53    1    1    1    0    33    1
XLOC_004048    26.0133    0    0    0    0    232.296    2.676    8.21482    2.61507    0    0    0
XLOC_004050    0    1    4    36    20    0    0    2    1    0    1    9
``````

output

``````XLOC_004043    251    233    116    61    79    73    163    287    120    103    0    60
``````

The script i wrote is not working , can anyone suggest how to do it?

R • 22k views
modified 7 months ago by cigdemselli10 • written 6.1 years ago by ivivek_ngs4.9k

Well, I have 20 columns with counts and two groups. I want to remove the rows where each group is having at least five 0 columns. First group 1:10 and second is 11:20. How can I use this line? Thank you.

1
cigdemselli10 wrote:

To exclude rows with zero values across all columns:

``````new_mtx <- exp_mtx [which(rowSums(exp_mtx) > 0), ]  # works much faster than below

new_mtx  <- apply (exp_mtx, 1, function(x) any(x==0) )
``````
0
Pavel Senin1.9k wrote:

The problem could be due to the numeric format conversion, this works for me:

``````data = read.table("data.txt",header=F)

names(data) = c ("gene", paste("cond_1_",1:6,sep=""), paste("cond_2_",1:6,sep=""))

keeps <- apply(data, 1, function(x) { ! ( any(as.numeric(x[2:7])==0) || any(as.numeric(x[8:13])==0) ) } )

data[keeps, ]

 gene     cond_1_1 cond_1_2 cond_1_3 cond_1_4 cond_1_5 cond_1_6 cond_2_1 cond_2_2
 cond_2_3 cond_2_4 cond_2_5 cond_2_6
<0 lignes> (ou 'row.names' de longueur nulle)
``````
0
Devon Ryan93k wrote:

Just change this line:

``````test <- apply(mat1, 1,function(x) all(x[1:6]==0) | all(x[7:12]==0) )
``````

to

``````test <- apply(mat1, 1,function(x) any(x==0) )
``````

to exclude any row with a single 0, or

``````test <- apply(mat1, 1,function(x) any(x[1:6]==0) && any(x[7:12]==0) )
``````

to exclude rows where each group has at least one 0 (it's somewhat ambiguous which behavior you want).

that's right, `any(x==0)` will work

yes actually the post I did is really ambiguous, as I have two conditions and these are fragment counts, so if there is 0 fragment count in both conditions, it will be not worth keeping those genes for differential expression analysis with edgeR. Since the corresponding fpkm will be too low to be detected by qPCR. So for edgeR which first calculates the dispersion among the replicates within one condition and then computes the differential expression with the other other condition, these genes with 0 fragment count will not be of any use. So now I have actually removed all the rows with even one 0 value in a column as it does not make sense to compute anything with a 0 fragment count even if its one column in the entire row. So I removed all rows having single 0 values. with the below code.

``````mat1<- as.matrix(data1[,-1])

row.names(mat1)<- data1[,1]

row_sub = apply(mat1, 1, function(row) all(row !=0 ))

res1=mat1[row_sub,]

m1<- na.omit(res1)
``````

FYI, none of those values should be used in edgeR. Don't ever use the output of cufflinks in edgeR (or DESeq, for that matter).

A second FYI, it's not an a priori good idea to remove columns due to there being a single 0 count. It's better to do proper independent filtering based on either summed or average counts (otherwise, you're already performing a test by excluding based on variance).

Then how do you generate a matrix for all your samples with fragment count? I am actually using geometric normalization in cuffdiff and running all samples together , so am getting the genes_read_group_tracking matrix from cuffdiff for all samples. Then am using an in house python script to select only the fragment count for each samples from there to create a matrix with genes and its corresponding fragment count for each samples. From there am preparing individual comparisons to which I am subjecting to edgeR for finding the differentially expressed genes. How else can you generate the fragment count list from your RNA-Seq data?

HTSeq-count or featureCounts would be the two standard ways. Data from cufflinks is fine to use only in cuffdiff (or similar packages intended for estimated count data). It should never be used in any of the raw-count based packages as those counts violate the statistical premise of the programs (garbage in, garbage out).

If this is a non-model organism or you just want to do a bit of de-novo feature discovery on a model organism, then take the merged GTF file produced by cuffmerge and do the counting based on that. There's no need to filter be counts before hand, just use the genefilter package in R (the newer versions of DESeq2 do this automagically).

"It's better to do proper independent filtering based on either summed or average counts (otherwise, you're already performing a test by excluding based on variance)." @dyryan79 - I would like to ask you on what basis you select the filtering based on fragment counts? I am thinking of selecting per condition the samples included and then compute the standard deviation and then include a threshold of the standard deviation for each gene. Is there any cut-offs you select based on mean or its arbitrary standards? I would like some suggestions from your end.