Question: How To Remove Rows In A Matrix Having 0 As Value In Columns In Both Conditions
1
gravatar for vchris_ngs
3.9 years ago by
vchris_ngs4.2k
Milan, Italy
vchris_ngs4.2k wrote:

I have few queries. I am having a matrix with some counts. there are 2 conditions for each. So I have a matrix with first column as genes and 12 columns with counts. After first gene column next 6 columns are one condition and the second 6 columns are another condition. I want to remove the rows where each condition is having one 0 column. So for a rows if in both the conditions have one single 0 (which means each condition is having a 0 value) then that row is removed. How can we do that. I was doing like this earlier but we are left with some rows.

data1<- read.table("~/Desktop/Bonn_New_Pas_algo_data/results_30092013/edgeR_24122013/PGRTvsPDGRT/PGRTvsPDGRT_frags.txt",header=TRUE,stringsAsFactors=FALSE)
mat1<- as.matrix(data1[,-1])
row.names(mat1)<- data1[,1]
test <- apply(mat1, 1,function(x) all(x[1:6]==0) | all(x[7:12]==0) )
test1 <- mat1[!test,]

But this does not seem correct as I am having some rows like :

XLOC_004043    251    233    116    61    79    73    163    287    120    103    0    60
XLOC_004046    75    0    1    1    1    53    1    1    1    0    33    1
XLOC_004048    26.0133    0    0    0    0    232.296    2.676    8.21482    2.61507    0    0    0
XLOC_004050    0    1    4    36    20    0    0    2    1    0    1    9

output

XLOC_004043    251    233    116    61    79    73    163    287    120    103    0    60

The script i wrote is not working , can anyone suggest how to do it?

R • 11k views
ADD COMMENTlink modified 3.9 years ago by Devon Ryan73k • written 3.9 years ago by vchris_ngs4.2k
0
gravatar for Pavel Senin
3.9 years ago by
Pavel Senin1.8k
Los Alamos, NM
Pavel Senin1.8k wrote:

The problem could be due to the numeric format conversion, this works for me:

data = read.table("data.txt",header=F)

names(data) = c ("gene", paste("cond_1_",1:6,sep=""), paste("cond_2_",1:6,sep=""))

keeps <- apply(data, 1, function(x) { ! ( any(as.numeric(x[2:7])==0) || any(as.numeric(x[8:13])==0) ) } )

data[keeps, ]

 [1] gene     cond_1_1 cond_1_2 cond_1_3 cond_1_4 cond_1_5 cond_1_6 cond_2_1 cond_2_2
[10] cond_2_3 cond_2_4 cond_2_5 cond_2_6
<0 lignes> (ou 'row.names' de longueur nulle)
ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Pavel Senin1.8k
0
gravatar for Devon Ryan
3.9 years ago by
Devon Ryan73k
Freiburg, Germany
Devon Ryan73k wrote:

Just change this line:

test <- apply(mat1, 1,function(x) all(x[1:6]==0) | all(x[7:12]==0) )

to

test <- apply(mat1, 1,function(x) any(x==0) )

to exclude any row with a single 0, or

test <- apply(mat1, 1,function(x) any(x[1:6]==0) && any(x[7:12]==0) )

to exclude rows where each group has at least one 0 (it's somewhat ambiguous which behavior you want).

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Devon Ryan73k

that's right, any(x==0) will work

ADD REPLYlink written 3.9 years ago by Pavel Senin1.8k

yes actually the post I did is really ambiguous, as I have two conditions and these are fragment counts, so if there is 0 fragment count in both conditions, it will be not worth keeping those genes for differential expression analysis with edgeR. Since the corresponding fpkm will be too low to be detected by qPCR. So for edgeR which first calculates the dispersion among the replicates within one condition and then computes the differential expression with the other other condition, these genes with 0 fragment count will not be of any use. So now I have actually removed all the rows with even one 0 value in a column as it does not make sense to compute anything with a 0 fragment count even if its one column in the entire row. So I removed all rows having single 0 values. with the below code.

mat1<- as.matrix(data1[,-1])

row.names(mat1)<- data1[,1]

row_sub = apply(mat1, 1, function(row) all(row !=0 ))

res1=mat1[row_sub,]

m1<- na.omit(res1)
ADD REPLYlink modified 3.9 years ago by Devon Ryan73k • written 3.9 years ago by vchris_ngs4.2k

FYI, none of those values should be used in edgeR. Don't ever use the output of cufflinks in edgeR (or DESeq, for that matter).

ADD REPLYlink written 3.9 years ago by Devon Ryan73k

A second FYI, it's not an a priori good idea to remove columns due to there being a single 0 count. It's better to do proper independent filtering based on either summed or average counts (otherwise, you're already performing a test by excluding based on variance).

ADD REPLYlink written 3.9 years ago by Devon Ryan73k

Then how do you generate a matrix for all your samples with fragment count? I am actually using geometric normalization in cuffdiff and running all samples together , so am getting the genes_read_group_tracking matrix from cuffdiff for all samples. Then am using an in house python script to select only the fragment count for each samples from there to create a matrix with genes and its corresponding fragment count for each samples. From there am preparing individual comparisons to which I am subjecting to edgeR for finding the differentially expressed genes. How else can you generate the fragment count list from your RNA-Seq data?

ADD REPLYlink written 3.9 years ago by vchris_ngs4.2k

HTSeq-count or featureCounts would be the two standard ways. Data from cufflinks is fine to use only in cuffdiff (or similar packages intended for estimated count data). It should never be used in any of the raw-count based packages as those counts violate the statistical premise of the programs (garbage in, garbage out).

If this is a non-model organism or you just want to do a bit of de-novo feature discovery on a model organism, then take the merged GTF file produced by cuffmerge and do the counting based on that. There's no need to filter be counts before hand, just use the genefilter package in R (the newer versions of DESeq2 do this automagically).

ADD REPLYlink written 3.9 years ago by Devon Ryan73k

"It's better to do proper independent filtering based on either summed or average counts (otherwise, you're already performing a test by excluding based on variance)." @dyryan79 - I would like to ask you on what basis you select the filtering based on fragment counts? I am thinking of selecting per condition the samples included and then compute the standard deviation and then include a threshold of the standard deviation for each gene. Is there any cut-offs you select based on mean or its arbitrary standards? I would like some suggestions from your end.

ADD REPLYlink written 3.9 years ago by vchris_ngs4.2k

You really really should not filter based on standard deviation, that's statistically indefensible. Have a read through the following paper from Wolfgang Huber's group, which lays down the statistical framework and introduces the genefilter package. In short, you can maximize power for detecting DE genes by filtering based on summed or average counts across samples without introducing bias.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Devon Ryan73k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1371 users visited in the last hour