How To Remove Rows In A Matrix Having 0 As Value In Columns In Both Conditions
3
2
Entering edit mode
10.7 years ago
ivivek_ngs ★ 5.2k

I have few queries. I am having a matrix with some counts. there are 2 conditions for each. So I have a matrix with first column as genes and 12 columns with counts. After first gene column next 6 columns are one condition and the second 6 columns are another condition. I want to remove the rows where each condition is having one 0 column. So for a rows if in both the conditions have one single 0 (which means each condition is having a 0 value) then that row is removed. How can we do that. I was doing like this earlier but we are left with some rows.

data1<- read.table("~/Desktop/Bonn_New_Pas_algo_data/results_30092013/edgeR_24122013/PGRTvsPDGRT/PGRTvsPDGRT_frags.txt",header=TRUE,stringsAsFactors=FALSE)
mat1<- as.matrix(data1[,-1])
row.names(mat1)<- data1[,1]
test <- apply(mat1, 1,function(x) all(x[1:6]==0) | all(x[7:12]==0) )
test1 <- mat1[!test,]


But this does not seem correct as I am having some rows like :

XLOC_004043    251    233    116    61    79    73    163    287    120    103    0    60
XLOC_004046    75    0    1    1    1    53    1    1    1    0    33    1
XLOC_004048    26.0133    0    0    0    0    232.296    2.676    8.21482    2.61507    0    0    0
XLOC_004050    0    1    4    36    20    0    0    2    1    0    1    9


output

XLOC_004043    251    233    116    61    79    73    163    287    120    103    0    60


The script i wrote is not working , can anyone suggest how to do it?

r • 34k views
0
Entering edit mode

Well, I have 20 columns with counts and two groups. I want to remove the rows where each group is having at least five 0 columns. First group 1:10 and second is 11:20. How can I use this line? Thank you.

3
Entering edit mode
5.3 years ago
cigdemselli ▴ 30

To exclude rows with zero values across all columns:

new_mtx <- exp_mtx [which(rowSums(exp_mtx) > 0), ]  # works much faster than below

new_mtx  <- apply (exp_mtx, 1, function(x) any(x==0) )

0
Entering edit mode
10.7 years ago
Pavel Senin ★ 1.9k

The problem could be due to the numeric format conversion, this works for me:

data = read.table("data.txt",header=F)

names(data) = c ("gene", paste("cond_1_",1:6,sep=""), paste("cond_2_",1:6,sep=""))

keeps <- apply(data, 1, function(x) { ! ( any(as.numeric(x[2:7])==0) || any(as.numeric(x[8:13])==0) ) } )

data[keeps, ]

[1] gene     cond_1_1 cond_1_2 cond_1_3 cond_1_4 cond_1_5 cond_1_6 cond_2_1 cond_2_2
[10] cond_2_3 cond_2_4 cond_2_5 cond_2_6
<0 lignes> (ou 'row.names' de longueur nulle)

0
Entering edit mode
10.7 years ago

Just change this line:

test <- apply(mat1, 1,function(x) all(x[1:6]==0) | all(x[7:12]==0) )


to

test <- apply(mat1, 1,function(x) any(x==0) )


to exclude any row with a single 0, or

test <- apply(mat1, 1,function(x) any(x[1:6]==0) && any(x[7:12]==0) )


to exclude rows where each group has at least one 0 (it's somewhat ambiguous which behavior you want).

0
Entering edit mode

that's right, any(x==0) will work

0
Entering edit mode

yes actually the post I did is really ambiguous, as I have two conditions and these are fragment counts, so if there is 0 fragment count in both conditions, it will be not worth keeping those genes for differential expression analysis with edgeR. Since the corresponding fpkm will be too low to be detected by qPCR. So for edgeR which first calculates the dispersion among the replicates within one condition and then computes the differential expression with the other other condition, these genes with 0 fragment count will not be of any use. So now I have actually removed all the rows with even one 0 value in a column as it does not make sense to compute anything with a 0 fragment count even if its one column in the entire row. So I removed all rows having single 0 values. with the below code.

mat1<- as.matrix(data1[,-1])

row.names(mat1)<- data1[,1]

row_sub = apply(mat1, 1, function(row) all(row !=0 ))

res1=mat1[row_sub,]

m1<- na.omit(res1)

0
Entering edit mode

FYI, none of those values should be used in edgeR. Don't ever use the output of cufflinks in edgeR (or DESeq, for that matter).

0
Entering edit mode

A second FYI, it's not an a priori good idea to remove columns due to there being a single 0 count. It's better to do proper independent filtering based on either summed or average counts (otherwise, you're already performing a test by excluding based on variance).

0
Entering edit mode

Then how do you generate a matrix for all your samples with fragment count? I am actually using geometric normalization in cuffdiff and running all samples together , so am getting the genes_read_group_tracking matrix from cuffdiff for all samples. Then am using an in house python script to select only the fragment count for each samples from there to create a matrix with genes and its corresponding fragment count for each samples. From there am preparing individual comparisons to which I am subjecting to edgeR for finding the differentially expressed genes. How else can you generate the fragment count list from your RNA-Seq data?

0
Entering edit mode

HTSeq-count or featureCounts would be the two standard ways. Data from cufflinks is fine to use only in cuffdiff (or similar packages intended for estimated count data). It should never be used in any of the raw-count based packages as those counts violate the statistical premise of the programs (garbage in, garbage out).

If this is a non-model organism or you just want to do a bit of de-novo feature discovery on a model organism, then take the merged GTF file produced by cuffmerge and do the counting based on that. There's no need to filter be counts before hand, just use the genefilter package in R (the newer versions of DESeq2 do this automagically).

0
Entering edit mode

"It's better to do proper independent filtering based on either summed or average counts (otherwise, you're already performing a test by excluding based on variance)." @dyryan79 - I would like to ask you on what basis you select the filtering based on fragment counts? I am thinking of selecting per condition the samples included and then compute the standard deviation and then include a threshold of the standard deviation for each gene. Is there any cut-offs you select based on mean or its arbitrary standards? I would like some suggestions from your end.

0
Entering edit mode

You really really should not filter based on standard deviation, that's statistically indefensible. Have a read through the following paper from Wolfgang Huber's group, which lays down the statistical framework and introduces the genefilter package. In short, you can maximize power for detecting DE genes by filtering based on summed or average counts across samples without introducing bias.