Is it normal for most data to be lost when filtering data?
2.2 years ago
seda ▴ 10

Hi everyone!

I have a metagenome dataset and I was trying to find differentially expressed features between my samples (I have only two samples), so I used the edgeR package on my count data. Before filtering, my data frame's dimensions were [3005, 2]; after the filtering step only [71, 2] remained. Most of the data has been lost, even though one of the samples has only 1715 zero values and the other only 1776. I used this code: keep <- rowSums(cpm(y) > 100) >= 2

Is it normal for the dimensions to drop from 3005 to 71? If anybody has an idea about filtering the data, I am open to all suggestions. Thanks all!
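For reference, this is how I checked how many rows survive at different CPM cutoffs (just a sketch, assuming y is the DGEList holding the 3005 x 2 count matrix):

library(edgeR)
for (cut in c(1, 5, 10, 100)) {
    keep <- rowSums(cpm(y) > cut) >= 2
    cat("CPM >", cut, "in both samples keeps", sum(keep), "rows\n")
}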

edgeR R filter

For edgeR, the default min.count if you use their filterByExpr function is 10. Your cutoff of 100 is likely too high for your data. It would be a good idea to make a histogram of the counts to check.
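For example, something along these lines shows where a CPM > 100 cutoff falls relative to the bulk of the data (a sketch on the log-CPM scale rather than raw counts, assuming y is the DGEList from the question):

library(edgeR)
logcpm <- cpm(y, log = TRUE)                 # log2 counts-per-million
hist(logcpm, breaks = 50, main = "log2-CPM distribution", xlab = "log2 CPM")
abline(v = log2(100), col = "red")           # approximate position of a CPM > 100 cutoff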

Thanks for replying, but OP's cutoff isn't directly comparable to min.count. The edgeR threshold of 10 is for counts, whereas OP is applying a cutoff to the counts-per-million. The edgeR threshold is required only for some samples, whereas OP is requiring the cutoff to be satisfied for every sample. If the sequencing depth is 10 million reads per sample (say), then OP's cpm cutoff corresponds to a count of 1,000 for each sample and at least 2,000 for the row sum. No wonder they lose most of their data.
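To spell out that arithmetic (the 10-million library size is just the illustrative figure above):

cpm_cutoff <- 100
lib_size <- 10e6                             # 10 million reads per sample
cpm_cutoff * lib_size / 1e6                  # 1000 reads needed in each sample
2 * cpm_cutoff * lib_size / 1e6              # at least 2000 for the row sum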

Yep, I agree that a cutoff of 100 is kind of high. With the default (removing genes that have fewer than about 10 counts) you already see a big decrease in the size of the data. In addition, genes with counts near 100 are clearly already being transcribed.

2.2 years ago
Gordon Smyth ★ 7.0k

No, it isn't at all normal to remove most of the data. The code you're using is a bit crazy and is bound to do just that. A cpm cutoff around 1 instead of 100 would be more usual. But why not follow the edgeR User's Guide and use

keep <- filterByExpr(y)

To be honest, you hardly need to filter at all. Since you don't have any replication, and hence can't estimate the dispersion, you really only need to remove rows that are all zero.
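Concretely, that could look like this (a sketch, assuming y is the DGEList; keep.lib.sizes = FALSE follows the User's Guide examples):

y <- y[keep, , keep.lib.sizes = FALSE]       # apply the filterByExpr result

# or, with no replication, just drop the all-zero rows:
keep <- rowSums(y$counts) > 0
y <- y[keep, , keep.lib.sizes = FALSE]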
