Question

How to remove low abundance and less prevalent data from my dataset?

0

3.5 years ago

deep771992chanda ▴ 40

Hi friends! I have a relative abundance table in .tsv format where samples are in columns and rows contain the features (pathways), something like this reproducible example. (Sorry, I was unable to paste the table into the question, so I am linking to it instead.)

Now, I want to keep features (i.e. pathways) that are present with abundance >0.0001 and present in at least 10% of samples. Can you please tell me how I can do that? Specifically, can you suggest a bash command for this?

Thanks, dpc

metagenome bash • 1.3k views

0

Hi friend. This would be easier to do in R.

present with abundance >0.0001

This is not well-defined. Is this mean abundance across all samples?

ADD REPLY • link 3.5 years ago by Kevin Blighe 87k

0

No, Kevin. If you sum each column you get a total of 1.00, so each cell already holds the relative abundance of the corresponding row (i.e. pathway) in that sample. No further calculation is needed; rows only have to be kept or removed. I want to keep only those rows where at least 10% of the cells contain a value >0.0001. For example, say I have 2 rows and 20 columns. In the first row, 2 cells (i.e. 10% of all cells) have values >0.0001 and the other 18 cells have values <0.0001. We will keep this row in the output table. Suppose the second row has only one value >0.0001. This row will be removed, because at least 10% of all cells, i.e. at least 2 cells, should contain values >0.0001. Thanks

ADD REPLY • link 3.5 years ago by deep771992chanda ▴ 40

0

Just create a boolean matrix and then count the number of TRUE and FALSE per row. Something like:

apply(mat > 0.0001, 1, table)

Then go from there.

Even better, this returns a single boolean vector marking the rows (pathways) to keep, i.e. those with values > 0.0001 in at least 10% of samples:

keep <- rowSums(mat > 0.0001) >= ncol(mat) / 10

mat[keep, ] then gives the filtered table.
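
Since the question asked for a bash command, here is a minimal awk sketch of the filter described in the question (values > 0.0001 in at least 10% of samples). It assumes the file is tab-separated, the first line is a header, the first column holds the pathway name and every other column is a sample. None of this is confirmed by the post, so adjust as needed. The function name filter_features and its optional arguments are made up for illustration.

```shell
# Hypothetical helper: filter_features FILE [THRESHOLD] [MIN_PCT]
# Assumptions (not from the original post): FILE is tab-separated, line 1 is a
# header, column 1 holds the pathway name, and every other column is a sample.
filter_features() {
  awk -F'\t' -v thr="${2:-0.0001}" -v pct="${3:-10}" '
    NR == 1 { print; next }                  # always keep the header line
    {
      hits = 0
      for (i = 2; i <= NF; i++)              # count samples above the threshold
        if ($i + 0 > thr) hits++
      # keep the row if hits / (NF - 1) >= pct / 100, compared without division
      if (hits * 100 >= pct * (NF - 1)) print
    }
  ' "$1"
}

# Example call (file names are placeholders):
#   filter_features pathways.tsv > pathways.filtered.tsv
```

The comparison is strict (> threshold) per cell and inclusive (>= 10%) per row, matching the 2-of-20 example given above. The percentage is compared as integers to avoid floating-point rounding at the boundary.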

ADD REPLY • link 3.5 years ago by Kevin Blighe 87k