Question: What causes strange oscillations in the count distribution of Salmon and STAR aligned gene tables?
Gabriel (Paris) wrote, 10 months ago:

Background

DESeq2 uses independent filtering to automatically exclude low-count genes from the analysis. Nevertheless, it is common to pre-filter ultra-low-count genes beforehand (a typical condition is rowSums(counts) >= 10 reads). This reduces the load on DESeq2, and it can also reduce false positives, because genes with very low counts, especially when only one group has any reads, have a much higher error rate at a given p-value. I'm sure you are all familiar with this.
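To make the pre-filter condition concrete, here is a minimal sketch in Python/NumPy (Python is used only for illustration; the actual workflow is in R, where the DESeq2 vignette's equivalent is `keep <- rowSums(counts(dds)) >= 10; dds <- dds[keep,]`). The toy matrix and threshold are made up for the example.

```python
import numpy as np

# Toy count matrix: 6 genes (rows) x 4 samples (columns).
counts = np.array([
    [0,   1,   0,   2],   # rowSum = 3   -> dropped
    [5,   3,   4,   2],   # rowSum = 14  -> kept
    [0,   0,   9,   0],   # expressed in one sample only -> dropped
    [30,  25,  28,  31],  # rowSum = 114 -> kept
    [1,   0,   0,   0],   # rowSum = 1   -> dropped
    [100, 90, 110,  95],  # rowSum = 395 -> kept
])

# Keep genes whose total count across all samples is at least 10.
keep = counts.sum(axis=1) >= 10
filtered = counts[keep]

print(keep.tolist())    # [False, True, False, True, False, True]
print(filtered.shape)   # (3, 4): 3 of 6 genes survive
```

The same logic applies whether the counts come from STAR/htseq (integers) or from tximport-imported Salmon estimates (fractional).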

I wasn't sure whether to apply this to Salmon output as well, but I did, and it gives me better results. I am now filtering out genes with fewer than 6 reads on average before passing the Salmon counts to DESeq2 for differential expression analysis.

But when I checked the probability distribution of the counts, both the STAR and the Salmon count tables show a massive amount of noise (spiky density) for average counts all the way up to about 20.

Question

What is causing the spikes in density at very low counts (below ~20) for both Salmon and STAR? See the attached images.

(Images: count distribution of Salmon-aligned reads; count distribution of STAR-aligned reads.)

Note that both gene counts are made from the same FASTQ raw data.

My personal guess is that it's a sampling-frequency problem: since read counts are somewhat quantized, the relative "bin size" grows as you approach 0, and you get oscillations, something like a Nyquist limit. I haven't tried to verify this, but I have found there is always a spike of probability at 0, 1, or 2 reads per gene.
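One way to probe the quantization idea without any aligner at all is to simulate integer counts directly (a sketch; the lognormal/Poisson parameters below are arbitrary choices, not anything estimated from real data). The spikes at 0, 1, 2, ... appear simply because low counts collapse onto a handful of integer values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate per-gene expression: many lowly expressed genes, a few
# highly expressed ones, then draw integer counts around those means.
means = rng.lognormal(mean=1.0, sigma=2.0, size=50_000)
counts = rng.poisson(means)

# At the low end, all mass piles onto the integers 0, 1, 2, ...,
# so the empirical density shows isolated spikes there.
values, freq = np.unique(counts, return_counts=True)
zero_frac = (counts == 0).mean()
print("fraction of zeros:", round(zero_frac, 3))
print("lowest observed values:", values[:5].tolist())
```

If the spikes in this synthetic data look like the ones in the real count tables, integer quantization alone explains them, with no Nyquist-style effect needed.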

Tags: counts, filtering, star, salmon, htseq

it is customary to pre-filter ultra low read genes

Although this is occasionally done, I would not say it is customary, at least for DESeq2.

— written 10 months ago by igor

It depends, I would say. Mike Love, the author of DESeq2, does not explicitly recommend it and says in the vignette that strict pre-filtering is typically not necessary. In contrast, the edgeR maintainers explicitly recommend it, via filterByExpr on the count matrix.

— written 10 months ago by ATpoint

The post only mentioned DESeq2, so I should've clarified.

But yes, for different tools, the common workflows will vary.

— written 10 months ago by igor

Isn't that normal, since a large number of genes is not or barely expressed, which inflates the density at 0/1/2? The higher counts span everything from "non-low" to infinity, while non-expressed genes are squeezed into the range from 0 to a small number like 10. I don't find this surprising. Did you use tximport (unrelated to this question, just asking)?

— written 10 months ago by ATpoint

That's what I was thinking the reason is. However, tximport counts are not strictly quantized, because Salmon uses statistical inference to estimate the counts; yet the counts near 0 are mostly quantized to integers. I don't fully understand the implications of this, or why the counts are mostly quantized for some genes but not for others.

(Images: un-normalized count table from Salmon; un-normalized count table from Salmon, low counts only.)

Yes, I did use tximport.

— written 10 months ago by Gabriel

the counts near 0 are mostly quantized to integers

You only notice the ones near 0 because you are on a log scale. STAR counts are all integers, and you can see that their curve also starts to look smooth very quickly.

Also, a small adjustment with low counts will be more likely to give you the same number when rounded to the nearest integer. For example, 0.99 * 1 is still very close to 1, but 0.99 * 100 is 99 (a whole integer away).
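The log-scale point above can be made concrete with a few lines of arithmetic (Python here purely for illustration): the gap between consecutive integers on a log10 axis is large near 1 and tiny near 100, so integer-valued counts look spiky at the low end and blend into a smooth curve at the high end.

```python
import numpy as np

# Width of the gap between n and n+1 on a log10 axis.
for n in (1, 2, 10, 100, 1000):
    gap = np.log10(n + 1) - np.log10(n)
    print(n, round(gap, 4))
# The gap at n=1 is log10(2) ~ 0.301; by n=100 it has shrunk
# below 0.005, roughly 70x narrower.
```

So the same integer quantization is present everywhere; the log axis just magnifies it near zero.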

— written 10 months ago by igor