Question

Pseudo-Counts or Zero-Containing True Counts for Mann-Whitney-U Test?

4.0 years ago

jparker4 ▴ 20

When running Mann-Whitney U tests on count data from feature counts, is it correct to convert the counts to pseudo-counts (however you want to do that, x + 1 etc.) or to leave the zeros in? Previously I removed the zeros altogether when comparing counts over a specific region of the genome for different sets of genes. But then any difference I see applies only to genes that have reads in that region at all. One of the gene sets in the comparison could have many genes with no reads in the region of interest, so once you take this into account, the set that appeared to have more reads could actually be depleted of reads overall.
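To make the concern concrete, here is a minimal sketch with made-up numbers (the two gene sets and the helper `prob_superiority` are hypothetical, not taken from real data). It computes the fraction of cross-set gene pairs in which set A has the higher count, which is the Mann-Whitney U statistic divided by the number of pairs, once with zeros kept and once with zeros dropped.

```python
def prob_superiority(x, y):
    """Fraction of (x, y) pairs where x > y, counting ties as half.

    This equals the Mann-Whitney U statistic for x divided by len(x) * len(y).
    """
    wins = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return wins / (len(x) * len(y))


# Hypothetical normalised counts per gene (illustrative values only).
set_a = [0, 0, 0, 0, 0, 0, 30, 40, 50]      # mostly empty, a few well-covered genes
set_b = [5, 8, 10, 12, 15, 18, 20, 22, 25]  # every gene has some reads

kept = prob_superiority(set_a, set_b)
dropped = prob_superiority(
    [c for c in set_a if c > 0],
    [c for c in set_b if c > 0],
)
print(f"zeros kept: {kept:.2f}, zeros dropped: {dropped:.2f}")
```

With these toy values, set A looks higher once zeros are dropped but lower when they are kept, so the two analyses answer different questions.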

R ChIP-Seq • 1.4k views

ADD COMMENT • link updated 4.0 years ago by Asaf 10k • written 4.0 years ago by jparker4 ▴ 20

Please add details. What kind of data do you have? What are the sample sizes? What do you compare? How did you normalize? If this is non-single-cell NGS data then the answer is probably something like "use DESeq2, edgeR, limma".

ADD REPLY • link 4.0 years ago by ATpoint 81k

The data is ChIP-seq data performed in duplicate. I'm comparing read counts over specific regions of the genome BETWEEN different sets of genes within conditions rather than between conditions, which is why I haven't used DESeq2/edgeR/limma. I normalised by the number of mapped reads.

ADD REPLY • link 4.0 years ago by jparker4 ▴ 20

I think for this to be meaningful you would also need to correct for mappability and GC content. Different regions may have strikingly different counts simply because of GC bias and the uniqueness of the region, rather than biology.

ADD REPLY • link 4.0 years ago by ATpoint 81k

Even after correcting, you will still have to show that more reads (or whatever score you end up with) = more protein bound to DNA, using low-level, gold-standard methods. I recall a paper doing this with RNA-seq; it's not a perfect correlation but it works overall. I think RNA-seq is much easier than ChIP-seq, since the biological interpretation is broader.

ADD REPLY • link 4.0 years ago by Asaf 10k

Any advice on how to correct for these things? Can I use the inputs or IgG samples to correct for mappability by dividing by the counts for either of these samples?

ADD REPLY • link 4.0 years ago by jparker4 ▴ 20

You do a lot of black magic until you plot the numbers you get against the G/C content and see a line that looks straight. Don't divide counts unless they are closer to a normal distribution than to a Poisson.
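As a rough stand-in for that diagnostic plot, the sketch below computes the GC fraction of each region and its Pearson correlation with the region's count. The sequences, the counts, and the helper names `gc_fraction` and `pearson` are all hypothetical; in practice you would use the real region sequences and would likely plot the points rather than rely on a single number.

```python
from math import sqrt


def gc_fraction(seq):
    """Fraction of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    return sum(base in ("G", "C") for base in seq) / len(seq)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical regions: toy sequences with made-up read counts.
regions = {
    "region_1": ("ATATATATTA", 4),
    "region_2": ("ATGCATATTA", 9),
    "region_3": ("ATGCGCATTA", 15),
    "region_4": ("GCGCGCATTA", 22),
    "region_5": ("GCGCGCGCTA", 30),
}

gc = [gc_fraction(seq) for seq, _ in regions.values()]
counts = [count for _, count in regions.values()]
r = pearson(gc, counts)
print(f"correlation between GC fraction and count: {r:.3f}")
```

A strong correlation here, as in this toy data, would suggest that count differences between regions partly track GC content rather than biology.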

ADD REPLY • link 4.0 years ago by Asaf 10k

score 3 · Answer 1 · 2020-04-29

Mann-Whitney U is a non-parametric test. When counts are low it's a bit risky to use, because random sampling comes into play and the difference between one read and two reads becomes inflated. Adding a pseudocount shouldn't help: the ranking stays the same.
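The point about ranking can be checked directly. Below is a minimal sketch, in plain Python with made-up counts, that computes the U statistic from midranks before and after adding 1 to every value. The function names `midranks` and `mann_whitney_u` are my own, and only the statistic is computed, not a p-value.

```python
def midranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic for sample x, computed from its rank sum."""
    ranks = midranks(list(x) + list(y))
    rank_sum_x = sum(ranks[: len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Hypothetical per-gene counts for two gene sets, zeros included.
set_a = [0, 0, 1, 2, 5, 9]
set_b = [0, 3, 4, 7, 12, 20]

u_raw = mann_whitney_u(set_a, set_b)
u_pseudo = mann_whitney_u([c + 1 for c in set_a], [c + 1 for c in set_b])
print(u_raw, u_pseudo)
```

Adding the same constant to every value is a monotone shift, so the ranks and therefore U are identical in both cases. Zeros simply enter the test as tied lowest ranks.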

There are plenty of statistical methods for analyzing count data, based on the negative binomial distribution, Poisson, or other parametric models, that might be more accurate.