Question

Comparing two datasets when one of them is expected to have substantially more mapped reads

5.8 years ago

VicGB • 0

Hi everyone,

I'm working with datasets in which I know one of them should have more signal globally, especially in genes, and I'm struggling to compare them because most tools use the number of mapped reads as a normalization factor.

To illustrate what I mean, suppose you block the transcription machinery in one condition, so you should have fewer mapped reads per transcript in that condition than in the control. (To be clear, my dataset is not RNA-Seq but another type of sequencing, and I also want to assess enrichment in intergenic regions as well as genic ones.) In my actual dataset, I expect (and actually see) more mapped reads in the untreated sample than in the treated one (the latter is actually my negative control sample), and I want to see enrichment over my treated sample. However, as far as I know, peak callers like MACS2 or count-based approaches such as DESeq2 would scale that signal down.

I would like to do peak calling and plot signal profiles, but I think most peak callers scale down the dataset with the higher number of mapped reads to match the other one. For the signal profiles I would use bamCompare or bamCoverage, but the normalization would again decrease the signal in the sample with more mapped reads.

In short, would it be more appropriate in this case to normalize the datasets by the total number of sequenced reads per condition?
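To make the question concrete, here is a minimal Python sketch comparing the two normalization bases. All library sizes and region counts are hypothetical numbers chosen for illustration, and `per_million` and `normalized_signal` are illustrative helper names, not functions from any of the tools mentioned above.

```python
# Hypothetical library sizes (illustration only, not real data).
libraries = {
    "untreated": {"sequenced": 20_000_000, "mapped": 18_000_000},
    "treated": {"sequenced": 20_000_000, "mapped": 9_000_000},
}

# Hypothetical raw read counts in one region of interest.
region_counts = {"untreated": 180, "treated": 90}


def per_million(count, total_reads):
    """Rescale a raw count to reads per million of the chosen library total."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return count * 1_000_000 / total_reads


def normalized_signal(basis):
    """Normalize the region count of every sample by 'mapped' or 'sequenced' reads."""
    return {
        name: per_million(region_counts[name], sizes[basis])
        for name, sizes in libraries.items()
    }


by_mapped = normalized_signal("mapped")
by_sequenced = normalized_signal("sequenced")

# Scaling by mapped reads gives both samples the same value here,
# so the global loss of signal in the treated sample disappears.
print("per million mapped:   ", by_mapped)
# Scaling by sequenced reads keeps the twofold difference.
print("per million sequenced:", by_sequenced)
```

Whether sequenced reads are a fair denominator depends on the libraries being otherwise comparable (same input amount, same protocol), which this sketch simply assumes. A factor derived this way could then be supplied to a tool that accepts a user-defined scale factor, for instance through the `--scaleFactor` option of bamCoverage.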

Normalization Sequencing ChIP-Seq • 1.3k views

updated 4.1 years ago by Biostar 20 • written 5.8 years ago by VicGB • 0

score 1 · Answer 1 · 2018-07-18

5.8 years ago

Benn 8.3k

If this is really RNA-seq, you shouldn't do peak calling at all, but quantify with e.g. featureCounts. If the total read counts of your libraries are very different, you might want to use limma-voom.

5.8 years ago by Benn 8.3k

Hi! This is not RNA-Seq. That example was only to illustrate the situation.

5.8 years ago by VicGB • 0

You used the RNAseq tag, so we assumed it was RNAseq. Be precise in your question and you will get precise answers.

5.8 years ago by h.mon 35k

score 0 · Answer 2 · 2018-07-18

5.8 years ago

h.mon 35k

This is RNAseq, and you want to compare expression between control and treatment, right? So use edgeR or DESeq2; they will take care of the normalization for you. Both have pretty detailed user guides.

edit: as b.nota said, no peak calling is needed; use the genome annotation and featureCounts to summarize counts over annotated genes.

5.8 years ago by h.mon 35k

Sorry. I only wanted to illustrate a case with RNA-Seq data, but my dataset is not RNA-Seq. On the treated sample we use an endonuclease that recognizes some elements in the genome that should still be present in the untreated sample, and that's why I expect more mapped reads there (which is actually what I see). The thing about DESeq2 is that, if I'm not wrong, it assumes that most transcripts (or other regions) won't change between conditions, and it applies a median-based normalization against a pseudo-reference count. I think the software would therefore interpret my untreated sample as having been sequenced more deeply than the treated sample.
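The concern can be illustrated with a small Python sketch of the median-of-ratios idea. This is a simplified re-implementation for illustration, not DESeq2's own code, and the counts are made up so that the treated sample has exactly half the signal of the untreated one in every region.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample with the median-of-ratios idea.

    counts maps sample name -> list of raw counts, same region order in all
    samples. Regions with a zero count in any sample are skipped.
    """
    samples = list(counts)
    n_regions = len(counts[samples[0]])
    log_ratios = {s: [] for s in samples}
    for i in range(n_regions):
        values = [counts[s][i] for s in samples]
        if any(v <= 0 for v in values):
            continue
        # Log of the geometric mean across samples: the pseudo-reference.
        log_reference = sum(math.log(v) for v in values) / len(values)
        for s, v in zip(samples, values):
            log_ratios[s].append(math.log(v) - log_reference)
    if not log_ratios[samples[0]]:
        raise ValueError("no region has non-zero counts in every sample")
    return {s: math.exp(median(log_ratios[s])) for s in samples}


# Hypothetical counts: a uniform twofold loss of signal in the treated sample.
counts = {
    "untreated": [100, 200, 400, 800],
    "treated": [50, 100, 200, 400],
}

factors = median_of_ratios_size_factors(counts)
normalized = {
    s: [c / factors[s] for c in counts[s]] for s in counts
}

print(factors)      # the untreated factor is twice the treated one
print(normalized)   # after division, both samples have matching values
```

In this toy case the global twofold difference is absorbed entirely by the size factors, as if the untreated sample had simply been sequenced twice as deeply, which is the behaviour I'm worried about.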

5.8 years ago by VicGB • 0

See the suggestions in the threads "deeptools bamcompare/bamcoverage merging bins?" and "Deeptools sample scaling".

5.8 years ago by h.mon 35k