Question: Downsampling single-end human ChIP-seq data
0
gravatar for msimmer92
12 months ago by
msimmer92250
Uruguay
msimmer92250 wrote:

The sequencing of this ChIP-seq H3K4me3 experiment (human, single-end 50bp) has a problem with the number of reads per sample. The person that did the sequencing did not arrange correctly the samples in the lanes, causing the bias you see in the picture (first, upper one). I am trying to "rescue" it doing downsampling, but wanted to make sure I am doing it correctly without loosing too much coverage that I loose robustness in whatever finding I have later on.

I did the downsampling as described in Ming Tang's blog : https://crazyhottommy.blogspot.com/2016/05/downsampling-for-bam-files-to-certain.html . Here there is an example line of code:

samtools view -s 3.6 -b my.bam -o subsample.bam

I did three downsamplings to see what happens: one to 20 M reads, one to 25M and one to 10M. In the figure below you only see raw data, 25M and 10M in that order (from top to bottom). My questions are:

1) How much you think I can downsample without having an unreasonable coverage? Is 10M ok or given that is single-end more should be used? (for example, in Girdhar et al 2018's schizophrenia chip-seq paper in humans they used 13M but it was paired-end sequencing).

2) If I end up deciding 10M is too little and I go with the 25 or 20M one (please, look at the second plot), you can still see a difference between the samples. Is that difference small enough to continue analysing or you think that's already too much?

3) In case in 2) you consider this difference is too big and that that's not good, you have another suggestion of what I could do?

Thank!

enter image description here

ADD COMMENTlink modified 12 months ago by colin.kern920 • written 12 months ago by msimmer92250
4
gravatar for colin.kern
12 months ago by
colin.kern920
United States
colin.kern920 wrote:

1) How much you think I can downsample without having an unreasonable coverage? Is 10M ok or given that is single-end more should be used? (for example, in Girdhar et al 2018's schizophrenia chip-seq paper in humans they used 13M but it was paired-end sequencing).

The current ENCODE standard for H3K4me3 is a minimum of 20M mapped and filtered reads, however the previous standard was 10M. Because the important aspect of ChIP-seq analysis is the location of the alignments, not the actual sequence content of the reads, as long as the alignment rate is good with single-end reads then it's not much different from a similar paired-end depth.

2) If I end up deciding 10M is too little and I go with the 25 or 20M one (please, look at the second plot), you can still see a difference between the samples. Is that difference small enough to continue analysing or you think that's already too much?

It depends on the analysis you will be doing with the data. If your analysis is going to be based on looking at peak calls, for example, I would go ahead and do your peak calling on the full data and see how affected that is by the differences in depths.

Btw, are the numbers you're showing the raw reads or the number of reads after mapping and removing duplicates?

ADD COMMENTlink written 12 months ago by colin.kern920

Thank you ! It's good to know (point 1)). Yeah, I will do what you suggest in 2). Answering your question, these numbers are after samtools markdup and filtering out those alignments with MAPQ below 30 (so keeping uniquely mapped reads only). These read numbers you see in the plot I took them from checking those BAMs using samtools flagstat, and then I plotted them in a bar chart.

ADD REPLYlink written 12 months ago by msimmer92250
1
gravatar for ATpoint
12 months ago by
ATpoint36k
Germany
ATpoint36k wrote:

Can you give some background on how the samples are related? If all these samples go into the same differential analysis, I would not do anything towards downsampling, you only throw away information. Call peaks over all reads of IP combined against all input reads combined and then proceed with differential analysis. The normalization of e.g. csaw should take care of the rest. Check by PCA/MDS and Pearson corr. are confounded by the low depth. If so, maybe removing a few of them is better than downsampling as this would penalise all samples. Depends on the question and analysis strategy.

ADD COMMENTlink written 12 months ago by ATpoint36k

Yes, sorry. These are samples from humans with different disease stages and then controls (groups A-H, one of them is control and the others incremental stages). H is control, then A is the first mild stage and E is the worst. The idea is to call peaks with MACS2 and do differential binding analysis with DiffBind, further annotation and pathway analysis with R's ChIPseeker and ClusterProfiler. After consulting with you and others in the Biostar chat, I am doing the CSAW analysis in parallel as a control (at least to see how the MA plots look like). The normalization method in DiffBind that I selected is the default (TMM, full library size) but there's also the option of choosing "TMM effective size". The suggestions you made are good, thank you!

ADD REPLYlink modified 12 months ago • written 12 months ago by msimmer92250
1

If possible, maybe just re-sequence the under-sequenced samples to get some more reads given the quality control of the data you have looks promising in terms of IP efficiency etc.

ADD REPLYlink written 12 months ago by ATpoint36k

This would be my preference but unfortunately the guys that produce this data ran out of tissue and getting more is not easy. But thank you! Additionally, I want to say I tried the downsampling and the results look the same (in the sense that the results I'm observing seem to be so strong and the DiffBind normalization so good that they are still well-seen even with that). I tried other normalization methods and I see the same. I still have to try more suggestions that people here did to me, but so far it's looking good. Thank you very much :)

ADD REPLYlink written 12 months ago by msimmer92250

I mean sequence the exact same pool of DNA again. For a NGS run you typically do only need a fraction of the entire library. There should be something left in the tube, no? I typically submit like 25-50% of a sample to sequencing to have a backup in exactly that case or in case of technical errors with the machine.

ADD REPLYlink written 12 months ago by ATpoint36k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1181 users visited in the last hour