Inconsistent Coverage on X chromosome between genders
1
0
Entering edit mode
5.0 years ago

Hi all,

I have a dataset of ~200 exomes from Agilent V4 71MB. I calculated coverage with bedtools for off-target and on-target regions. Then I normalised them to library size (divide by the total number of reads in on-target or off-target regions).

Then I divided my cohort into 2: males and females. For each pair Male-Female I calculated median ratio between coverages of the corresponding regions on X chromosome.

In the end I checked if the ratio is close to expected (0.5 or 2). I was surpised to see that for on-target reads the median of ratios was finally 1.94, and for off-target - 1.72! This is quite far from the expected 2.

I also tried to do subsampling, different coverage normalizations such as GC content, variance stabilization, however, I always have numbers close to that.

Do smb faced this problem? Could you suggest some reasons of that?

Example of code for getting ratios in R:

> for (i in 1:ncol(x_coverages_males)) {
+   for (j in 1:ncol(x_coverages_females)) {
+     ratio <- c(ratio, median(x_coverages_females[,j] / x_coverages_males[,i]))
+   }
+ }
> summary(ratio)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.099   1.910   1.941   1.930   1.982   2.170


UPD: I am not SO confused about on-target reads - yeap, it is 1.94 instead of 2, but still the bias is not so big. But for off-target it can not be ignored for sure :/

UPD2 and answer: the key was in mappability. The mappability filter should be not MAPQ=1 or 2, but MAPQ=10. It gives ratios close to 2, yeah.

coverage WES • 1.6k views
2
Entering edit mode

Did you forget about the pseudoautosomal regions? You should count X-chromosome coverage outside of those.

0
Entering edit mode

Of course I forgot, thank you! But according to wiki they are ~1.5 percents of the sex chromosomes, and the median was used as an estimation of ratios, I will remove them and report the result here, but I doubt that ~1.5 percents may make shift of ratios from 2 to 1.72 ...

0
Entering edit mode

Excluded them, the ratios improved only marginally. But you gave an excellent idea, will try to filter reads by mappability more strictly.

0
Entering edit mode

I m not sure to understand what you mean about target and off target. For the problem if i understand correctly is that you don't have the ratio between male an female you expected right ?
To me ( may be i m wrong ) to get this ratio you should check heterozygous and homozygous SNP , because i don't know your protocol but you have selected region (target region ) so that's mean enrichment and after amplification , so at the end quit same number of reads for X chromosome.

0
Entering edit mode

Yeap, that's right, I do not get the coverage ratio between males and females as expected.

For ratio between SNVs is is well known that for exome seq it is more close to ~0.48 than to 0.5, so I would say it is kinda useless to check that - it is uneven by default. "at the end quit same number of reads for X chromosome." - yeap, that's what I am confused about

On-targets means "reads from the regions of enrichment", and off-target reads are "reads that are > several insert distances from any target", so basically "reads you do not asked for mostly from intronic/intergenic regions". Typically for WES 30-40% of reads are off-target.

0
Entering edit mode

Well don't you think enrichment / selected region saturation (during library preparation) is the main reason ?

0
Entering edit mode

Yeap, but technically, if you have 1 copy or 2 copies - the processes should be similar for both of them and in the end we will have ratio close to 2 anyway.

1
Entering edit mode

Based on the replies here I guess I am not the only one who doesn't fully understand the issue at hand. Can you clarify what your expectations are for both, on-target and off-target, and why?

0
Entering edit mode

I expect the number of reads on X chromosome be doubled for females comparing to males regardless are they on-target or off-target. This is a common assumption that arise from the fact that we have 2 copies of the region in females and 1 copy for males.

0
Entering edit mode

i think during the capture your DNA is in saturation compare of your probes, so at he end you have the same quantity of DNA amplified. That confirmed by the ration between homozygous and heterozygous SNP you should get.

So i have wrong theory but i don't understand why there is this ratio ....

0
Entering edit mode

I finally understood your concerns and it makes sense. I will try to include this "saturation" as the parameter in my estimations. But such drastic change in number of reads was partially explained by the "uncertainly" mapped reads. I excluded all reads with MAPQ < 2 and I had median ratio of 1.923 for off-target reads which is much closer to the expected 2. I am trying to play with MAPQ now and will update the question accordingly to the results.

0
Entering edit mode

My colleague who actually knows what happens in Agilent enrichment kits told me that there are much more probes than DNA so the effect should be minor. But we'll know tomorrow when the results will be finished.

0
Entering edit mode

Thank's to take the time to check that bias, and give me a feed back if you have one :)

0
Entering edit mode

I think this bias actually exists. I was thinking that strict mappability filtering will eliminate this bias, but finally no, the ratio is still 1.93, not 2. So I guess there is no other explanation except saturation of probes...

0
Entering edit mode

Thank you for the reply. I discuss with so colleagues in the lab , they explained to me that during amplification if the PCR cycle are under 35 cycles you can be quantitative for X. That's make sens too.

0
Entering edit mode

kuckunniwid : Add the info you updated above as an answer and then accept (green check mark) to provide closure for this thread.

0
Entering edit mode

Thanks! I will close when the issue will be finally solved (hopefully tomorrow), I mistakenly reported it as "answered" while there is still some uncertainty here.

3
Entering edit mode
5.0 years ago

The answer to the problem seems to have 2 parts:

1) in case you do not filter reads with high probabilitiy of misalignment, you will have a lot of "background" reads on your genome. That means that erroneous reads from all the genome will randomly map to some locations. It makes a huge effect on off-target reads, shifting the expected ratios from 2 to ~1.7.

2) it seems that the number of probes in the enrichment kits is not gender specific. So if the person has 1 X chromosome or 2 X choromosomes - the same number of probes will try to hybridase with them, and the probability of the event = (hybridization with 1 chromosome) when only 1 copy is presented will be higher that the probability of the same event for 2 copies. Thanks @Titus for the suggestion. So even after strict mappability filtering we will get the ratio ~1.93 instead of 2.

1
Entering edit mode

You can add this condition : during amplification if the PCR cycle are under 35 cycles you can be quantitative for X