FastQC and coverage
4.8 years ago
aleka ▴ 110

I have 50% duplicates in WGS data from tumour samples. I expected deduplication to reduce the coverage from 30x to about 15x, but it drops below 8x, so I am trying to figure out why the coverage is lower than expected.

I can see a warning/failure on the per tile sequence quality module. I suspect this can affect the number of reads, but can it affect the coverage as well?
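
The expectation in the question can be written out as simple arithmetic. This is a minimal sketch, assuming duplicates are the only reads removed and that they are spread evenly over the genome; the function names are made up for illustration.

```python
def deduplicated_coverage(raw_coverage: float, duplicate_fraction: float) -> float:
    """Mean coverage left after dropping duplicates.

    Assumes duplicates are the only loss and are evenly distributed.
    """
    if not 0.0 <= duplicate_fraction <= 1.0:
        raise ValueError("duplicate_fraction must be between 0 and 1")
    return raw_coverage * (1.0 - duplicate_fraction)


def implied_loss_fraction(raw_coverage: float, observed_coverage: float) -> float:
    """Fraction of the raw coverage that must have been lost or excluded."""
    if raw_coverage <= 0:
        raise ValueError("raw_coverage must be positive")
    return 1.0 - observed_coverage / raw_coverage


# 30x raw coverage with 50% duplicates
print(deduplicated_coverage(30.0, 0.5))             # 15.0
# Going from 30x to 8x implies a much larger loss than 50%
print(round(implied_loss_fraction(30.0, 8.0), 3))   # 0.733
```

In other words, ending up below 8x from 30x means that more than roughly 73% of the sequenced bases are not being counted, so either the real duplication rate is well above 50% or other filters are removing reads too.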

next-gen sequencing • 2.5k views

Have you looked at this blog post? FastQC is great at pinpointing characteristics of your dataset that you should look at more closely. The "failures" are an essential part of this process. If your per tile sequence quality indeed has an issue, then post an image here. Those sequences will likely be taken care of by trimming/filtering (which you will be doing next). Any resulting reduction in the number of reads will affect gross coverage.

What sequencing platform? What library prep method?

How are you determining the duplication rate? Is it just from FastQC or based on the alignment (Picard MarkDuplicates, for example)? Those can be very different.
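
For comparison with the FastQC estimate, here is a rough sketch of how the alignment-based rate is derived from the MarkDuplicates metrics columns (`UNPAIRED_READS_EXAMINED`, `READ_PAIRS_EXAMINED`, `UNPAIRED_READ_DUPLICATES`, `READ_PAIR_DUPLICATES`), as I understand the Picard definition of `PERCENT_DUPLICATION`. The function name and the example counts are invented for illustration.

```python
def percent_duplication(unpaired_reads_examined: int,
                        read_pairs_examined: int,
                        unpaired_read_duplicates: int,
                        read_pair_duplicates: int) -> float:
    """Fraction of examined reads marked as duplicates.

    Pairs are counted twice because each pair contributes two reads.
    """
    total_reads = unpaired_reads_examined + 2 * read_pairs_examined
    if total_reads == 0:
        return 0.0
    duplicate_reads = unpaired_read_duplicates + 2 * read_pair_duplicates
    return duplicate_reads / total_reads


# Made-up counts, only to show the call
rate = percent_duplication(
    unpaired_reads_examined=1_000,
    read_pairs_examined=10_000,
    unpaired_read_duplicates=200,
    read_pair_duplicates=7_000,
)
print(f"{rate:.1%}")
```

This is computed over all examined mapped reads, whereas FastQC works from sequence identity on a subset of the raw reads, which is one reason the two numbers can disagree.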

Most likely on FastQC. Seeing those red "X" on FastQC output seems to stop people in their tracks :)

Sometimes I wish it stopped them. Just distracts and upsets them in my experience.

4.8 years ago
Dan D 7.2k

I recommend running Picard's CollectWgsMetrics Tool.

It will take a while because the analysis is exhaustive, but the provided metrics will give a thorough perspective, including percentage of optical duplicates, marked duplicates, and overlapping bases, among other things.

Hi all,

Thanks for the feedback. The duplication rate is based on the FastQC run. After trimming and filtering, I don't have overrepresented sequences or adapter content. The 50% duplication rate was measured after trimming and mapping. I lose a few million reads to trimming, but that doesn't explain the big loss of coverage (I expected ~15x and it drops below 8x). I am running CollectWgsMetrics at the moment, so I will see whether I get anything from that. The data are Illumina.

Aleka

4.8 years ago
aleka ▴ 110

So I had calculated the duplicate rate with FastQC; however, this is not entirely accurate, as it is an estimate based on the first 100,000 sequences. According to CollectWgsMetrics and the MarkDuplicates metrics, the duplicate rate is much higher, which fits the mean coverage calculated with GATK DepthOfCoverage.

However, the mean coverage reported by CollectWgsMetrics is smaller than the one I calculated from the average column of DepthOfCoverage. I suppose this happens because CollectWgsMetrics calculates the mean coverage over the bases of the genome territory (non-N bases in the genome), after all filters are applied. Is that right?

Also, I am not sure which filters CollectWgsMetrics applies. Does anyone know?

Aleka
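
One way to see which filters were applied is to look at the `PCT_EXC_*` columns in the CollectWgsMetrics output, which report the fraction of aligned bases excluded for each reason. Below is a minimal sketch of a parser for the metrics section of a Picard metrics file. The embedded example text and every number in it are invented for illustration, the real file has more columns, and the helper names are hypothetical.

```python
# Invented example in the general layout of a Picard metrics file:
# comment lines start with '#', then a tab-separated header row and value row.
HEADER = ["GENOME_TERRITORY", "MEAN_COVERAGE", "PCT_EXC_MAPQ", "PCT_EXC_DUPE",
          "PCT_EXC_UNPAIRED", "PCT_EXC_BASEQ", "PCT_EXC_OVERLAP",
          "PCT_EXC_CAPPED", "PCT_EXC_TOTAL"]
VALUES = ["2900000000", "7.5", "0.04", "0.62", "0.01", "0.03", "0.02",
          "0.0", "0.72"]
EXAMPLE_METRICS = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "# CollectWgsMetrics (made-up command line)",
    "",
    "## METRICS CLASS\tpicard.analysis.WgsMetrics",
    "\t".join(HEADER),
    "\t".join(VALUES),
    "",
    "## HISTOGRAM\tjava.lang.Integer",
])


def parse_picard_metrics(text: str) -> dict:
    """Return the first metrics row as a {column: value-string} dict."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("## METRICS CLASS"):
            if i + 2 >= len(lines):
                break
            header = lines[i + 1].split("\t")
            values = lines[i + 2].split("\t")
            return dict(zip(header, values))
    raise ValueError("no metrics section found")


def exclusion_fractions(metrics: dict) -> dict:
    """Keep only the PCT_EXC_* columns, converted to floats."""
    return {name: float(value) for name, value in metrics.items()
            if name.startswith("PCT_EXC_")}


metrics = parse_picard_metrics(EXAMPLE_METRICS)
print("mean coverage:", metrics["MEAN_COVERAGE"])
# List the exclusion reasons, largest first
for name, fraction in sorted(exclusion_fractions(metrics).items(),
                             key=lambda item: item[1], reverse=True):
    print(f"{name}\t{fraction:.2f}")
```

If `PCT_EXC_DUPE` dominates, duplicates explain most of the gap; large `PCT_EXC_MAPQ`, `PCT_EXC_BASEQ` or `PCT_EXC_OVERLAP` values would point to the quality thresholds or to overlapping read pairs instead.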

Please use ADD COMMENT/ADD REPLY when responding to existing posts or providing additional information. This helps keep the threads logically organized.

This was an answer to my initial question, in case you didn't realise it. Aleka

Your latest note adds useful information, since it appears that the problem is worse than you initially suspected. If you have such a high % of duplicates (I assume they are PCR duplicates if you marked them with Picard) then perhaps something went wrong with the experiment (low input, too many PCR cycles)? Do you see any visual evidence that this sample has uneven coverage across the genome?

Good point about the uneven coverage; you reminded me to check. I didn't see uneven coverage, though. It is more or less the same.

If the coverage is not uneven then why do you have so many duplicates? Is there an experimental explanation? Is this a HiSeq 4000 dataset?

Did you use the default parameters for CollectWgsMetrics? If so, the default minimum mapping quality and minimum base quality are both 20. Also, the COUNT_UNPAIRED parameter needs to be set if you have too many one-end reads mapping from paired-end data.
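
To check how much the thresholds matter, one could rerun CollectWgsMetrics with them relaxed and compare the mean coverage. The sketch below only builds the command line. It uses the legacy `KEY=value` Picard syntax, the parameter names `MINIMUM_MAPPING_QUALITY`, `MINIMUM_BASE_QUALITY` and `COUNT_UNPAIRED` are Picard's, and the file names, the `picard.jar` path and the helper function are placeholders of mine.

```python
def collect_wgs_metrics_command(bam: str, reference: str, output: str,
                                min_mapq: int = 20, min_baseq: int = 20,
                                count_unpaired: bool = False,
                                picard_jar: str = "picard.jar") -> list:
    """Build an argument list for Picard CollectWgsMetrics (not executed here)."""
    return [
        "java", "-jar", picard_jar, "CollectWgsMetrics",
        f"I={bam}",
        f"O={output}",
        f"R={reference}",
        f"MINIMUM_MAPPING_QUALITY={min_mapq}",
        f"MINIMUM_BASE_QUALITY={min_baseq}",
        f"COUNT_UNPAIRED={'true' if count_unpaired else 'false'}",
    ]


# Defaults as described above (both quality thresholds at 20)
default_cmd = collect_wgs_metrics_command(
    "tumour.bam", "reference.fasta", "wgs_metrics_default.txt")

# Relaxed thresholds, also counting reads without a mapped mate
relaxed_cmd = collect_wgs_metrics_command(
    "tumour.bam", "reference.fasta", "wgs_metrics_relaxed.txt",
    min_mapq=0, min_baseq=0, count_unpaired=True)

print(" ".join(default_cmd))
print(" ".join(relaxed_cmd))
```

The list can be passed to `subprocess.run(relaxed_cmd, check=True)` once the paths are real. As far as I know, duplicate-marked reads are still excluded with these settings, so the relaxed run should move closer to the DepthOfCoverage figure only to the extent that the quality and pairing filters were responsible.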

Yes, I used the default parameters. That would explain why CollectWgsMetrics gave a smaller coverage.