Question

How to calculate unique-read coverage in human WGS

4.9 years ago

Nitha ▴ 20

Hi All,

I need to calculate coverage for human WGS data from Illumina sequencing. After reading Illumina's technical note (https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/hiseq-x-30x-coverage-technical-note-770-2014-042.pdf), which talks about "the average coverage of unique reads", I have some doubts about how WGS coverage is calculated for human data.

As far as I know, the formula for WGS sequencing coverage is: Coverage = (total reads * length of read * 2) / length of genome sequenced. Is there any other formula used for WGS coverage calculation? If so, what different strategy does Illumina use to calculate WGS coverage?
As mentioned above, the technical note says [Illumina defines sequencing coverage as “the average coverage of unique reads across the non-N portion of the human genome.”] My understanding of a unique read is "the read which mapped only once in a genome with a given number of mismatches" (please correct me if my understanding is wrong or limited). Could anyone explain how coverage is calculated for unique reads? Could adapter sequence sometimes end up being counted as unique reads?
Do I have to remove duplicates before calculating WGS coverage?
Does anyone have a link or supporting document describing how Illumina calculates coverage for human WGS?
What is unique read enrichment? What role do unique reads play in WGS coverage calculation?

Any leads will be highly appreciated.

Thanks in advance.

next-gen sequencing WGS human illumina • 3.4k views

ADD COMMENT • link updated 4.9 years ago by h.mon 35k • written 4.9 years ago by Nitha ▴ 20

The technical note you linked ends up at a "Page Not Found" message. As for your questions, I will briefly answer them:

The formula you used is for paired-end reads, with "total reads" counting read pairs. If you have single-end reads, you don't multiply by two. To make the formula simpler and more general, just use the total number of reads (all R1 reads + all R2 reads) as "total reads". To be precise, the overlapping part of paired-end reads should be subtracted, since those bases are not "unique".
"Unique reads" in this context means reads arising from independent DNA fragments, not PCR or optical duplicates. Unless the library preparation or the clustering of the library on the flowcell had problems, WGS should have a small number of technical duplicates, and most people just ignore them. Unique reads are important because sequencing the same PCR duplicate several times doesn't add information and can, in fact, introduce errors.

"the read which mapped only once in a genome with a given number of mismatches"

No, that describes a uniquely mapped read, which is a different thing.
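To make the two notions concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Read` tuple, the `split_reads` helper, treating reads with the same chromosome, start and strand as duplicates, and treating MAPQ >= 1 as "uniquely mapped" (the meaning of MAPQ depends on the aligner). Real duplicate-marking tools are more careful, for example they look at both mates of a pair.

```python
from collections import namedtuple

# Hypothetical minimal read record; real data would come from a BAM file.
Read = namedtuple("Read", "name chrom pos strand mapq")

def split_reads(reads, min_mapq=1):
    """Return (non_duplicate_reads, uniquely_mapped_reads).

    Duplicate rule (assumed): same chrom, start and strand as an earlier read.
    Unique-mapping rule (assumed): MAPQ >= min_mapq.
    """
    seen = set()
    non_duplicate = []
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key in seen:
            continue  # a later copy of the same fragment: a duplicate
        seen.add(key)
        non_duplicate.append(read)
    uniquely_mapped = [read for read in reads if read.mapq >= min_mapq]
    return non_duplicate, uniquely_mapped

reads = [
    Read("a", "chr1", 100, "+", 60),
    Read("b", "chr1", 100, "+", 60),  # duplicate of "a", but uniquely mapped
    Read("c", "chr1", 250, "-", 0),   # not a duplicate, but multi-mapped
]
non_dup, uniq = split_reads(reads)
print([r.name for r in non_dup])  # ['a', 'c']
print([r.name for r in uniq])     # ['a', 'b']
```

Read "b" is uniquely mapped yet not a "unique read" in the duplicate sense, while read "c" is the opposite, which is why the two terms should not be used interchangeably.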

ADD REPLY • link 4.9 years ago by h.mon 35k

Correct note PDF: https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/hiseq-x-30x-coverage-technical-note-770-2014-042.pdf

ADD REPLY • link 4.9 years ago by GenoMax 142k

@genomax The link you gave is the same technical note I mentioned in my post. I couldn't understand this statement in the technical note: [Illumina defines sequencing coverage as “the average coverage of unique reads across the non-N portion of the human genome.”]. Could you explain it?

ADD REPLY • link 4.9 years ago by Nitha ▴ 20

@h.mon Thank you. I am using paired-end sequencing. Now that the above link is working, could you please explain how Illumina calculates the coverage and what they mean by unique reads? The note says: How Does Illumina Calculate Human WGS Coverage? Illumina defines sequencing coverage as “the average coverage of unique reads across the non-N portion of the human genome.”* Researchers can perform coverage calculations equivalent to Illumina calculation methods using BWA WGA App v1.05 or Isaac WGS App v2.0.6

ADD REPLY • link 4.9 years ago by Nitha ▴ 20

score 1 · Answer 1 · 2019-06-24

The formula:

coverage = total reads * length of read / length of genome

describes the expected sequencing coverage, that is, the theoretical coverage one would obtain (under some assumptions) with a given amount of sequencing for a given (haploid) genome size. This is known as the Lander/Waterman equation and is mentioned in another Illumina tech note: Estimating Sequencing Coverage. See, for example, How Much Of The Genome Will Remain Un-Sequenced At A Given Coverage? for more discussion.
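As a rough worked example of the expected-coverage formula, here is a small Python sketch. The function names, the read counts and the rounded genome size of 3.0 Gb are all assumptions chosen for illustration, not values from the tech note. The second function uses the usual Poisson approximation that goes with this model, under which the chance of a base getting no reads at all is exp(-coverage).

```python
import math

def expected_coverage(total_reads, read_length, genome_size):
    """Expected coverage = total reads * read length / genome size.

    total_reads counts every read, so for paired-end data it is
    R1 reads + R2 reads (twice the number of pairs).
    """
    return total_reads * read_length / genome_size

def expected_fraction_uncovered(coverage):
    """Poisson approximation: probability that a base has zero reads."""
    return math.exp(-coverage)

# Assumed example: 300 million read pairs of 2 x 150 bp, ~3.0 Gb genome.
pairs = 300_000_000
cov = expected_coverage(total_reads=2 * pairs, read_length=150,
                        genome_size=3_000_000_000)
print(cov)  # 30.0
print(expected_fraction_uncovered(cov))
```

This number ignores duplicates, overlapping mates, unmapped reads and N bases, so it is an upper bound on what you should expect to measure after mapping.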

The tech note you linked (Sequencing Coverage Calculation Methods for Human Whole-Genome Sequencing) deals with the realized sequencing coverage, that is, the coverage actually obtained after some artifacts are removed and some difficulties are taken into account. This coverage is calculated empirically: reads are mapped to the genome and the resulting coverage is calculated. Table 1 of the tech note lists the artifacts that are removed, and the genome size used is the number of unambiguous bases after hard-masking the genome to remove repetitive regions.
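The empirical calculation can be sketched on a toy example in Python. This is only an illustration under stated assumptions: the `mean_coverage_non_n` helper, the tuple layout `(start, length, is_duplicate)` and the choice to drop only duplicates are mine, and the actual list of filters in Table 1 of the tech note is not reproduced here. On real data you would take depths from a BAM file with a dedicated tool instead of a Python loop.

```python
def mean_coverage_non_n(reference, alignments):
    """Average depth over non-N reference bases, skipping duplicate reads.

    reference  : reference sequence as a string (N = ambiguous base)
    alignments : iterable of (start, length, is_duplicate), 0-based start
    """
    depth = [0] * len(reference)
    for start, length, is_duplicate in alignments:
        if is_duplicate:
            continue  # assumed filter: duplicates do not count
        end = min(start + length, len(reference))
        for pos in range(start, end):
            depth[pos] += 1
    non_n = [i for i, base in enumerate(reference) if base.upper() != "N"]
    if not non_n:
        raise ValueError("reference has no non-N bases")
    return sum(depth[i] for i in non_n) / len(non_n)

reference = "ACGTNNNNACGT"   # 12 bases, 8 of them non-N
alignments = [
    (0, 4, False),
    (0, 4, True),    # duplicate of the first read, ignored
    (2, 8, False),   # spans the N block; only non-N bases are averaged
    (8, 4, False),
]
print(mean_coverage_non_n(reference, alignments))  # 1.5
```

Counting the duplicate and dividing by all 12 bases would give a different number, which is the kind of difference between tools and settings that makes empirical coverage values hard to compare.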

This empirical sequencing coverage may vary with the tools used (and their parameters) and with the choice of which artifacts to remove.