Hi All,
I have to calculate the coverage for human WGS data from Illumina sequencing reads. After reading Illumina's technical note, I have some doubts about how WGS coverage is calculated for human samples.
The note (https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/hiseq-x-30x-coverage-technical-note-770-2014-042.pdf) talks about "the average coverage of unique reads".
As far as I know, the formula for calculating sequence coverage for WGS is: Coverage = (total reads * read length * 2) / length of genome sequenced. Is there any other formula used for WGS coverage calculation? If so, what different strategy does the Illumina platform use to calculate WGS coverage?
As I said, the technical note (PDF linked above) says [Illumina defines sequencing coverage as “the average coverage of unique reads across the non-N portion of the human genome.”] My understanding of a unique read is "a read that maps only once in the genome with a given number of mismatches" (please correct me if my understanding is wrong or limited). Could anyone explain how the coverage is calculated for unique reads? I suspect that adapter regions may sometimes be counted as unique reads. Is that so?
Do I have to remove duplicates before calculating WGS coverage?
Does anyone have a link or supporting document describing how Illumina calculates coverage for human WGS?
What is unique read enrichment? What role do unique reads play in WGS coverage calculation?
Any leads will be highly appreciated.
Thanks in advance.
The technical note you linked ends up at a "Page Not Found" message. As for your questions, I will briefly answer them:
the formula you used is for paired-end reads, with "total reads" counting read pairs. If you have single-end reads, you don't multiply by two - or, to make the formula simpler and more general, just use as total reads, well, the total number of reads (all R1 reads + all R2 reads). If one were to be precise, the overlapping part of paired-end reads should be subtracted, as it is not "unique".
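To make the arithmetic concrete, here is a minimal Python sketch of the formula in both forms (counting pairs and multiplying by two, or counting every read once). The function name, the 3 Gb genome length and the 150 bp read length are illustrative assumptions, not values from the technical note, and the sketch ignores overlapping mates and duplicates.

```python
def mean_coverage(read_count, read_length, genome_length, paired=False):
    """Theoretical mean coverage = sequenced bases / genome length.

    read_count is the number of read pairs when paired=True; otherwise it is
    the number of individual reads (R1 + R2 for paired-end data).
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    # A pair contributes two reads of read_length bases each.
    reads = read_count * 2 if paired else read_count
    return reads * read_length / genome_length


GENOME_LENGTH = 3_000_000_000  # assumed round number, for illustration only
READ_LENGTH = 150              # assumed read length, for illustration only

# 300 million pairs counted as pairs ...
print(mean_coverage(300_000_000, READ_LENGTH, GENOME_LENGTH, paired=True))  # 30.0
# ... is the same as 600 million individual reads (R1 + R2).
print(mean_coverage(600_000_000, READ_LENGTH, GENOME_LENGTH))  # 30.0
```

Both calls give the same answer, which is the point: the "* 2" is only there when the count is in pairs.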
"unique reads" in this context means reads arising from an independent DNA fragment, not PCR or optical duplicates. Unless the library preparation or the library clusterization on the flowcell had problems, WGS should have a small number of technical duplicates, and most people just ignore them. Unique reads are important because sequencing several times the same PCR duplicate doesn't add information and, in fact, can introduce errors.
Correct note PDF: https://www.illumina.com/content/dam/illumina-marketing/documents/products/technotes/hiseq-x-30x-coverage-technical-note-770-2014-042.pdf
@genomax The link you gave is the same technical note that I mentioned in my post. I couldn't understand this statement in it: [Illumina defines sequencing coverage as “the average coverage of unique reads across the non-N portion of the human genome.”]. Could you explain it?
@h.mon Thank you. I am using paired-end sequences. The link above works now. Could you please explain how Illumina calculates the coverage, and what they mean by unique reads? The technical note says:
How Does Illumina Calculate Human WGS Coverage?
Illumina defines sequencing coverage as “the average coverage of unique reads across the non-N portion of the human genome.” Researchers can perform coverage calculations equivalent to Illumina calculation methods using BWA WGA App v1.05 or Isaac WGS App v2.0.6.
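One possible reading of the quoted definition is: bases from unique (non-duplicate) aligned reads, divided by the number of reference bases that are not N. Below is a rough Python sketch of that reading only; it is an assumption about what the sentence means, not Illumina's actual implementation. The function names and the toy reference are hypothetical, and the count of unique aligned bases would have to come from your own duplicate-marked alignments.

```python
def non_n_length(fasta_lines):
    """Count reference bases that are not N, skipping FASTA header lines."""
    total = 0
    for line in fasta_lines:
        seq = line.strip()
        if not seq or seq.startswith(">"):
            continue  # blank line or header such as ">chr1"
        total += len(seq) - seq.upper().count("N")
    return total


def coverage_over_non_n(unique_aligned_bases, fasta_lines):
    """Mean coverage of unique reads over the non-N part of the reference."""
    denominator = non_n_length(fasta_lines)
    if denominator == 0:
        raise ValueError("reference contains no non-N bases")
    return unique_aligned_bases / denominator


# Toy reference: 14 bases in total, 6 of them N (upper or lower case).
toy_reference = [">chr_toy", "ACGTNNNN", "nnACGT"]
print(non_n_length(toy_reference))              # 8
print(coverage_over_non_n(240, toy_reference))  # 30.0
```

Excluding N bases from the denominator matters because no read can align to those gaps, so counting them would understate the coverage of the part of the genome that can actually be sequenced.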