This image was obtained from this paper.
This the image description as on the paper.
DNA-sequence reads from a tumor sample are aligned to a reference genome (shown in gray). Single-nucleotide differences between reads and the reference genome indicate germline single-nucleotide variants (SNVs; green circles), somatic SNVs (red circles), or sequencing errors (black diamonds). (a) In a pure tumor sample, a location containing mismatches or single nucleotide substitutions in approximately half of the reads covering the location indicates a heterozygous germline SNV or a heterozygous somatic SNV - assuming that there is no copy number aberration at the locus. Algorithms for detecting SNVs distinguish true SNVs from sequencing errors by requiring multiple reads with the same single-letter substitution to be aligned at the position (gray boxes). (b) As tumor purity decreases, the fraction of reads containing somatic mutations decreases: cancerous and normal cells, and the reads originating from each, are shown in blue and orange, respectively. The number of reads reporting a somatic mutation decrease s with tumor purity, diminishing the s ignal to distinguish true somatic mutations from sequencing errors. In this example, only one heterozy gous somatic SNV and one hetererozygous germline SNV are detected (gray boxes) as the mutation in the middle set of aligned r eads is not distinguishable from sequencing errors.
I am new to this field of Bioinformatics and still trying to grasp the topics. So pardon my ignorance here. This is what I understood from the picture.
- A cancer sample can have both tumor cells and normal cells.
- For a given sample which is 100% made of cancer cells(tumor purity=100%), there could be variation amongst individual cells too. (Intratumour heterogenity).
- When we are sequencing cancer data, we take a tumor sample and produce multiple reads. Now take a single read and align with the reference genome. We get a mismatch at location 7(Say). Now I take another read (with a different starting site) from the same tumor sample and align it with the reference genome. Now we again get the same mismatch at the same location((i.e:-position 7)) of the reference genome. This continues for quite a few number of reads and if in more than half of the reads we get the same mismatch for the same location of the reference genome, then we can conclude that it is a SNV and not a sequencing error. Am I interpreting this correctly?
To show my thoughts more clearly, I am attaching this image.. Notice the SNV at location 7. So the point is in a sample with low tumor purity we would get less reads to determine whether its an SNV or a sequcning mistake.