Let's say that a whole-genome sample was sequenced at 30x coverage. As far as I'm aware, this means that each nucleotide of the reference genome is covered by 30 reads on average.
Let's also say that the tissue sample is heterozygous at some loci, with both alleles at a frequency of 0.5. Does this mean that the coverage of each allele at these loci is, in effect, 15x? I.e., if you aligned the data (and it aligned correctly), would you expect to see ~15 reads with allele 1 and ~15 with allele 2?
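To make the expectation concrete, here is a minimal sketch of what I mean. It assumes that the number of reads covering a site from each haplotype is Poisson-distributed with mean 15 (half of 30x), which is a common simplification rather than something a real sequencer guarantees. The function names and the number of simulated sites are just illustrative.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) sample using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_het_sites(total_coverage=30.0, n_sites=10_000, seed=0):
    """Return (allele1_reads, allele2_reads) for each simulated heterozygous site.

    Assumption: each haplotype contributes reads independently,
    with a mean depth of total_coverage / 2.
    """
    rng = random.Random(seed)
    per_haplotype = total_coverage / 2
    return [
        (poisson(per_haplotype, rng), poisson(per_haplotype, rng))
        for _ in range(n_sites)
    ]


counts = simulate_het_sites()
mean_allele1 = sum(a for a, _ in counts) / len(counts)
mean_allele2 = sum(b for _, b in counts) / len(counts)
mean_total = mean_allele1 + mean_allele2
print(f"allele 1: {mean_allele1:.2f}x, allele 2: {mean_allele2:.2f}x, total: {mean_total:.2f}x")
```

Under this model the averages come out near 15x per allele, but any single site can deviate quite a bit (e.g. 10 vs 19 reads), so "~15 and ~15" only holds on average.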
I ask because I am trying to make a simulated cancer genomics dataset. For this I am using ART, and have "mutated" the hg19.fa file by introducing some point mutations. This mutated file will represent one haploid set, whilst the non-mutated hg19.fa file will represent the other haploid set; this should add realistic point mutations, which are usually heterozygous in nature.
I then plan to sequence at 30x, so I was going to run ART on each file at 15x and then combine the reads to get 30x. Any thoughts?
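Roughly, the plan would look like the sketch below. The flags reflect my understanding of `art_illumina` (`-ss`, `-i`, `-p`, `-l`, `-f`, `-m`, `-s`, `-d`, `-o`) and should be checked against its help output. The file name `hg19_mutated.fa`, the output prefixes, the read length, and the fragment size settings are all placeholders I made up for illustration. I am also assuming that paired-end output is written as `<prefix>1.fq` and `<prefix>2.fq`.

```python
import shutil
import subprocess


def art_command(reference, prefix, fold, read_id_tag,
                read_len=100, frag_mean=300, frag_sd=30):
    """Build an art_illumina paired-end command as an argument list.

    All numeric defaults are placeholders, not recommendations.
    """
    return [
        "art_illumina",
        "-ss", "HS25",        # sequencing system profile (assumed choice)
        "-i", reference,      # input reference FASTA
        "-p",                 # paired-end simulation
        "-l", str(read_len),  # read length
        "-f", str(fold),      # fold coverage for this haplotype
        "-m", str(frag_mean), # mean fragment size
        "-s", str(frag_sd),   # fragment size standard deviation
        "-d", read_id_tag,    # read ID prefix, to keep names unique per haplotype
        "-o", prefix,         # output file prefix
    ]


def concat_files(inputs, output):
    """Concatenate FASTQ files in order (mates must be merged in the same order)."""
    with open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def simulate_diploid(normal_fa="hg19.fa", mutated_fa="hg19_mutated.fa", total_fold=30):
    """Run ART at half coverage on each haplotype, then merge the reads."""
    half = total_fold / 2
    subprocess.run(art_command(normal_fa, "hap_ref_", half, "ref_"), check=True)
    subprocess.run(art_command(mutated_fa, "hap_mut_", half, "mut_"), check=True)
    concat_files(["hap_ref_1.fq", "hap_mut_1.fq"], "tumour_1.fq")
    concat_files(["hap_ref_2.fq", "hap_mut_2.fq"], "tumour_2.fq")
```

Calling `simulate_diploid()` would run both simulations and write `tumour_1.fq` and `tumour_2.fq`. The separate read ID tags are there because both FASTA files share chromosome names, so I would expect read names to collide otherwise.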