Hello everyone, I am working with variation rates (expressed in SNP/Mbp) and I'm observing something that doesn't make much biological sense.
I have two variant call sets: one contains calls across the whole genome, the other is a subset of it restricted to coding regions only (CDS features taken from a GFF file).
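For context, the CDS subsetting step can be sketched roughly like this (a minimal illustration, not my actual pipeline; the function names and the toy GFF line are made up):

```python
# Collect CDS intervals per chromosome from GFF lines, then test whether
# a SNP position falls inside any of them (GFF coordinates are 1-based,
# inclusive; column 3 is the feature type, columns 4-5 are start/end).
from collections import defaultdict

def load_cds_intervals(gff_lines):
    """Return {chrom: [(start, end), ...]} for CDS features."""
    cds = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 5 and fields[2] == "CDS":
            cds[fields[0]].append((int(fields[3]), int(fields[4])))
    return cds

def in_cds(cds, chrom, pos):
    """True if the 1-based position pos overlaps a CDS on chrom."""
    return any(start <= pos <= end for start, end in cds.get(chrom, []))

# Toy example: one CDS on chr1 from 100 to 200.
cds = load_cds_intervals(["chr1\tsrc\tCDS\t100\t200\t.\t+\t0\tID=cds1"])
print(in_cds(cds, "chr1", 150))  # → True
print(in_cds(cds, "chr1", 250))  # → False
```

In practice I do this with standard tools rather than by hand, but the logic of the subsetting is the same.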
I calculate the density by dividing the number of SNPs identified by the number of positions covered by the reads used to call them: e.g. if I identify 100 SNPs within coding regions and have coverage over 1 Mbp of coding positions, the density is 100 SNP/Mbp.
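Written out, the calculation is just (numbers below are the example from the post, not real data):

```python
def snp_density_per_mbp(n_snps, covered_bp):
    """SNPs per megabase, normalized by positions actually covered by reads."""
    return n_snps / (covered_bp / 1e6)

# 100 SNPs over 1 Mbp of covered CDS positions
print(snp_density_per_mbp(100, 1_000_000))  # → 100.0
```

The denominator is covered positions only, not the total region length, so uneven coverage between the two call sets shouldn't bias the comparison by itself.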
Biological expectation: a higher density of variants across the whole genome than in the CDS alone. Result: the opposite (though the difference is small).
I thought this might be because I work with a non-model organism whose genome assembly is of low quality. So I repeated the pipeline with A. thaliana, and there the whole genome did show a higher variant density than the CDS, as expected.
Do you know what could have generated this odd result? Is genome assembly quality your best guess for the culprit? I am running out of options to test ^^