Broad rules of thumb on # of variations
4.9 years ago
andrewl ▴ 10

I'm looking for some broad rules of thumb for how many variations to expect for WGS versus WES.

The idea is this: if I am given a VCF of unknown origin and coverage, what does the number of rows in the VCF tell me about whether it was derived from a WES or a WGS sequence?

Or are there any other signals to look for to be able to easily determine this?

variations DNA
Well, you could check whether most of the variants fall in exonic regions or are spread all over the genome irrespective of exons. That could already give you an idea. Download a GTF for your organism, extract the exon regions, and intersect your variants with those exons.
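A minimal sketch of that intersection in plain Python, in case it helps. The function names `load_exon_intervals` and `exonic_fraction` are made up for illustration, and the sketch assumes the GTF and the VCF use the same chromosome naming (e.g. both "chr1" or both "1") and that the VCF lines are tab-separated and well formed. A dedicated interval tool would do the same job on real data.

```python
import bisect
from collections import defaultdict


def load_exon_intervals(gtf_lines):
    """Collect exon intervals per chromosome from GTF lines and merge overlaps."""
    raw = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[2] != "exon":
            continue
        # GTF coordinates are 1-based and inclusive, like VCF POS.
        raw[fields[0]].append((int(fields[3]), int(fields[4])))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1] + 1:      # overlapping or adjacent exon
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        # Store parallel lists of starts and ends for binary search.
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def exonic_fraction(vcf_lines, exons):
    """Return the fraction of VCF records whose POS lies inside a merged exon."""
    total = inside = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        total += 1
        starts, ends = exons.get(chrom, ([], []))
        i = bisect.bisect_right(starts, pos) - 1   # last exon starting at or before pos
        if i >= 0 and pos <= ends[i]:
            inside += 1
    return inside / total if total else 0.0
```

With real files you would pass open file handles, e.g. `exonic_fraction(open("sample.vcf"), load_exon_intervals(open("genes.gtf")))`, where both file names are placeholders. A fraction close to 1 would point towards WES (or a WGS file already filtered to coding regions); a small fraction would point towards WGS.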

some broad rules of thumb for how many variations to expect for WGS versus WES

The number pretty much depends on the depth but one can expect more variants from WGS than WES.

I'm looking for something that doesn't require actually looking at the variants. The few WGS VCFs I have seen tend to have 4-5M rows, so I would have thought that WES, which reads ~1% of the genome, would have on average around 40-50K variations. Can I not just count rows and draw a conclusion based on whether there are closer to 50K rows or 5M rows? Is there a flaw in this thinking?

WES may have sufficient depth outside the baited areas to call variants. So, if you just restrict to variants within baited areas, the density should be similar. But generally, if I had a bunch of WES and WGS VCF files, I'd expect the WES ones to be much smaller. Maybe on the order of 1/100th the size. I'd be surprised to see one even 1/10th the size of a WGS VCF.

4.9 years ago
poisonAlien ★ 3.1k

It depends. If your WGS VCF file is unfiltered and contains all raw variants, you would expect at least 2 million variants. But if it has already been filtered to contain only variants within the coding part of the genome, the number of variants should be about equal to that of WXS.

Regarding WXS, it again depends on the source. For example, in cancers, liquid tumors (e.g., leukemia) have far fewer mutations (on average 50K, unfiltered), whereas solid tumors (e.g., esophageal or liver) would have around 250,000.

But I think it's safe to say that if your VCF has over a million variants, it's probably from WGS.
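As a rough sketch of that row-count check: the helpers below are hypothetical, and the only number taken from this thread is the one-million cutoff. Anything below the cutoff is reported as inconclusive, since a filtered WGS file can be as small as a WXS one.

```python
import gzip


def count_variant_rows(lines):
    """Count VCF data rows, skipping '#' header lines and blank lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))


def count_variant_rows_in_file(path):
    """Same count for a file on disk; handles plain or gzip-compressed VCFs."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_variant_rows(handle)


def guess_assay(n_rows, wgs_threshold=1_000_000):
    """Apply the rule of thumb: more than ~1M rows suggests WGS."""
    if n_rows > wgs_threshold:
        return "probably WGS"
    return "inconclusive from row count alone"
```

For example, `guess_assay(count_variant_rows_in_file("sample.vcf.gz"))`, where the file name is a placeholder.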

4.9 years ago
H.Hasani ▴ 990

I think it is quite problematic to think of SNPs as a number; annotating the variants, or computing a basic overlap with the annotation, is the scientifically sound way to do it. The numbers depend on many factors, as everybody has pointed out. I'm afraid this way of thinking will consume your time, which would be better spent finding solid evidence of your file's origin.