Question: Broad rules of thumb on # of variations
andrewl10 wrote:

I'm looking for some broad rules of thumb for how many variations to expect for WGS versus WES.

The idea is that if I am given a VCF of unknown origin and coverage, what can the number of rows in the VCF tell me about whether it was derived from WES or WGS data?

Or are there any other signals to look for to be able to easily determine this?

modified 4 months ago by H.Hasani580 • written 4 months ago by andrewl10

Well, you could check whether most of the variants fall in exonic regions or are spread across the genome irrespective of exons. That alone should give you an idea. Download a GTF annotation for your organism, extract the exon regions, and intersect your variants with them.
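
A minimal sketch of that intersection in plain Python, in case it helps. The function names `load_exons` and `exonic_fraction` are made up for illustration, and the sketch assumes a standard tab-separated GTF (feature type in column 3, 1-based inclusive start and end in columns 4 and 5) and a VCF with CHROM and POS in the first two columns. It only tests the POS coordinate, so long indels that merely overlap an exon edge are not counted.

```python
import bisect
from collections import defaultdict


def load_exons(gtf_lines):
    """Return {chrom: (starts, ends)} of merged exon intervals from GTF lines."""
    raw = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Keep only rows whose feature type (column 3) is "exon".
        if len(fields) < 5 or fields[2] != "exon":
            continue
        raw[fields[0]].append((int(fields[3]), int(fields[4])))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1] + 1:
                # Overlapping or adjacent exon: extend the current interval.
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[chrom] = ([s for s, _ in out], [e for _, e in out])
    return merged


def exonic_fraction(vcf_lines, exons):
    """Fraction of VCF records whose POS lies inside a merged exon interval."""
    total = hits = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos = line.split("\t", 2)[:2]
        pos = int(pos)
        total += 1
        starts, ends = exons.get(chrom, ([], []))
        # Index of the last interval starting at or before pos.
        i = bisect.bisect_right(starts, pos) - 1
        if i >= 0 and pos <= ends[i]:
            hits += 1
    return hits / total if total else 0.0
```

Both functions take any iterable of lines, so open file handles work, e.g. `exonic_fraction(open("sample.vcf"), load_exons(open("genes.gtf")))` with hypothetical file names. Chromosome names must match between the two files ("chr1" versus "1"), otherwise nothing will intersect.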

"some broad rules of thumb for how many variations to expect for WGS versus WES"

The number depends largely on the sequencing depth, but one can expect more variants from WGS than from WES.

modified 4 months ago • written 4 months ago by venu4.3k

I'm looking for something that doesn't require actually looking at the variants. The few WGS VCFs I have seen tend to have 4-5M rows, so I would have thought that WES, which reads ~1% of the genome, would yield on average around 40-50K variants. Can I not just count rows and draw a conclusion based on whether the count is closer to 50K or to 5M? Is there a flaw in this thinking?
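
For concreteness, the row count and the back-of-the-envelope scaling could be sketched like this. The names `count_vcf_records` and `expected_wes_rows` are hypothetical, and the 1% targeted fraction is just the rough figure from this post, not a measured value.

```python
import gzip


def count_vcf_records(path):
    """Count variant rows (non-header, non-blank lines) in a VCF file.

    Files ending in .gz are read through gzip, which also handles
    bgzip-compressed VCFs.
    """
    opener = gzip.open if path.endswith(".gz") else open
    n = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                n += 1
    return n


def expected_wes_rows(wgs_rows, targeted_fraction=0.01):
    """Naive estimate: scale a WGS row count by the fraction of genome targeted.

    targeted_fraction=0.01 is the ~1% assumption from the post above.
    """
    return wgs_rows * targeted_fraction
```

With 4-5M WGS rows this scaling gives the 40-50K figure mentioned above. It assumes variants are spread evenly across the genome and that nothing is called outside the targets.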

written 4 months ago by andrewl10

WES may have sufficient depth outside the baited areas to call variants. So, if you just restrict to variants within baited areas, the density should be similar. But generally, if I had a bunch of WES and WGS VCF files, I'd expect the WES ones to be much smaller. Maybe on the order of 1/100th the size. I'd be surprised to see one even 1/10th the size of a WGS VCF.

modified 4 months ago • written 4 months ago by Brian Bushnell14k
poisonAlien2.4k wrote:

It depends. If your WGS VCF file is unfiltered and contains all raw variants, you would expect at least 2 million variants. But if it has already been filtered to contain only variants within the coding part of the genome, the number of variants should be similar to that of WXS.

Regarding WXS, it again depends on the source. For example, in cancers, liquid tumors (e.g., leukemia) have far fewer mutations (on average 50K, unfiltered), whereas solid tumors (e.g., esophageal or liver) would have around 250,000.

But I think it's safe to say that if your VCF has over a million variants, it's probably from WGS.
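
As a rough sketch, that rule of thumb could be written as a small function. The name `guess_assay` and the exact cut-offs are assumptions for illustration: the 1M threshold is the one stated above, and 250,000 is simply the upper WXS figure quoted for solid tumors, not a validated boundary.

```python
def guess_assay(n_rows, wgs_min=1_000_000, wes_max=250_000):
    """Heuristic guess at the assay behind a VCF, based only on its row count.

    wgs_min and wes_max are illustrative defaults taken from the numbers
    in this answer; they are assumptions, not established limits.
    """
    if n_rows > wgs_min:
        return "probably WGS"
    if n_rows <= wes_max:
        # A WGS VCF filtered down to coding regions would land here too.
        return "consistent with WXS or a coding-filtered WGS"
    return "ambiguous"
```

Counts between the two cut-offs are deliberately left as "ambiguous", since depth, filtering, and off-target calls can all shift the total.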

written 4 months ago by poisonAlien2.4k
H.Hasani580 wrote:

I think it is quite problematic to reduce SNPs to a count; annotating the variants, or computing a basic overlap with the annotation, is the scientifically sound way to do it. The numbers depend on many factors, as everybody has pointed out. I'm afraid this way of thinking will consume your time, which would be better spent coming up with solid evidence of your file's origin.

modified 4 months ago • written 4 months ago by H.Hasani580