Question: reducing coverage of a vcf file
9 months ago, ucbtsm8 wrote:

I have a VCF file which contains information on allele depth (i.e. the number of reads that map to the reference and alternate alleles):

1   752566  .   G   A   68  .   GG=0,69,68,69,849,556,849,490,556,849;DP=274;AC=1;AN=2  GT:AD:DP:GQ:PL:GG   0/1:1,2:3:19:41,0,19:19,25,0,25,95,50,95,41,50,95

I was wondering whether there is a way (using bcftools, for example, rather than a home-made script) to reduce the VCF to a given coverage by removing ref and alt reads, i.e. to take a file with a mean coverage of 40x and reduce the mean coverage to 3x. I would want the PL scores (in the FORMAT field of the record above) to be adjusted accordingly, which is why I'd rather have something like bcftools do it than a home-made script.

EDIT: just to say, I don't have access to the original SAM/BAM files, so this has to be done on the VCF.

Thanks.

Tags: snp

Just a thought: use Picard DownsampleSam to randomly retain 5%, 10%, or 20% of the reads at the BAM stage, and then re-call variants.

written 9 months ago by Kevin Blighe

Hi, unfortunately I don't have access to the original BAM files; otherwise this would be a good idea. Thanks.

written 9 months ago by ucbtsm8

When you go from BAM to VCF, you lose a lot of information. I am not sure how you could simply downsample at the VCF stage: the VCF no longer carries the individual reads, only per-site summaries such as AD, DP, and PL.

written 9 months ago by Kevin Blighe

Why do you want to do this?

written 9 months ago by Emily_Ensembl

The idea is to take a high-coverage individual, downsample it so that some SNPs are no longer covered by any reads, impute these missing markers, and then compare the imputed calls to the original full-coverage calls in order to test the accuracy of imputation.

written 9 months ago by ucbtsm8
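
For readers who want to see what thinning at the VCF level would involve, here is a minimal Python sketch that works on the AD counts of a single site. It is an illustration under stated assumptions, not an existing tool: the helper names `thin_depths` and `naive_pl` are made up, each read is kept independently with probability target depth / current depth, and the PL values come from a simple binomial model with an assumed fixed error rate of 0.01. Because the VCF holds no base qualities, these PLs will not match what the original caller would have produced from downsampled reads.

```python
import math
import random


def thin_depths(ref_reads, alt_reads, keep_prob, rng):
    """Keep each ref/alt read independently with probability keep_prob."""
    kept_ref = sum(rng.random() < keep_prob for _ in range(ref_reads))
    kept_alt = sum(rng.random() < keep_prob for _ in range(alt_reads))
    return kept_ref, kept_alt


def naive_pl(ref_reads, alt_reads, error_rate=0.01):
    """Phred-scaled likelihoods for 0/0, 0/1, 1/1 (assumed fixed error rate)."""
    log_liks = []
    for alt_copies in (0, 1, 2):
        frac = alt_copies / 2
        # Probability that a single read shows the alt allele for this genotype.
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        log_liks.append(
            ref_reads * math.log10(1 - p_alt) + alt_reads * math.log10(p_alt)
        )
    best = max(log_liks)
    # Normalise so the most likely genotype has PL 0, as in the VCF convention.
    return [round(-10 * (ll - best)) for ll in log_liks]


rng = random.Random(42)      # fixed seed so the example is repeatable
keep_prob = 3 / 40           # target mean depth / current mean depth
ref, alt = thin_depths(22, 18, keep_prob, rng)  # made-up AD of 22,18
print(ref, alt, naive_pl(ref, alt))
```

A site that ends up with zero reads gets PL 0,0,0 under this model, which corresponds to the "not covered by any reads" case that would then be left for imputation.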

I am not sure that you can do this with just the VCF...

You may try a different approach, like this:

  1. select random variants from your VCF and set all non-selected variants to the missing genotype ./.
  2. impute the missing variants against, for example, 1000 Genomes
  3. compare the imputed variants to the originals
  4. repeat Steps 1-3 many times (10, 20, 50 iterations, etc.), each with a different random selection
written 9 months ago by Kevin Blighe
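
The masking and comparison steps suggested above can be sketched in Python as follows. This is a rough outline only: `mask_genotypes`, `concordance`, and `placeholder_impute` are hypothetical names, the genotypes are toy data, and `placeholder_impute` simply fills every gap with 0/0 so that the loop runs. In practice that function would be replaced by a call to a real imputation tool and reference panel.

```python
import random

MISSING = "./."


def mask_genotypes(genotypes, keep_fraction, rng):
    """Step 1: keep a random subset of sites, set the rest to missing."""
    n_keep = round(len(genotypes) * keep_fraction)
    kept = set(rng.sample(range(len(genotypes)), n_keep))
    masked = [gt if i in kept else MISSING for i, gt in enumerate(genotypes)]
    hidden = [i for i in range(len(genotypes)) if i not in kept]
    return masked, hidden


def placeholder_impute(masked):
    """Step 2 stand-in (hypothetical): fill every missing call with 0/0."""
    return ["0/0" if gt == MISSING else gt for gt in masked]


def concordance(truth, imputed, hidden):
    """Step 3: fraction of hidden sites where the imputed call matches."""
    if not hidden:
        return float("nan")
    matches = sum(truth[i] == imputed[i] for i in hidden)
    return matches / len(hidden)


# Toy genotypes standing in for the original high-coverage calls.
truth = ["0/0", "0/1", "1/1", "0/0", "0/1", "0/0", "0/0", "1/1"]

rng = random.Random(1)
scores = []
for _ in range(10):  # Step 4: repeat with a fresh random selection each time
    masked, hidden = mask_genotypes(truth, 0.5, rng)
    scores.append(concordance(truth, placeholder_impute(masked), hidden))

print(sum(scores) / len(scores))
```

The comparison here is a plain string match, so phased calls or allele order (0/1 versus 1/0) would need normalising before real use.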