reducing coverage of a vcf file
5.2 years ago
ucbtsm8 ▴ 20

I have a vcf file which contains information on allele depth (i.e. number of reads which map to the reference and alternate allele):

1   752566  .   G   A   68  .   GG=0,69,68,69,849,556,849,490,556,849;DP=274;AC=1;AN=2  GT:AD:DP:GQ:PL:GG   0/1:1,2:3:19:41,0,19:19,25,0,25,95,50,95,41,50,95

I was wondering whether there is a way (using bcftools, for example, rather than some home-made script) to reduce the coverage of the VCF to a target level by removing ref and alt reads. I.e., take a file which has a mean coverage of 40x and reduce the mean coverage to 3x. Obviously I want the PL scores in the sample (FORMAT) fields to be adjusted accordingly (hence why I'd rather something like bcftools does it, rather than a home-made script).

EDIT: just to say, I don't have access to the original SAM/BAM files, so the action has to be done on the vcf.

Thanks.


Just a thought: use Picard DownsampleSam to randomly retain 5%, 10%, or 20% of the reads at the BAM stage, and then re-call variants.


Hi, unfortunately I don't have access to the original BAM files, otherwise this would be a good idea! Thanks.


When you go from BAM to VCF, you lose a lot of information. I am not sure how you can simply downsample from the VCF stage - you have no core information on the reads.


Why do you want to do this?


The idea is to take a high-coverage individual, downsample it so that there are some SNPs which aren't covered by any reads, impute these missing markers, and then compare the imputed calls to the original, fully sampled calls in order to test the accuracy of imputation.
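To make the "downsample until some SNPs have no reads" idea concrete, here is a rough Python sketch that thins the AD counts of one site. It assumes each read survives independently with probability target/original coverage (3/40 in the example from the question); the function name `thin_allele_depths` is made up for illustration. It only touches the read counts and does not recompute PL or GQ, since that would need per-read information that is not in the VCF.

```python
import random

def thin_allele_depths(ad, keep_prob, rng):
    """Binomially thin per-allele read counts (illustrative helper).

    ad        -- list of read counts, e.g. [ref_reads, alt_reads] from AD
    keep_prob -- chance that any single read survives (e.g. 3 / 40)
    rng       -- a random.Random instance, so results are reproducible
    """
    # Each of the n reads for an allele is kept independently with keep_prob.
    return [sum(1 for _ in range(n) if rng.random() < keep_prob) for n in ad]

# Example: a site with 22 ref and 18 alt reads, thinned from ~40x towards ~3x.
rng = random.Random(42)
thinned = thin_allele_depths([22, 18], keep_prob=3 / 40, rng=rng)

# A site whose thinned counts sum to zero would be treated as uncovered
# (genotype ./.) and left for imputation.
covered = sum(thinned) > 0
```

Sites where `covered` is false are the ones that would be handed to the imputation step.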


I am not sure that you can do this with just the VCF...

You may try a different approach, like this:

  1. select random variants from your VCF and set all non-selected variants to the missing genotype ./.
  2. impute the missing variants against, for example, 1000 Genomes
  3. compare the imputed variants to the originals
  4. repeat Steps 1-3 with a fresh random selection each time (10, 20, 50 repetitions, etc.)
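
Step 1 above could be sketched in plain Python roughly as follows. This is only a minimal illustration, not a bcftools feature: the function name `mask_random_genotypes` and its `keep_fraction` parameter are hypothetical, and it assumes a tab-delimited VCF with a FORMAT column that contains GT.

```python
import random

def mask_random_genotypes(vcf_lines, keep_fraction, seed=0):
    """Keep a random subset of records; set GT to './.' in all the others.

    vcf_lines     -- iterable of VCF lines (header and records)
    keep_fraction -- approximate fraction of records left untouched
    seed          -- seed for the random generator, for reproducibility
    """
    rng = random.Random(seed)
    out = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        # Header lines and blank lines pass through unchanged.
        if line.startswith("#") or not line.strip():
            out.append(line)
            continue
        fields = line.split("\t")
        fmt_keys = fields[8].split(":")
        # Records that are not selected get a missing genotype.
        if rng.random() >= keep_fraction and "GT" in fmt_keys:
            gt_idx = fmt_keys.index("GT")
            for i in range(9, len(fields)):
                values = fields[i].split(":")
                values[gt_idx] = "./."
                fields[i] = ":".join(values)
        out.append("\t".join(fields))
    return out

# Example: keep roughly 10% of records, mask the rest.
record = "1\t752566\t.\tG\tA\t68\t.\tDP=274\tGT:AD:DP\t0/1:1,2:3"
masked = mask_random_genotypes([record], keep_fraction=0.1, seed=1)
```

Note that this only blanks GT and leaves AD, DP, PL, etc. in place; depending on the imputation tool, you may want to blank those as well. Running it with different `seed` values gives the repeated random selections for Step 4.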