Question

Reducing coverage of a VCF file

5.2 years ago

ucbtsm8 ▴ 20

I have a VCF file which contains information on allele depth (i.e. the number of reads which support the reference and alternate alleles):

1   752566  .   G   A   68  .   GG=0,69,68,69,849,556,849,490,556,849;DP=274;AC=1;AN=2  GT:AD:DP:GQ:PL:GG   0/1:1,2:3:19:41,0,19:19,25,0,25,95,50,95,41,50,95

I was wondering whether there is a way (using bcftools, for example, rather than some home-made script) to reduce the coverage of the VCF to a given level by removing ref and alt reads, i.e. take a file with a mean coverage of 40x and reduce the mean coverage to 3x. Obviously I want the PL scores in the FORMAT field to be adjusted accordingly (which is why I'd rather have something like bcftools do it than a home-made script).

EDIT: just to say, I don't have access to the original SAM/BAM files, so this has to be done on the VCF.

Thanks.

snp • 1.6k views

ADD COMMENT • link 5.2 years ago by ucbtsm8 ▴ 20

Just a thought: use Picard DownsampleSam to select out 5%, 10%, 20% random reads at the BAM stage, and then re-call variants.

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

Hi, unfortunately I don't have access to the original BAM files, otherwise this would be a good idea! Thanks.

ADD REPLY • link 5.2 years ago by ucbtsm8 ▴ 20

When you go from BAM to VCF, you lose a lot of information. I am not sure how you could simply downsample at the VCF stage: the file keeps only per-site summaries such as AD, DP and PL, not the reads themselves.

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k

Why do you want to do this?

ADD REPLY • link 5.2 years ago by Emily 23k

The idea is to take a high-coverage individual, downsample it so that some SNPs are not covered by any reads, impute these missing markers, and then compare the imputed calls to the original, fully sampled calls in order to test the accuracy of imputation.
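For illustration only, the downsampling described above could be approximated directly on the AD counts. The sketch below is not a bcftools feature and does not reproduce any variant caller's likelihood model: it assumes a biallelic site, independent reads, and a fixed per-read error rate (`err`, a made-up default). The function names `thin_reads`, `phred_likelihoods` and `downsample_sample` are hypothetical. Each read is kept with probability target/mean coverage, and PLs are recomputed from the surviving counts under a simple binomial model.

```python
import math
import random


def thin_reads(ad, keep_prob, rng):
    """Keep each read independently with probability keep_prob (binomial thinning)."""
    return [sum(rng.random() < keep_prob for _ in range(n)) for n in ad]


def phred_likelihoods(ref_n, alt_n, err=0.01):
    """Normalised PLs for genotypes 0/0, 0/1, 1/1 under a simple binomial model.

    err is an assumed per-read error rate, not a value taken from the VCF.
    """
    p_alt = [err, 0.5, 1.0 - err]  # chance a read shows ALT under each genotype
    log10_l = [
        ref_n * math.log10(1.0 - p) + alt_n * math.log10(p) for p in p_alt
    ]
    best = max(log10_l)
    # Phred-scale relative to the most likely genotype, so the best PL is 0.
    return [int(round(-10.0 * (l - best))) for l in log10_l]


def downsample_sample(ad, target_cov, mean_cov, seed=0):
    """Thin biallelic AD counts and recompute PLs; PL is None if no reads survive."""
    rng = random.Random(seed)
    new_ad = thin_reads(ad, target_cov / mean_cov, rng)
    if sum(new_ad) == 0:
        return new_ad, None  # site is uncovered: treat the genotype as missing
    return new_ad, phred_likelihoods(*new_ad)
```

Sites where `downsample_sample` returns `None` for the PLs are the ones that would be left for imputation. Because the model here is much simpler than a real caller's, the recomputed PLs will generally differ from what re-calling on downsampled reads would give.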

ADD REPLY • link 5.2 years ago by ucbtsm8 ▴ 20

I am not sure that you can do this with just the VCF...

You may try a different approach, like this:

1. select random variants from your VCF and set all non-selected variants to the missing genotype ./.
2. impute the missing variants against, for example, 1000 Genomes
3. compare the imputed variants to the originals
4. bootstrap Steps 1-3 10x, 20x, 50x, etc.
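A minimal sketch of the masking and comparison steps, assuming a single-sample, tab-delimited VCF read into a list of lines. The helper names `mask_genotypes` and `concordance` are hypothetical, and the imputation itself (done with an external tool) is not shown. All FORMAT subfields of a masked record are blanked, not just GT, so that PL or AD cannot leak the original genotype to the imputation tool.

```python
import random
import re


def mask_genotypes(vcf_lines, fraction, seed=0):
    """Set a random fraction of records to missing.

    Returns (masked_lines, truth), where truth maps (CHROM, POS) to the
    original GT string. Assumes one sample in column 10.
    """
    rng = random.Random(seed)
    masked, truth = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            masked.append(line)  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        keys = fields[8].split(":")
        if "GT" in keys and rng.random() < fraction:
            sample = fields[9].split(":")
            truth[(fields[0], fields[1])] = sample[keys.index("GT")]
            # Blank every subfield so no per-site evidence is left behind.
            fields[9] = ":".join("./." if k == "GT" else "." for k in keys)
        masked.append("\t".join(fields))
    return masked, truth


def _alleles(gt):
    """Order-independent allele list, ignoring phasing (0|1 equals 1/0)."""
    return sorted(re.split(r"[/|]", gt))


def concordance(truth, imputed):
    """Fraction of masked sites whose imputed GT matches the original.

    Sites absent from `imputed` count as discordant; returns None if
    nothing was masked.
    """
    if not truth:
        return None
    hits = sum(
        1
        for site, gt in truth.items()
        if site in imputed and _alleles(imputed[site]) == _alleles(gt)
    )
    return hits / len(truth)
```

Repeating the procedure with a different `seed` each time gives the bootstrap replicates from Step 4.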

ADD REPLY • link 5.2 years ago by Kevin Blighe 87k