Question

> 90% samples, DP > 10

0

Entering edit mode

2.9 years ago

Filago ▴ 100

Hello,

I have a number of single-sample-VCF-files and also got for each of them a bed file, consisting of the regions with DP > 10.

What I now want to get is a new bed file, which consists only of regions where 90% of the samples have DP > 10 based on those single bed files. So basically I want a bed file consisting of regions where at least 90% of the primary bed files overlap.

I already know how to work with Granges in R, however I do not know, how to solve the "90%"-thing.

Any suggestions?

Andreas

VCF R DP • 1.1k views

ADD COMMENT • link updated 2.9 years ago by sbstevenlee ▴ 480 • written 2.9 years ago by Filago ▴ 100

0

Entering edit mode

You may want to explore the filter options in bcftools view - that would mean filtering the VCF files on the command line and not in R.

ADD REPLY • link 2.9 years ago by Ram 43k

score 0 · Answer 1 · 2021-05-31

If you have GRanges objects representing regions for each sample, you could call coverage on each of these and then since coverage objects can be added together (cov3 = cov1 + cov2) you could simply add the coverages (you could do this in a loop). After adding all the coverages, you can then simply slice() the coverage object at a depth equal to 90% of your samples. For instance, if you have 10 samples, slice at a depth of 9 to find regions covered by at least 9 samples (90% of your samples). In case individual samples have overlapping regions, you could first reduce them prior to taking coverage(), so that any given sample will only produce a depth of 1 on the genome.

score 0 · Answer 2 · 2021-06-01

0

Entering edit mode

2.9 years ago

Pierre Lindenbaum 161k

convert each VCF file to bed with bcftools query -f '%CHROM\t%POS0\t%END\n'

use bedtools multiinter to get the overlapping intervals and filter the output with awk for the '90%'.

ADD COMMENT • link 2.9 years ago by Pierre Lindenbaum 161k

score 0 · Answer 3 · 2021-06-03

I'm not sure if I fully understood your situation correctly, but you may want to check out the pyvcf submodule I wrote if you are familiar with Python:

from fuc import pyvcf

vcf_files = ['sample1.vcf', 'sample2.vcf', ..., 'sampleN.vcf']
bed_files = ['sample1.bed', 'sample2.bed', ..., 'sampleN.bed']

vf_list = []

for i, vcf_file in enumerate(vcf_files):
    vf = pyvcf.VcfFrame.from_file(vcf_file) # Create a VcfFrame object for each sample.
    filtered_vf = vf.filter_bed(bed_files[i]) # Filter each VcfFrame by the sample's BED file.
    vf_list.append(filtered_vf)

merged_vf = pyvcf.merge(vf_list, how='outer') # Merge all the VcfFrame objects.
final_vf = merged_vf.filter_sampnum(0.9) # Select rows where 90% of the samples have the variant.
bf = final_vf.to_bed() # This creates a BedFrame object.
bf.to_file('final.bed')