> 90% samples, DP > 10
3
0
Entering edit mode
4 months ago

Hello,

I have a number of single-sample-VCF-files and also got for each of them a bed file, consisting of the regions with DP > 10.

What I now want to get is a new bed file, which consists only of regions where 90% of the samples have DP > 10 based on those single bed files. So basically I want a bed file consisting of regions where at least 90% of the primary bed files overlap.

I already know how to work with Granges in R, however I do not know, how to solve the "90%"-thing.

Any suggestions?

Andreas

VCF R DP • 387 views
ADD COMMENT
0
Entering edit mode

You may want to explore the filter options in bcftools view - that would mean filtering the VCF files on the command line and not in R.

ADD REPLY
0
Entering edit mode
4 months ago
seidel 8.3k

If you have GRanges objects representing regions for each sample, you could call coverage on each of these and then since coverage objects can be added together (cov3 = cov1 + cov2) you could simply add the coverages (you could do this in a loop). After adding all the coverages, you can then simply slice() the coverage object at a depth equal to 90% of your samples. For instance, if you have 10 samples, slice at a depth of 9 to find regions covered by at least 9 samples (90% of your samples). In case individual samples have overlapping regions, you could first reduce them prior to taking coverage(), so that any given sample will only produce a depth of 1 on the genome.

ADD COMMENT
0
Entering edit mode
4 months ago

convert each VCF file to bed with bcftools query -f '%CHROM\t%POS0\t%END\n'

use bedtools multiinter to get the overlapping intervals and filter the output with awk for the '90%'.

ADD COMMENT
0
Entering edit mode
4 months ago
sbstevenlee ▴ 230

I'm not sure if I fully understood your situation correctly, but you may want to check out the pyvcf submodule I wrote if you are familiar with Python:

from fuc import pyvcf

vcf_files = ['sample1.vcf', 'sample2.vcf', ..., 'sampleN.vcf']
bed_files = ['sample1.bed', 'sample2.bed', ..., 'sampleN.bed']

vf_list = []

for i, vcf_file in enumerate(vcf_files):
    vf = pyvcf.VcfFrame.from_file(vcf_file) # Create a VcfFrame object for each sample.
    filtered_vf = vf.filter_bed(bed_files[i]) # Filter each VcfFrame by the sample's BED file.
    vf_list.append(filtered_vf)

merged_vf = pyvcf.merge(vf_list, how='outer') # Merge all the VcfFrame objects.
final_vf = merged_vf.filter_sampnum(0.9) # Select rows where 90% of the samples have the variant.
bf = final_vf.to_bed() # This creates a BedFrame object.
bf.to_file('final.bed')
ADD COMMENT

Login before adding your answer.

Traffic: 2016 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6