Extracting intervals shared by fewer than 100% of samples from WGS data
3 months ago
FL512 • 0

I am trying to work out how to extract intervals that are shared among samples in WGS data at an arbitrary percentage of samples.

According to the bedtools documentation, overlapping intervals can be extracted with bedtools intersect: https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html This is very useful and works well for me when I have only a few samples.

However, I am analyzing several hundred samples, and with that many no overlapping interval is detected. This is understandable: if, say, 99 samples have a T/A variant at Chr1 position 1 but 1 sample does not, there is no interval shared by all samples. To get around this, I would like to extract variants that overlap in more than 99% of samples, then 95%, 90%, or even less, until I find overlapping intervals.

Does anyone know how to do this, or could you point me to helpful websites? Or could GATK SelectVariants do it?

Thank you!

WGS bedtools GATK • 156 views
3 months ago

Filter on samtools depth to build a BED file of the positions covered in at least 90% of the samples, then use that BED to filter the VCF:

samtools depth S*.bam | awk '{N=0;for(i=3;i<=NF;i++) {if(int($i)>0) N+=1.0;} if((N/(NF-2)) >= 0.9) printf("%s\t%d\t%s\n",$1,int($2)-1,$2);}' | bedtools merge

RF01    10  3295
RF02    20  2668
RF03    9   2585
RF04    21  2352
RF05    15  1565
RF06    31  1348
RF07    12  1063
RF08    8   1056
RF09    11  1036
RF10    6   340
RF10    397 741
RF11    2   272
RF11    390 663
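
To try the 99%, 95%, 90% thresholds from the question, the same per-position test can take the fraction as a parameter. The following is only a sketch of that idea: the function name min_frac_bed, the toy input, and the file names in the comments are made up for illustration, and it assumes the input has the samtools depth layout (chromosome, 1-based position, then one depth column per BAM).

```shell
# min_frac_bed FRACTION
# Reads `samtools depth`-style lines on stdin and prints one BED record
# (0-based start, 1-based end) for every position where at least FRACTION
# of the samples have depth > 0. Hypothetical helper, not part of any tool.
min_frac_bed() {
    awk -v frac="$1" 'BEGIN { OFS = "\t" }
    NF > 2 {
        n = 0
        # columns 3..NF hold one depth value per sample
        for (i = 3; i <= NF; i++) if ($i + 0 > 0) n++
        if (n / (NF - 2) >= frac) print $1, $2 - 1, $2
    }'
}

# Toy input: 3 positions, 4 samples. With a 0.75 threshold, positions
# 100 (3 of 4 covered) and 101 (4 of 4) pass; 102 (1 of 4) does not.
printf 'chr1\t100\t5\t3\t0\t7\nchr1\t101\t5\t3\t2\t7\nchr1\t102\t0\t0\t0\t7\n' |
    min_frac_bed 0.75

# On real data (file names are placeholders), something like:
#   samtools depth S*.bam | min_frac_bed 0.95 | bedtools merge > shared.bed
#   bedtools intersect -header -u -a variants.vcf -b shared.bed > shared.vcf
```

The toy run prints two BED lines, chr1 99 100 and chr1 100 101, which bedtools merge would then collapse into a single interval. Note that this measures coverage per position, not whether each sample actually carries the variant.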


Dear Pierre, Thank you very much for your quick & kind response. I appreciate it. I will give it a try tonight and let you know the results.