3 months ago
FL512 • 0

I am trying to work out how to extract intervals shared across samples from WGS data, using an arbitrary percentage of samples as the threshold.

According to the bedtools documentation, overlapping intervals can be extracted: https://bedtools.readthedocs.io/en/latest/content/tools/intersect.html This is very useful and works well for me when I have only a few samples.

However, I am analyzing several hundred samples, and the result is that no overlapping interval is detected. This is understandable: if, say, 99 samples have a T/A variant at Chr1 position 1 but 1 sample does not, there is no shared overlap interval. To get around this, I would like to extract variants that overlap in more than 99% of samples, then 95%, 90% or even less, until I can find overlapping intervals.

Does anyone know how to do this, or could you point me to some helpful websites? Or could GATK SelectVariants do it?

Thank you!

3 months ago

Filter on samtools depth to build a BED file (here, positions where at least 90% of the BAMs have depth > 0), and then use that BED to filter the VCF:

samtools depth S*.bam | awk '{N=0;for(i=3;i<=NF;i++) {if(int($i)>0) N+=1.0;} if((N/(NF-2)) >= 0.9) printf("%s\t%d\t%s\n",$1,int($2)-1,$2);}' | bedtools merge

RF01    10  3295
RF02    20  2668
RF03    9   2585
RF04    21  2352
RF05    15  1565
RF06    31  1348
RF07    12  1063
RF08    8   1056
RF09    11  1036
RF10    6   340
RF10    397 741
RF11    2   272
RF11    390 663
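
To try other cut-offs (99%, 95%, 90%, ...) the 0.9 in the command above can be turned into a parameter. The following is only a minimal sketch of that idea, not a tested pipeline: `depth_table` is a made-up toy table standing in for real `samtools depth` output from four BAMs, `covered_bed` is a hypothetical helper name, and the merging of adjacent positions is done in awk here instead of `bedtools merge` so that the example runs on its own.

```shell
#!/bin/sh
# Toy stand-in for the output of `samtools depth` on four BAM files:
# chrom, 1-based position, then one depth column per sample (hypothetical data).
depth_table() {
  printf 'chr1\t1\t5\t3\t0\t2\n'
  printf 'chr1\t2\t4\t2\t1\t2\n'
  printf 'chr1\t3\t1\t1\t0\t1\n'
  printf 'chr1\t4\t0\t0\t0\t1\n'
  printf 'chr1\t5\t2\t0\t3\t0\n'
  printf 'chr2\t1\t1\t1\t1\t0\n'
}

# covered_bed MIN_FRACTION
# Reads a depth table on stdin, keeps positions where at least MIN_FRACTION
# of the samples have depth > 0, and merges adjacent kept positions into
# 0-based, half-open BED intervals (the role `bedtools merge` plays above).
covered_bed() {
  awk -v min="$1" 'BEGIN { FS = OFS = "\t" }
  {
    n = 0
    for (i = 3; i <= NF; i++) if ($i + 0 > 0) n++
    if (n / (NF - 2) < min) next                 # too few samples covered
    if ($1 == chrom && $2 - 1 == end) {          # directly extends the open interval
      end = $2
      next
    }
    if (chrom != "") print chrom, start, end     # close the previous interval
    chrom = $1; start = $2 - 1; end = $2
  }
  END { if (chrom != "") print chrom, start, end }'
}

# Relax the threshold step by step and look at what remains each time.
for frac in 1.0 0.75 0.5; do
  printf '# min fraction %s\n' "$frac"
  depth_table | covered_bed "$frac"
done
```

With real data, the `depth_table` call would be replaced by `samtools depth S*.bam`. Note that this measures read coverage across samples, not whether the samples share the same genotype. The resulting BED could then be used to restrict the VCF, for example with `bedtools intersect -header -a calls.vcf -b shared.bed` (the file names are placeholders).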

Dear Pierre, thank you very much for your quick & kind response. I appreciate it. I will give it a try tonight and let you know the results.


