How can I do SNP selection on large VCF/BCF files or genotype tables in R, the way you can with BCFtools, cyvcf2 (a fast Python VCF/BCF parser) or Excel?
Support for at least the following type of filter criteria is required:
- Filtering on Region of Interest
- Filtering on Samples of interest
- Filtering on Variant Quality (QUAL)
- Filtering on genotype call rate (INFO/AN / (number of samples * 2), assuming diploid samples)
- Filtering on alternative allele count (INFO/AC)
- Filtering on alternative allele frequency (INFO/AF)
- Requiring specific samples to be HOM_REF, HET or HOM_ALT
- Requiring samples to have contrasting genotypes
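
To make the required criteria concrete, here is a minimal base-R sketch of the kind of filtering I mean. It assumes the genotypes are already available as a plain table with one row per variant. The column names, sample names (S1 to S3) and thresholds are made up for illustration, genotypes are assumed to be unphased and diploid, and "contrasting" is taken to mean that both samples are homozygous for different alleles.

```r
# Toy genotype table; all names and values are hypothetical.
variants <- data.frame(
  CHROM = c("chr1", "chr1", "chr1", "chr2"),
  POS   = c(100L, 250L, 900L, 50L),
  QUAL  = c(50, 10, 80, 99),
  AN    = c(6L, 6L, 4L, 6L),   # number of called alleles (INFO/AN)
  AC    = c(2L, 1L, 2L, 3L),   # alternative allele count (INFO/AC)
  S1    = c("0/0", "0/1", "./.", "0/0"),
  S2    = c("1/1", "0/0", "1/1", "0/1"),
  S3    = c("0/0", "0/0", "0/0", "1/1"),
  stringsAsFactors = FALSE
)
samples <- c("S1", "S2", "S3")

# Genotype call rate: called alleles / (2 * number of samples), diploid assumption.
call_rate <- variants$AN / (2 * length(samples))

# Alternative allele frequency derived from AC and AN.
alt_freq <- variants$AC / variants$AN

# Contrasting genotypes between S1 and S2: both homozygous, but for different alleles.
is_hom <- function(gt) gt %in% c("0/0", "1/1")
contrasting <- is_hom(variants$S1) & is_hom(variants$S2) &
  variants$S1 != variants$S2

# Sample-specific requirement: S3 must be HOM_REF.
s3_hom_ref <- variants$S3 == "0/0"

keep <- variants$QUAL >= 30 &
  call_rate >= 0.9 &
  alt_freq >= 0.2 &
  contrasting &
  s3_hom_ref

selected <- variants[keep, ]
print(selected)
```

This works fine on a small table, but it is exactly the part that gets slow and memory-hungry on large files, which is why I am asking for a dedicated library.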
Preferred supported filtering criteria:
- Filtering on sample-specific or minimum/maximum/average genotype depth
- Filtering on sample-specific or minimum/maximum/average genotype quality
- Distance to nearest neighbor variants upstream and downstream
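
As an illustration of the preferred criteria, the sketch below computes the distance to the nearest neighboring variant on the same chromosome and the minimum and average genotype depth across samples, again in base R. The table layout, the per-sample depth columns (DP_S1, DP_S2) and the cut-offs are assumptions chosen for the example only.

```r
# Toy table with per-sample genotype depth; all names and values are hypothetical.
variants <- data.frame(
  CHROM = c("chr1", "chr1", "chr1", "chr2", "chr2"),
  POS   = c(100L, 130L, 900L, 50L, 5000L),
  DP_S1 = c(12L, 30L, 18L, 25L, 40L),
  DP_S2 = c(20L, 5L, 15L, 22L, 35L),
  stringsAsFactors = FALSE
)

# Distances only make sense on position-sorted data.
variants <- variants[order(variants$CHROM, variants$POS), ]

# Distance to the previous (upstream) and next (downstream) variant per chromosome;
# NA where there is no neighbor on that side.
dist_up <- ave(variants$POS, variants$CHROM,
               FUN = function(p) c(NA, diff(p)))
dist_down <- ave(variants$POS, variants$CHROM,
                 FUN = function(p) c(diff(p), NA))
nearest <- pmin(dist_up, dist_down, na.rm = TRUE)

# Minimum and average genotype depth across the samples.
dp <- as.matrix(variants[, c("DP_S1", "DP_S2")])
min_dp  <- apply(dp, 1, min)
mean_dp <- rowMeans(dp)

# Keep variants with no neighbor within 50 bp and at least 10 reads in every sample.
keep <- (is.na(nearest) | nearest >= 50) & min_dp >= 10
selected <- variants[keep, ]
print(selected)
```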
If filtering on genotype depth and genotype quality has to be supported, these VCF/BCF files (or their R-specific equivalents) can get quite big.
So a compact R-specific format and multi-threaded filtering might be useful, or even required, when loading everything into memory on a big machine.
Alternatively, the VCF/BCF file could be parsed as a stream, like BCFtools and cyvcf2 do.
What would be the best way and library to do SNP selection using the above filter criteria in R?
Is there something like cyvcf2 (a Python wrapper around the HTSlib VCF parser) for R?
Or something like BCFtools in R?
Or some other good R library for doing this?
The intended downstream users are R (RStudio) users who would like to have this functionality in the environment they are used to and already use for related tasks.
The VCF can optionally be pre-converted to tab-delimited format with bcftools query, or to any R-specific format.
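
For example, the tab-delimited route could look roughly like the sketch below. The bcftools command in the comment is only an assumption about how the export might be done, the file name is a placeholder, and the lines are inlined here so the snippet runs without any input file. The sample names S1 and S2 are hypothetical.

```r
# Hypothetical export, one row per variant with one GT column per sample:
#   bcftools query -f '%CHROM\t%POS\t%QUAL[\t%GT]\n' input.vcf.gz > genotypes.tsv
# Here the resulting lines are inlined instead of being read from a file.
query_lines <- c(
  "chr1\t100\t50\t0/0\t1/1",
  "chr1\t250\t10\t0/1\t0/0",
  "chr2\t50\t99\t./.\t0/1"
)

# bcftools query writes no header for this format string, so the column
# names are supplied by hand. A bare "." (missing value) becomes NA;
# a missing genotype "./." stays a string.
gt <- read.delim(
  text = query_lines,
  header = FALSE,
  col.names = c("CHROM", "POS", "QUAL", "S1", "S2"),
  na.strings = ".",
  stringsAsFactors = FALSE
)
str(gt)
```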
A high-level, coarse filter on region of interest and samples of interest might already have been applied. The question is often (but not always) more about iterative, dynamic, fine-grained filtering to arrive at the final selection of SNPs.