Question: How to do SNP selection in large VCF/BCF/GenotypeTables in R like you can using BCFTools,CYVCF2 or Excel
1
gravatar for William
13 months ago by
William4.4k
Europe
William4.4k wrote:

How to do SNP selection in large VCF/BCF/GenotypeTables in R like you can using BCFTools,CYVCF2 (fast python VCF/BCF parser) or Excel?

Support for at least the following type of filter criteria is required:

  • Filtering on Region of Interest
  • Filtering on Samples of interest
  • Filtering on Variant Quality (QUAL)
  • Filtering on Genotype call rate (INFO/AN / (number of samples*2))
  • Number of alternative alleles (INFO/AC)
  • Filtering on alternative allele frequency (INFO/AF)
  • Samples required to HOM_REF,HET,HOM_ALT
  • Samples to have contrasting genotypes

Preferred supported filtering criteria:

  • Filtering on sample specific or minimal/maximum/average genotype depth
  • Filtering on sample specific or minimal/maximum/average genotype quality
  • Distance to nearest neighbor variants upstream and downstream

Depending on filter support for genotype depth and genotype quality these VCF/BCF/R specific format files can be quite big.

So compact (small data size) R specific format and multi-threaded filtering might also be useful/required when loading everything to memory on a big machine.

Or streaming parsing of the VCF/BCF file (like BCFtools and CYVCF2 do) .

What would be the best way and library to do SNP selection using the above filter criteria in R?

Is there maybe something like CYVCF2 (python wrapper around HTSLIB VCF parser) for R?

Or maybe a BCFTools like something in R?

Or some other good R library for doing this?

The targeted downstream users are R(studio) users who would like to have this functionality in the environment that they are used to and already use for related tasks.

The VCF can optionally be (pre) converted to TAB delimited format with bcftools query. Or to any specific R format.

A high level / coarse filter on region of interest and samples of interest filter might already have been done. It's often (but not always) more about iterative / dynamic fine grained filtering final selection of SNPs.

filtering snp R vcf • 756 views
ADD COMMENTlink modified 13 months ago by Sean Davis25k • written 13 months ago by William4.4k
1

Why do you need to use R? I would never recommend anyone to do VCF filtering in R, or Excel.

ADD REPLYlink written 13 months ago by Kevin Blighe41k

Targeted downstream users are R(studio) users who would like to have this functionality in the environment that they are used to and already use for related tasks.

ADD REPLYlink written 13 months ago by William4.4k

Okay, I don't know of any tools because it's not something about which I have thought. Maybe others have ideas.

From my perspective, you could use system calls from within R, i.e., system calls that call BCFtools, etc., and then wrap these as functions for your users. This assumes that they use R Server or R on a linux cluster.

I'm sure that Pierre has some ideas here, when he next checks in

ADD REPLYlink modified 13 months ago • written 13 months ago by Kevin Blighe41k

Also remember that a VCF is essentiall just a tab-separated file, apart from the header. Parsing it should be easy in R, but I just worry about speed.

ADD REPLYlink written 13 months ago by Kevin Blighe41k

the VCF can optionally be converted to TAB delimited format with bcftools query. Or to any specific R format.

ADD REPLYlink written 13 months ago by William4.4k

Should that not suffice, in that case? You may look up the MAF variant format, in that case, just to stick to convention. The MAF format was developed in light of the TCGA deluge of data, as far as I know.

I would always add a unique identifier for each variant, too, consisting of sample:chr:pos:ref:alt (i.e. used as rownames in a R data-frame)

ADD REPLYlink written 13 months ago by Kevin Blighe41k

Just load a TAB file as a data frame is certainly an option. I was hoping there would be a R library that would support:

  • working on larger genotype tabels than can be loaded as a data-frame
  • be more user friendly, i.e. offer pre-defined domain specific filters like for example bcftools does (instead of just filtering on strings and integers)
ADD REPLYlink written 13 months ago by William4.4k

Sometimes the input VCF already has been filtered for regions and samples of interest. So it's now always very big (i.e. could be opened in Excel if you really wanted to) , but sometimes it still is very big (more suited for streaming processing with bcftools/cyvcf2).

ADD REPLYlink written 13 months ago by William4.4k

Well, I worked with a colleague in California who always opened my VCFs in Excel, but therefore I had to limit the variants to < ~1,000,000 due to limits in Excel. Although frowned upon, it's not always possible to keep everything within the coding environment, nor is it practical.

ADD REPLYlink written 13 months ago by Kevin Blighe41k

see the previous question on biostars and the examples for my tool vcffilterjdk : http://lindenb.github.io/jvarkit/VcfFilterJdk.html

ADD REPLYlink written 13 months ago by Pierre Lindenbaum118k

I was looking for something R specific Pierre. VcfFilterJdk is as R specific as writing an CYVCF2 python script and then executing that via R system2 call to the CYVCF2 script.

ADD REPLYlink written 13 months ago by William4.4k
2
gravatar for Sean Davis
13 months ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

Have a look at the VariantAnnotation package in Bioconductor.

https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html

And the vignette on filtering VCF files.

https://bioconductor.org/packages/release/bioc/vignettes/VariantAnnotation/inst/doc/filterVcf.pdf

Note that this may not be the fastest option due to the parsing into VCF object in R, but it can be a useful adjunct to "system" calls (remember, you can wrap command-line utilities in R functions to make them "appear" to R users as R functions).

ADD COMMENTlink modified 13 months ago • written 13 months ago by Sean Davis25k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1258 users visited in the last hour