User-friendly (visual & interactive) VCF/BCF mining tools (2021)
5 weeks ago
William ★ 4.9k

What is currently (2021) the best user-friendly (visual and interactive) VCF/BCF mining tool for VCF/BCF files similar in size to, or even larger than, the 1000 Genomes VCF?

I guess most organizations do not have a visual and interactive VCF mining tool but use either:

  1. A website front-end plus a batch query system back-end: submit your query and wait a few minutes to hours to get results back. Maybe you get no results, too many results, or the wrong results. And then repeat.
  2. A (junior) bioinformatician who runs one or a few queries on the command line every time a biologist without Linux/programming experience has a question.

I asked this question already around 5 years ago, and wonder what the situation currently is.

So the scale is: 100M+ variants, 1000+ samples, a compressed BCF file of 500 GB+, and an uncompressed VCF of several TB.

One requirement is that it should do all kinds of filtering that bcftools view does:

http://www.htslib.org/doc/bcftools.html#view

But bcftools does not meet the interactive and visual requirements. bcftools is only interactive for small VCF files, or when you use the tabix index to limit the query to a small region.
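
To illustrate why an index-limited region query stays fast while a full scan does not, here is a toy sketch. It assumes a sorted list of variant positions on one chromosome and uses binary search as a stand-in for an index; the real tabix index uses a binning scheme, and the names `region_query` and `full_scan` are made up for this example.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted variant positions on a single chromosome.
positions = [100, 250, 400, 900, 1500, 4000]

def region_query(sorted_positions, start, end):
    """Return positions in [start, end] using binary search (O(log n) to locate)."""
    lo = bisect_left(sorted_positions, start)
    hi = bisect_right(sorted_positions, end)
    return sorted_positions[lo:hi]

def full_scan(all_positions, predicate):
    """Return positions matching an arbitrary predicate; must touch every record."""
    return [p for p in all_positions if predicate(p)]

in_region = region_query(positions, 200, 1000)
even_hundreds = full_scan(positions, lambda p: p % 200 == 0)
```

The region query only touches the records inside the window, whereas any filter that is not tied to a coordinate range has to visit every record, which is the case that gets slow at 100M+ variants.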

Another requirement is that the filtering is visual and interactive, as it is, for example, with a small genotype matrix in Excel. (I know, a bad idea, but at least Excel is interactive, visual and biologist-friendly.)

By interactive I mean that a filter criterion can be adjusted and you get back your updated result genotype matrix in semi-real time (a few seconds to 1 minute). The tool should stay interactive even for complex queries where the full 100M+ variants for all 1000+ samples have to be scanned.
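
As a minimal sketch of the kind of adjust-and-refilter loop meant here, the example below re-filters a tiny in-memory genotype matrix whenever a threshold changes. Everything in it is an assumption for illustration: the alt-allele-count encoding, the two filter criteria, and the function names. It says nothing about how to make this fast at 100M+ variants.

```python
from typing import Dict, List, Optional

# Genotypes coded as alternate-allele counts (0, 1, 2); None marks a missing call.
Genotypes = List[Optional[int]]

def alt_allele_frequency(gts: Genotypes) -> float:
    """Alternate allele frequency over called diploid genotypes."""
    called = [g for g in gts if g is not None]
    if not called:
        return 0.0
    return sum(called) / (2 * len(called))

def call_rate(gts: Genotypes) -> float:
    """Fraction of samples with a non-missing genotype."""
    if not gts:
        return 0.0
    return sum(g is not None for g in gts) / len(gts)

def filter_matrix(matrix: Dict[str, Genotypes],
                  min_af: float, min_call_rate: float) -> Dict[str, Genotypes]:
    """Keep variants that pass both thresholds; rerun after every adjustment."""
    return {
        variant: gts
        for variant, gts in matrix.items()
        if alt_allele_frequency(gts) >= min_af and call_rate(gts) >= min_call_rate
    }

# Hypothetical genotype matrix: variant -> genotypes for four samples.
matrix = {
    "chr1:1000:A>G": [0, 1, 2, 0],
    "chr1:2000:C>T": [0, 0, None, 0],
    "chr2:500:G>A": [1, 1, None, None],
}

strict = filter_matrix(matrix, min_af=0.1, min_call_rate=0.7)
relaxed = filter_matrix(matrix, min_af=0.1, min_call_rate=0.5)
```

Relaxing the call-rate threshold from 0.7 to 0.5 brings the third variant back into the result, which is the sort of immediate feedback a biologist gets from a spreadsheet filter.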

Does something like this already exist? If so which tools?

I am mostly curious about which open source solutions there are, but also whether there are any commercial solutions.

See also this older question and answers:

Which Type Of Database Systems Are More Appropriate For Storing Information Extracted From Vcf Files

I am/was hoping that nowadays something like the following exists:

  • a scalable database (cluster), e.g. MongoDB, Spark etc., that stores the content of a large VCF/BCF: variants and genotypes
  • bcftools view-like domain code that can run queries against it
  • results reported (full, paginated or summarized) in a website or fat GUI.
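
The three points above can be sketched in miniature as follows. This is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for a scalable database cluster, the two-table schema and the `query_page` helper are invented for this example, and the filter shown (region, minimum QUAL, at least one alternate allele) is just one bcftools view-style query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variants (
        variant_id INTEGER PRIMARY KEY,
        chrom TEXT, pos INTEGER, ref TEXT, alt TEXT, qual REAL
    );
    CREATE TABLE genotypes (
        variant_id INTEGER REFERENCES variants(variant_id),
        sample TEXT,
        alt_count INTEGER  -- 0, 1 or 2; NULL means a missing call
    );
    CREATE INDEX idx_variants_region ON variants (chrom, pos);
    CREATE INDEX idx_genotypes_variant ON genotypes (variant_id);
""")

# Hypothetical toy data: five variants, two samples.
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "chr1", 100, "A", "G", 50.0),
    (2, "chr1", 200, "C", "T", 10.0),
    (3, "chr1", 300, "G", "A", 60.0),
    (4, "chr1", 400, "T", "C", 70.0),
    (5, "chr2", 100, "A", "T", 90.0),
])
conn.executemany("INSERT INTO genotypes VALUES (?, ?, ?)", [
    (1, "s1", 0), (1, "s2", 1),
    (2, "s1", 1), (2, "s2", 1),
    (3, "s1", 0), (3, "s2", 0),
    (4, "s1", 2), (4, "s2", None),
    (5, "s1", 1), (5, "s2", 0),
])

def query_page(conn, chrom, start, end, min_qual, page, page_size):
    """One page of variants in a region with QUAL >= min_qual and >= 1 alt allele."""
    sql = """
        SELECT v.chrom, v.pos, v.ref, v.alt, SUM(g.alt_count) AS alt_alleles
        FROM variants v
        JOIN genotypes g ON g.variant_id = v.variant_id
        WHERE v.chrom = ? AND v.pos BETWEEN ? AND ? AND v.qual >= ?
        GROUP BY v.variant_id
        HAVING SUM(g.alt_count) > 0
        ORDER BY v.pos
        LIMIT ? OFFSET ?
    """
    params = (chrom, start, end, min_qual, page_size, page * page_size)
    return conn.execute(sql, params).fetchall()

first_page = query_page(conn, "chr1", 1, 1000, 30.0, page=0, page_size=1)
second_page = query_page(conn, "chr1", 1, 1000, 30.0, page=1, page_size=1)
```

A web front-end would call something like `query_page` each time the user changes a filter or flips a page. Whether a row-per-genotype layout like this holds up at 100M+ variants times 1000+ samples is exactly the open question.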
gui vcf • 359 views

I believe that many people use Excel or some other software to analyze already annotated VCFs. I know it doesn't really apply to the same type of events, but I recently used JBrowse, which has a visual and interactive interface, to analyze a structural variant VCF.


Genome browsers like JBrowse/IGV work fine, but only for a few samples and a few variants/regions of interest. That is fine if you are already at that level, but not if you still need to get to "small data" (= a few regions of interest / a few samples of interest).


Okay, then maybe working with Hail using Databricks on AWS could be an option in this case.


I have looked at Hail in the past and found that it can do GWAS and PCA on large VCF files quickly, but (as far as I know) it cannot filter a large VCF file the way bcftools view does.

5 weeks ago
sbstevenlee ▴ 240

If you are familiar with Python and the pandas package, I wrote the pyvcf.VcfFrame class, which stores VCF data as a pandas.DataFrame to allow fast computation and easy manipulation (click here to see the class). It also supports plotting as a bonus. I'm not sure if this approach is robust enough for you, since pyvcf.VcfFrame will only be as fast and efficient as pandas.DataFrame gets, but I once needed to "interactively" play with a lot of VCF data, and doing it with a CLI such as bcftools was not an option, so I just ended up writing the class myself. It's 100% open source and free, and I invite others to contribute if they can improve it. That being said, I'd be curious to learn what other tools are available to this end.


Looks interesting, but it is more for when you already have small or medium data. In that case I often just stream the VCF/BCF file with cyvcf2 (a Python htslib wrapper) and some custom code for what I need to do.
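
A rough sketch of that streaming pattern is below. To keep it self-contained it parses plain VCF text with the standard library instead of using cyvcf2, so `stream_records` and `keep` are hypothetical helpers, not cyvcf2 API, and the filter (minimum QUAL plus at least one non-reference genotype) is just an example of "custom code".

```python
from typing import Dict, Iterable, Iterator, List

def stream_records(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per VCF data line without loading the whole file."""
    samples: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        qual = None if fields[5] == "." else float(fields[5])
        # The example assumes GT is the first FORMAT key, as the VCF spec requires.
        gts = [sample_field.split(":")[0] for sample_field in fields[9:]]
        yield {
            "chrom": fields[0],
            "pos": int(fields[1]),
            "ref": fields[3],
            "alt": fields[4],
            "qual": qual,
            "genotypes": dict(zip(samples, gts)),
        }

def keep(record: Dict[str, object], min_qual: float = 30.0) -> bool:
    """Example filter: QUAL >= min_qual and at least one non-reference call."""
    non_ref = any(
        gt not in ("0/0", "0|0", "./.", ".|.")
        for gt in record["genotypes"].values()
    )
    return record["qual"] is not None and record["qual"] >= min_qual and non_ref

# Hypothetical three-variant, two-sample VCF.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0",
    "chr1\t200\t.\tC\tT\t10\tPASS\t.\tGT\t1/1\t0/1",
    "chr1\t300\t.\tG\tA\t60\tPASS\t.\tGT\t0/0\t./.",
]

kept = [(r["chrom"], r["pos"]) for r in stream_records(vcf_lines) if keep(r)]
```

Because the generator yields one record at a time, memory use stays flat regardless of file size, but every query is still a full pass over the file, so it is not interactive at the scale in the question.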
