User-friendly (visual & interactive) VCF/BCF mining tools (2021)
2.6 years ago
William ★ 5.3k

What is currently the best user-friendly (visual and interactive) VCF/BCF mining tool in 2021? For a VCF/BCF similar in size to, or even larger than, the 1000 human genomes VCF?

I guess most organizations do not have a visual and interactive VCF mining tool, but use either:

  1. A website front-end + batch-query back-end: submit your query and wait a few minutes to hours to get results back. Maybe you get no results, too many results, or wrong results. Then repeat.
  2. A (junior) bioinformatician who runs a query or two on the command line every time a biologist without Linux/programming experience has a question.

I asked this question already around 5 years ago, and wonder what the situation currently is.

So: 100M+ variants, 1,000+ samples, compressed BCF file size 500 GB+, uncompressed VCF several TB.

One requirement is that it should do all kinds of filtering that bcftools view does:

http://www.htslib.org/doc/bcftools.html#view

But bcftools does not meet the interactive and visual requirements. bcftools is only interactive for small VCF files, or when you use the tabix index to limit the query to a small region.
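For concreteness, this is the kind of bcftools view filtering meant here (file, region, and sample names are made up; requires bcftools and an indexed input):

```shell
# Region + expression filter: biallelic SNPs with QUAL>30 and a rare ALT
# allele, restricted to two samples, written as compressed BCF.
bcftools view -r chr2:100000-200000 -m2 -M2 -v snps \
    -i 'QUAL>30 && INFO/AF<0.01' -s sampleA,sampleB \
    input.bcf -Ob -o subset.bcf

# Keep only sites where at least one of the selected samples
# actually carries an ALT allele.
bcftools view -c 1 subset.bcf | less -S
```

On a 500 GB BCF, the second, unindexed style of query is exactly the part that stops being interactive.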

Another requirement is that the filtering is visual and interactive, like, for example, with a small genotype matrix in Excel. (I know, bad idea, but at least Excel is interactive, visual, and biologist friendly.)

By interactive I mean that a filter criterion can be adjusted and you get your updated result genotype matrix back in semi real-time (a few seconds to one minute). Even for complex queries, where the full 100M+ variants for all 1,000+ samples must be scanned, the tool should be interactive.

Does something like this already exist? If so which tools?

I am mostly curious about what open-source solutions there are, but also curious whether there are any commercial solutions.

See also this older question and answers:

Which Type Of Database Systems Are More Appropriate For Storing Information Extracted From Vcf Files

I am/was hoping that nowadays something like the following exists:

  • a scalable database (cluster) (e.g. MongoDB/Spark etc.) that stores the content of a large VCF/BCF: variants and genotypes
  • domain code that can run bcftools view-like queries against it
  • results reported (full/paginated or summarized) in a website / fat GUI.

I believe that many use Excel or some other software to analyze already-annotated VCFs. I know it doesn't really apply to the same type of events, but I recently used JBrowse to analyze a structural-variant VCF, and it has a visual and interactive interface.


Genome browsers like JBrowse/IGV work fine, but only for a few samples and a few variants/regions of interest. Fine if you are already at that level, but not if you still need to get down to "small data" (= a few regions of interest / a few samples of interest).


Okay, maybe working with Hail using Databricks on AWS could be an option in this case.


I have looked at Hail in the past and found it can do GWAS and PCA on large VCF files quickly, but not (as far as I know) filter a large VCF file the way bcftools view does.


I just found the BGT tool (by Heng Li): https://github.com/lh3/bgt. It's not visual, but it seems to allow very flexible queries. Edit: actually, it has a web interface.
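A rough sketch of the query style, as I understand it from the bgt README (written from memory; verify the exact flags against the repo before relying on this):

```shell
# Select a sample group by metadata, filter sites by allele count/frequency
# within that group, and restrict to a region; -G drops per-sample
# genotypes from the output. File name and metadata field are made up.
bgt view -s'population=="CEU"' -f'AC1/AN1>=0.05' -r 11:100000-200000 -G 1kg.bgt
```

The appeal here is that sample selection is expressed against sample metadata rather than hard-coded sample lists.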

Of the visual tools I've only found VIVA (written in Julia), which I mentioned below. It is still under development, but it looks promising.

2.6 years ago
sbstevenlee ▴ 480

If you are familiar with Python and the pandas package: I wrote the pyvcf.VcfFrame class, which stores VCF data as a pandas.DataFrame to allow fast computation and easy manipulation (click here to see the class). It also supports plotting as a bonus. I'm not sure whether this approach is robust enough for you (pyvcf.VcfFrame will only be as fast and efficient as pandas.DataFrame gets), but I once needed to "interactively" play with a lot of VCF data, and doing it with a CLI such as bcftools was not an option, so I ended up writing the class myself. It's 100% open source and free, and I invite others to contribute if they can improve it. That being said, I'd be curious to learn what other tools are available to this end.
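The core idea (the VCF body held as a pandas.DataFrame that you filter and re-filter interactively) can be sketched with plain pandas; this is not the actual VcfFrame API, and the three-site, two-sample VCF below is made up:

```python
import io

import pandas as pd

# A tiny in-memory VCF body (meta-information lines stripped),
# standing in for a real file on disk.
vcf_text = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\t.\tA\tG\t50\tPASS\tAF=0.5\tGT\t0/1\t1/1\n"
    "1\t200\t.\tC\tT\t10\tPASS\tAF=0.1\tGT\t0/0\t0/1\n"
    "2\t300\t.\tG\tA\t99\tPASS\tAF=0.9\tGT\t1/1\t1/1\n"
)

# Read the tab-separated VCF body into a DataFrame.
df = pd.read_csv(io.StringIO(vcf_text), sep="\t")
df = df.rename(columns={"#CHROM": "CHROM"})

# Interactive-style filtering: tweak the criterion and re-run instantly.
hi_qual = df[df["QUAL"] >= 30]
print(hi_qual["POS"].tolist())  # → [100, 300]
```

The limitation is the same as with Excel: the whole matrix has to fit in memory, so this works for small to medium data, not for a 500 GB BCF.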


Looks interesting, but more for when you already have small or medium data. At that point I often just stream the VCF/BCF file with cyvcf2 (a Python htslib wrapper) and some custom code for whatever I need to do.
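The streaming pattern looks roughly like this with cyvcf2 (file name and thresholds are made up; cyvcf2 must be installed):

```python
from cyvcf2 import VCF

# Stream a (possibly huge) BCF record by record; htslib does the decoding,
# so memory use stays flat regardless of file size.
vcf = VCF("cohort.bcf")
samples = vcf.samples

for variant in vcf:
    # Skip low-quality sites (QUAL can be missing, hence the None check).
    if variant.QUAL is not None and variant.QUAL < 30:
        continue
    # gt_types is a numpy array, one entry per sample:
    # 0=HOM_REF, 1=HET, 2=UNKNOWN, 3=HOM_ALT.
    n_alt_carriers = ((variant.gt_types == 1) | (variant.gt_types == 3)).sum()
    if n_alt_carriers >= 1:
        print(variant.CHROM, variant.POS, variant.REF, variant.ALT, n_alt_carriers)
```

It is fast for a single pass, but it is still a batch scan, not the interactive re-querying asked about above.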

2.1 years ago
jena ▴ 290

Thanks for the question. I am currently dealing with a similar problem and haven't found a fully satisfying solution either, but I can share some tips.

I used to do population-genomics analyses on a non-model organism, using RADseq and exome-capture (WES) data for up to a few hundred samples. The uncompressed VCFs were maybe close to 1 GB. These days I work on human WGS datasets like SGDP and HGDP, and soon perhaps even the recent 30x resequencing of the g1k project data. These datasets have close to 100M sites and some 930 samples. The uncompressed files are a few hundred GB; compressed, HGDP is just under 200 GB.

I tried a few different approaches and tools to tackle some questions:

  • vcfR is a simple, modern R library for working with VCF data. It is great with smaller data, like that from RADseq or WES, but I would not try it on my current HGDP dataset, even on a cluster; any operation would take half an hour or more. However, it can read VCF files in chunks, so it may work for you in some way.
  • plink can take VCF as input (with the --vcf flag) and do some analyses, like LD-based pruning and PCA or MDS. The new versions (1.9 and 2) are very fast. If you have done all the quality-based filtering and have a final dataset, you could convert it to plink format and use it with admixtools2 (an R package) to get some interesting features out of the dataset, such as per-population allele-frequency spectra, f-statistics, and so on.
  • bcftools query allows some manipulation and filtering of VCFs, and the output can be custom-defined. tabix can help with reducing the data to a region of interest. awk or mawk (a faster awk) can help with further processing of the queries. Overall, bcftools is good for filtering but not for subsetting the data (at least I have not found a way yet). This can sometimes be solved with tabix, but not always; and tabix works on BGZF-compressed VCFs, not on BCFs.
  • bgzip by itself can be useful. It can decompress files with multithreading (--threads INT), similar to pigz, and pipe the output (with -dc) to e.g. awk for queries that cannot be done with bcftools.
  • BGZF-compressed VCF files can be indexed with both tabix and grabix, which can then provide random access to the VCF. While tabix extracts data based on genome coordinates, grabix instead uses file coordinates (line numbers) to extract data.
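A few of the command-line patterns above, spelled out (file names and thresholds are made up; requires bcftools, tabix, and bgzip from htslib):

```shell
# tabix: random access by genome coordinates on a BGZF-compressed,
# .tbi-indexed VCF; -h keeps the header.
tabix -h calls.vcf.gz chr1:1000000-2000000 > region.vcf

# bcftools query: custom per-site output (chrom, pos, allele frequency),
# post-filtered with awk to keep common sites.
bcftools query -f '%CHROM\t%POS\t%INFO/AF\n' calls.vcf.gz \
    | awk '$3 > 0.05' > common_sites.tsv

# bgzip: multithreaded decompression piped into awk for ad-hoc queries,
# here keeping non-header lines with QUAL (column 6) >= 30.
bgzip -dc --threads 4 calls.vcf.gz | awk '!/^#/ && $6 >= 30' | head
```

None of this is interactive in the sense of the question, but the tabix step is the one piece that stays fast at any file size.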

I'm also currently looking into the Julia language and its ecosystem, but I haven't gotten very far yet (too busy). I know they have some packages for reading VCFs and doing some population genetics (mostly modelled after the adegenet package in R, which, by the way, could also be useful; or pegas).

So overall I mostly use bcftools, awk, and plink, with the help of tools like bgzip and tabix or grabix. Not exactly what you asked for, but this is how it is for now (as far as I know).

But maybe somebody else can show us better ways.


Just checked the BioJulia packages, and VIVA looks like something you may want to check out: https://compbiocore.github.io/VariantVisualization.jl/stable/
