Question: show vcf data in a table
0
sarah.k0 wrote:

Hi everybody. I have a huge VCF file (http://1001genomes.org/data/GMI-MPI/releases/v3.1/1001genomes_snp-short-indel_with_tair10_only_ACGTN.vcf.gz). It is a 132 GB compressed VCF file (1135 accessions x 117 million SNPs and short indels). I want to convert it to SQL or CSV format for data modeling in NoSQL databases such as Apache Cassandra or MongoDB. But first, I would like to get a better understanding of its structure. Are there any tools that can show this data as a table?

Thank you,

sql nosql csv vcf • 3.4k views
ADD COMMENTlink modified 2.8 years ago by Petr Ponomarenko2.6k • written 2.8 years ago by sarah.k0
1

Note that VCF is already a tabular format (tab-separated values) once it is uncompressed, so you may not need to convert it to CSV at all. Since the file is so large, you might take a subset of it to view. For example, on Linux you can get a small sample of the file with something like gunzip -c largefile.vcf.gz | head -n 1000 > largefile_subset.vcf and then open largefile_subset.vcf in a spreadsheet (Excel can probably even open a file this small :))
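
If you would rather do the subsetting in Python, here is a minimal sketch of the same idea. It is not identical to the head command above: it keeps every header line and then the first n_records data lines, so the subset still has its column header row. The function name write_vcf_subset and the file names are made up for illustration.

```python
import gzip


def write_vcf_subset(src_gz, dest, n_records=1000):
    """Copy all header lines plus the first n_records data lines to dest.

    Reading stops as soon as enough records are written, so the whole
    compressed file is never decompressed.
    """
    kept = 0
    with gzip.open(src_gz, "rt") as src, open(dest, "w") as out:
        for line in src:
            if line.startswith("#"):
                # Meta lines (##...) and the #CHROM column header line.
                out.write(line)
                continue
            if kept >= n_records:
                break
            out.write(line)
            kept += 1
    return kept
```

Calling write_vcf_subset("largefile.vcf.gz", "largefile_subset.vcf") would then give a small file you can open in a spreadsheet.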

ADD REPLYlink written 2.8 years ago by cmdcolin1.3k
1

Try vcf2tsv from vcflib. There are also several small scripts that can convert VCF information to tables.

ADD REPLYlink written 2.8 years ago by cpad011212k
1

If you haven't seen Hail yet, it might be helpful for your end goal. It can load VCF files for Apache Spark analysis.

ADD REPLYlink written 2.8 years ago by Robert Sicko590
3
United States / Los Angeles / ALAPY.com
Petr Ponomarenko2.6k wrote:

The VCF format is described here: https://samtools.github.io/hts-specs/VCFv4.3.pdf (the VCF v4.3 specification). As you can see, there are genotype columns (one per sample) whose internal structure is specified by the FORMAT column, and the INFO column can carry different keys from row to row. So to parse it in depth you may need a good VCF parser, such as VariantsToTable, which is part of GATK.
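
To make that structure concrete, here is a rough plain-Python sketch of how one data line breaks down. It is not how VariantsToTable works and it is not a complete parser (no type conversion, no handling of escaped characters); the function names and the sample names in the usage note are hypothetical.

```python
def parse_info(info):
    """Turn 'DP=10;AF=0.5;DB' into a dict; flags without '=' map to True."""
    if info == ".":
        return {}
    out = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_record(line, sample_names):
    """Split one VCF data line into fixed columns, INFO keys and genotypes."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    record = {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": flt,
        "info": parse_info(info), "samples": {},
    }
    if len(fields) > 9:
        # FORMAT (column 9) names the colon-separated subfields that
        # every sample column in this row uses.
        keys = fields[8].split(":")
        for name, value in zip(sample_names, fields[9:]):
            record["samples"][name] = dict(zip(keys, value.split(":")))
    return record
```

For a line with FORMAT set to GT:GQ:DP and two samples named "88" and "108", record["samples"]["108"] would be a small dict with the keys GT, GQ and DP. A nested record like this maps fairly directly onto a MongoDB document, whereas a flat CSV needs one column per sample per FORMAT key.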

Before loading this dataset into a database you may want to filter it; vcftools can do this. Splitting the file by chromosome will help speed up queries. You can also use vcftools to calculate many different summary statistics to understand your data better.
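
As an illustration of the splitting step, here is a small plain-Python stand-in (not vcftools) that writes one VCF per chromosome and returns the record count for each. It assumes chromosome names are safe to use as file names and that there are only a handful of them, so one open file per chromosome is fine. The function name split_by_chrom is made up.

```python
import os
from collections import Counter


def split_by_chrom(lines, out_dir):
    """Write one VCF per chromosome into out_dir; return record counts."""
    os.makedirs(out_dir, exist_ok=True)
    header, handles, counts = [], {}, Counter()
    try:
        for line in lines:
            if line.startswith("#"):
                # Collect header lines so each output file gets a copy.
                header.append(line)
                continue
            chrom = line.split("\t", 1)[0]
            if chrom not in handles:
                path = os.path.join(out_dir, f"{chrom}.vcf")
                handles[chrom] = open(path, "w")
                handles[chrom].writelines(header)
            handles[chrom].write(line)
            counts[chrom] += 1
    finally:
        for fh in handles.values():
            fh.close()
    return counts
```

You can pass it any iterable of lines, for example a file object returned by gzip.open(path, "rt"), so the input is streamed and never held in memory.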

ADD COMMENTlink modified 2.8 years ago • written 2.8 years ago by Petr Ponomarenko2.6k