Different VCF file extensions - VCF 101

3.6 years ago

akshaykum684 ▴ 20

Hello everyone,

I am new to VCF file analysis and to bioinformatics in general, so I am trying to take a step-by-step approach. We have a set of VCF files like the ones below:

PG_American_Indian.1000Gphase3_v5.chr1.dose.vcf.gz
PG_American_Indian.1000Gphase3_v5.chr1.dose.vcf.gz.tbi
PG_American_Indian.1000Gphase3_v5.chr1.info.gz

PG_Australian.1000Gphase3_v5.chr9.dose.vcf.gz
PG_Australian.1000Gphase3_v5.chr9.dose.vcf.gz.tbi
PG_Australian.1000Gphase3_v5.chr9.info.gz

Q1) So how should I interpret the extensions?

vcf.gz - the bgzip-compressed file that contains the data. I can decompress it and use it for analysis, or use a tool that can read bgzipped files as-is.
vcf.gz.tbi - I found here that this is an index file corresponding to the bgzip file. But may I know what an index file is? What is this file used for?
info.gz - What does the info.gz file contain? What's the use of it?

Q2) How should I interpret the filenames?

PG_English.1000Gphase3_v5.chr1.dose.vcf.gz
Q3) Does this mean the file contains variant 5 from chromosome 1, and that this variant is reported for 1000 people (whose genomes were collected)?
Q4) What does dose in the file name indicate?
Q5) Will each VCF file always be accompanied by an index file (.tbi) and info.gz file?

Q6) Why do I see only v5.chr1 for American_Indian whereas I only see v5.chr9 for Australian?

    Does it mean there was no change in the other variants (meaning SNPs, indels) for American Indians or Australians?

Can I kindly request your help with this, please?

Tags: sequencing, snp, genome, gene, variant

"But may I know what an index file is?"

See the tabix paper, "Tabix: fast retrieval of sequence features from generic TAB-delimited files": https://pubmed.ncbi.nlm.nih.gov/21208982/

"What does the info.gz file contain?"

gunzip -c info.gz | more

"What's the use of it?"

Ask the file provider.

3.6 years ago by Pierre Lindenbaum 161k

"v5"

I suppose it's the 5th version. But again, you should ask the original file provider.

3.6 years ago by Pierre Lindenbaum 161k

1000G is the 1000 Genomes Project: https://www.internationalgenome.org/

3.6 years ago by GenoMax... 

"So how should I interpret the extensions?"

Those VCF files are bgzipped for a reason: it lets them be indexed with a program called tabix, which allows fast retrieval of data from tab-delimited files. The index produced by tabix has the .tbi extension. Keep each .vcf.gz file and its .tbi index together.
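
To make the idea of an index concrete, here is a toy sketch in Python. It is not how tabix works internally (the real .tbi index covers genomic intervals inside a compressed file and is more elaborate), and the records and the names build_index and fetch are made up for illustration. It only shows the principle: note once where each record starts, then jump straight to it instead of scanning the whole file.

```python
import io

def build_index(handle):
    """Map (chrom, pos) to the byte offset where that record's line starts."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#"):
            continue  # header lines are not indexed
        chrom, pos = line.split(b"\t", 2)[:2]
        index[(chrom.decode(), int(pos))] = offset
    return index

def fetch(handle, index, chrom, pos):
    """Jump directly to one record using the index, without scanning."""
    handle.seek(index[(chrom, pos)])
    return handle.readline().decode().rstrip("\n")

# Made-up, VCF-like records held in memory (hypothetical data).
data = io.BytesIO(
    b"#CHROM\tPOS\tID\tREF\tALT\n"
    b"1\t100\trsA\tA\tG\n"
    b"1\t250\trsB\tC\tT\n"
    b"9\t40\trsC\tG\tA\n"
)
idx = build_index(data)
print(fetch(data, idx, "9", 40))  # prints the record for chromosome 9, position 40
```

With a file of millions of variants, the saving comes from the seek: the lookup cost no longer depends on how far into the file the record sits.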

You can take a look inside the info file by running zcat PG_American_Indian.1000Gphase3_v5.chr1.info.gz (pipe the output to head or less, since the file may be long).
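
If you would rather peek from Python than from the shell, a minimal sketch using only the standard library could look like this. The helper name peek is hypothetical. It relies on bgzip output being readable as ordinary gzip, so the same function should also work on the .vcf.gz files.

```python
import gzip
from itertools import islice

def peek(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]

# Example call (the file must exist in the working directory):
# for line in peek("PG_American_Indian.1000Gphase3_v5.chr1.info.gz"):
#     print(line)
```

Only the first n lines are decompressed, so this stays quick even on large files.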

The 1000 Genomes file name conventions are described at this link. Note: the HTML formatting appears to be broken on that page.
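
For a quick overview of which chromosomes you actually have per dataset, the names can be split on dots. This is only a sketch: the labels prefix, panel, chrom and suffix are my own hypothetical names for the pieces, not terms from any official naming convention, and what each piece means should still be confirmed with the file provider.

```python
from collections import defaultdict

def split_name(filename):
    """Split 'A.B.chrN.rest' into four dot-separated pieces (labels are assumptions)."""
    prefix, panel, chrom, suffix = filename.split(".", 3)
    return {"prefix": prefix, "panel": panel, "chrom": chrom, "suffix": suffix}

def chromosomes_by_prefix(filenames):
    """Collect the chromosome labels seen for each filename prefix."""
    seen = defaultdict(set)
    for name in filenames:
        parts = split_name(name)
        seen[parts["prefix"]].add(parts["chrom"])
    return {prefix: sorted(chroms) for prefix, chroms in seen.items()}

names = [
    "PG_American_Indian.1000Gphase3_v5.chr1.dose.vcf.gz",
    "PG_American_Indian.1000Gphase3_v5.chr1.dose.vcf.gz.tbi",
    "PG_American_Indian.1000Gphase3_v5.chr1.info.gz",
    "PG_Australian.1000Gphase3_v5.chr9.dose.vcf.gz",
    "PG_Australian.1000Gphase3_v5.chr9.dose.vcf.gz.tbi",
    "PG_Australian.1000Gphase3_v5.chr9.info.gz",
]
print(chromosomes_by_prefix(names))
# {'PG_American_Indian': ['chr1'], 'PG_Australian': ['chr9']}
```

A listing like this only tells you which files are present. Why a given chromosome is missing for a dataset is again a question for whoever produced the files.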

3.6 years ago by GenoMax 141k