Hello Everyone,
I am new to VCF file analysis and to bioinformatics in general, so I am trying to take a step-by-step approach. We have a set of VCF files like the ones below:
PG_American_Indian.1000Gphase3_v5.chr1.dose.vcf.gz
PG_American_Indian.1000Gphase3_v5.chr1.dose.vcf.gz.tbi
PG_American_Indian.1000Gphase3_v5.chr1.info.gz
PG_Australian.1000Gphase3_v5.chr9.dose.vcf.gz
PG_Australian.1000Gphase3_v5.chr9.dose.vcf.gz.tbi
PG_Australian.1000Gphase3_v5.chr9.info.gz
Q1) So how should I interpret the extensions?
vcf.gz - this is the bgzip-compressed file that contains the data. I can unzip it and use it for analysis, or use a tool that can read bgzipped files as-is.
vcf.gz.tbi - I found here that this is an index file corresponding to the bgzip file. But what does "index file" mean, and what is it used for?
info.gz - What does the info.gz file contain? What's the use of it?
Q2) How should I interpret the filenames?
PG_English.1000Gphase3_v5.chr1.dose.vcf.gz
Q3) Does this mean the file contains variant 5 from chromosome 1, and that this variant 5 is for the 1000 people whose genomes were collected?
Q4) What does dose in the file name indicate?
Q5) Will each VCF file always be accompanied by an index file (.tbi) and an info.gz file?
Q6) Why do I see only v5.chr1 for American_Indian, whereas I only see v5.chr9 for Australian?
Does it mean there was no change in the other variants (meaning SNPs and indels) for American Indians or Australians?
Can I kindly request your help with this, please?
https://pubmed.ncbi.nlm.nih.gov/21208982/ "Tabix: fast retrieval of sequence features from generic TAB-delimited files"
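As a rough picture of what an index buys you, here is a toy sketch in Python. It is not tabix and does not use the .tbi format; it only assumes a plain, position-sorted, tab-delimited text file, and the names `build_index`, `fetch` and `demo` are made up for illustration. The idea is to remember the byte offset of every N-th line so that a query can jump close to the region it wants instead of reading the file from the top.

```python
import bisect
import os
import tempfile


def build_index(path, step=100):
    """Return (position, byte_offset) pairs for every `step`-th line."""
    index = []
    offset = 0
    with open(path, "rb") as handle:
        for line_no, line in enumerate(handle):
            if line_no % step == 0:
                # Column 2 holds the position in this toy layout.
                index.append((int(line.split(b"\t")[1]), offset))
            offset += len(line)
    return index


def fetch(path, index, start, end):
    """Return rows with start <= position <= end, seeking via the index."""
    if not index:
        return []
    positions = [pos for pos, _ in index]
    # Last checkpoint strictly before `start`, or the first one.
    slot = max(bisect.bisect_left(positions, start) - 1, 0)
    rows = []
    with open(path, "rb") as handle:
        handle.seek(index[slot][1])
        for line in handle:
            pos = int(line.split(b"\t")[1])
            if pos > end:
                break  # file is sorted, nothing further can match
            if pos >= start:
                rows.append(line.decode().rstrip("\n"))
    return rows


def demo():
    """Build a made-up sorted file, index it, and query one small region."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".tsv", delete=False, newline="\n"
    ) as tmp:
        for i in range(10_000):
            tmp.write(f"chr1\t{i * 10}\tvalue{i}\n")
        path = tmp.name
    try:
        return fetch(path, build_index(path), 50_000, 50_020)
    finally:
        os.remove(path)
```

Calling `demo()` reads only the last stretch of lines before the requested region rather than all 10,000. Tabix applies the same general idea to bgzipped files, which is why the .tbi has to sit next to the .vcf.gz it was built from.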
Ask the file provider.
v5: I suppose it's the 5th version. But again, you should ask the original file provider.
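To see how the pieces of such a file name sit next to each other, one can simply split it on the dots. The sketch below does only that; the labels `prefix`, `tag`, `chromosome` and `kind` are my own guesses for illustration, not an official naming scheme, so the provider's description still wins.

```python
def describe(filename):
    """Split a file name from the listing on dots and label the pieces.

    The labels are hypothetical and only reflect how the name is punctuated.
    """
    parts = filename.split(".")
    if len(parts) < 4:
        raise ValueError(f"unexpected file name: {filename}")
    return {
        "prefix": parts[0],            # e.g. PG_American_Indian
        "tag": parts[1],               # e.g. 1000Gphase3_v5
        "chromosome": parts[2],        # e.g. chr1
        "kind": ".".join(parts[3:]),   # e.g. dose.vcf.gz or info.gz
    }
```

Going by the punctuation alone, v5 is attached to 1000Gphase3 with an underscore, while chr1 is a separate dot-delimited field. That fits reading v5 as a version of the 1000G data rather than as "variant 5 of chromosome 1".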
1000G: https://www.internationalgenome.org/
Those VCF files are bgzipped for a reason. They are indexed with a program called tabix. It allows fast retrieval of data from tab-delimited files. The index produced by tabix has the .tbi extension. Those files need to be kept together.
You can take a look inside the info file by doing zcat PG_American_Indian.1000Gphase3_v5.chr1.info.gz
1000 Genomes file name conventions are described at this link. Note: HTML formatting appears to be messed up on that page.
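If you would rather peek at the info file from Python than dump all of it with zcat, something like the following should work. It is a minimal sketch using only the standard library; `peek` is a made-up helper name. Since bgzip output is gzip-compatible, the same function should also read the first lines of a .vcf.gz.

```python
import gzip
from itertools import islice


def peek(path, n=5):
    """Return the first `n` lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, `peek("PG_American_Indian.1000Gphase3_v5.chr1.info.gz")` would return the header line plus the first few records. What the columns mean depends on the tool that wrote the file, so that is again a question for the provider.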