Question: What Type Of Information Does The Data Downloaded From The 1000 Genomes Contain?
6.8 years ago, Pappu wrote:

With this command, I only get a vcf file containing numbers:

tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 11:62379194-62382592 >1000.vcf

Could you tell me what is going wrong, or how to interpret the output (first 11 columns, extracted with awk)?

11 62379317 rs181324718 G T 100 PASS ERATE=0.0004;LDAF=0.0013;AA=G;AN=2184;VT=SNP;THETA=0.0006;RSQ=0.7300;SNPSOURCE=LOWCOV;AC=2;AVGPOST=0.9992;AF=0.0009;ASN_AF=0.0035 GT:DS:GL 0|0:0.000:-0.01,-1.79,-5.00 0|0:0.000:-0.04,-1.06,-5.00

11 62379455 rs146374152 C T 100 PASS ERATE=0.0004;AA=C;AN=2184;VT=SNP;RSQ=0.5436;THETA=0.0056;SNPSOURCE=LOWCOV;AC=1;AVGPOST=0.9992;LDAF=0.0008;AF=0.0005;AFR_AF=0.0020 GT:DS:GL 0|0:0.000:-0.10,-0.69,-4.70 0|0:0.000:-0.00,-2.48,-5.00

11 62379545 rs139680986 C T 100 PASS ERATE=0.0004;AA=C;AN=2184;VT=SNP;THETA=0.0006;LDAF=0.0007;SNPSOURCE=LOWCOV;AC=1;RSQ=0.6432;AVGPOST=0.9995;AF=0.0005;ASN_AF=0.0017 GT:DS:GL 0|0:0.000:-0.01,-1.49,-5.00 0|0:0.000:-0.00,-2.59,-5.00

Can you show us some of the output?

written 6.8 years ago by zam.iqbal.genome

There should be some lines starting with # at the head of this VCF file that walk you through these values. Normally the VCF file will have the variant calls for all the samples in columns. A column count, a row count, and a read through the # lines should offer some insight.
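
A rough Python sketch of that suggestion: it counts the `##` meta lines, the columns named on the `#CHROM` line, and the data rows. The function name `summarize_vcf` and the miniature inline file are made up for illustration; the sample names S1 and S2 are placeholders.

```python
def summarize_vcf(lines):
    """Count meta lines, header columns and data rows in VCF text lines."""
    meta = 0          # lines starting with "##" (tag documentation)
    columns = []      # column names from the single "#CHROM" line
    records = 0       # variant rows
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            meta += 1
        elif line.startswith("#"):
            columns = line.lstrip("#").split("\t")
        else:
            records += 1
    return {"meta_lines": meta, "columns": len(columns), "records": records}


# Hypothetical miniature VCF, not the real 1000 Genomes header.
example = [
    "##fileformat=VCFv4.1",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "11\t62379317\trs181324718\tG\tT\t100\tPASS\tAC=2\tGT\t0|0\t0|0",
]
print(summarize_vcf(example))
```

For the real download, the same function could be fed an open file handle, e.g. `summarize_vcf(open("1000.vcf"))`.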

written 6.8 years ago by ngsgene
6.8 years ago, Chris Whelan (Portland, OR) wrote:

This is hard to read because tabs aren't showing up here, but the first few fields describe the variant: a SNP (VT=SNP) at either chr1 pos 162379317 or chr11 pos 62379317, depending on the placement of a tab. Since your tabix query was for the region 11:62379194-62382592, it must be chr11 pos 62379317. The SNP has ID rs181324718 and changes a G in the reference to a T in the alternate allele.
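
To make the field layout concrete, here is a small Python sketch that splits the first record into its fixed columns and turns the INFO column into a dictionary. The tabs are restored by hand (assuming chr11, which matches the region in the tabix query), the INFO string is shortened, and `parse_record` is an illustrative name rather than part of any library.

```python
def parse_record(line):
    """Split one VCF data line into its eight fixed columns plus the rest."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info = fields[:8]
    # INFO is a ';'-separated list of KEY=VALUE pairs (or bare flags).
    info_dict = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_dict[key] = value if value else True
    return {
        "chrom": chrom, "pos": int(pos), "id": var_id,
        "ref": ref, "alt": alt, "qual": qual, "filter": filt,
        "info": info_dict,
        "format": fields[8] if len(fields) > 8 else None,
        "samples": fields[9:],
    }


# First record from the question, tabs restored by hand, INFO shortened.
line = "\t".join([
    "11", "62379317", "rs181324718", "G", "T", "100", "PASS",
    "ERATE=0.0004;LDAF=0.0013;AA=G;AN=2184;VT=SNP;AC=2;AF=0.0009",
    "GT:DS:GL", "0|0:0.000:-0.01,-1.79,-5.00", "0|0:0.000:-0.04,-1.06,-5.00",
])
rec = parse_record(line)
print(rec["chrom"], rec["pos"], rec["id"], rec["ref"], ">", rec["alt"])
print(rec["info"]["VT"], rec["info"]["AF"])
```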

The numeric values that follow "GT:DS:GL" are the genotype (GT), genotype dosage (DS) and genotype likelihood (GL) values for each of the samples in the 1000 Genomes Project. It's hard to see here because you've lost tabs in your formatting; each one is of the format:

0|0:0.000:-0.01,-1.79,-5.00

In this case, 0|0 is the genotype for a particular sample; 0 means the reference allele and 1 means the alternate allele, so this sample is homozygous for the reference allele. If the sample were called as heterozygous, it might be 1|0 or 0|1 in that field, and homozygous for the alternate allele would be 1|1. The | separator indicates that the genotype is phased. The DS value is the genotype dosage: the expected number of copies of the alternate allele (between 0 and 2), which the VCF header should describe as coming from MaCH/Thunder; 0.000 here agrees with the 0|0 call. The three numbers separated by commas are the log10 likelihoods of the AA, AB, and BB genotypes (0/0, 0/1 and 1/1).
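
A short Python sketch of how one of these per-sample entries can be taken apart. It assumes the FORMAT string is GT:DS:GL, a diploid call with no missing alleles, and GL values that are log10-scaled as in the VCF specification. The helper name `parse_sample` is hypothetical.

```python
def parse_sample(format_string, sample_string):
    """Pair FORMAT keys (e.g. GT:DS:GL) with one sample's values."""
    values = dict(zip(format_string.split(":"), sample_string.split(":")))
    gt = values["GT"]
    # '|' marks a phased genotype, '/' an unphased one.
    phased = "|" in gt
    # Assumption: diploid call without missing alleles (no '.').
    alleles = [int(a) for a in gt.replace("|", "/").split("/")]
    if alleles[0] != alleles[1]:
        call = "heterozygous"
    elif alleles[0] == 0:
        call = "homozygous reference"
    else:
        call = "homozygous alternate"
    # GL: three comma-separated values for the 0/0, 0/1 and 1/1 genotypes.
    gl = [float(x) for x in values["GL"].split(",")]
    # Assumption: values are log10 likelihoods, so 10**x recovers the likelihood.
    likelihoods = [10 ** x for x in gl]
    total = sum(likelihoods)
    return {
        "GT": gt, "phased": phased, "call": call,
        "DS": float(values["DS"]),
        "GL": gl,
        "normalized": [l / total for l in likelihoods],
    }


result = parse_sample("GT:DS:GL", "0|0:0.000:-0.01,-1.79,-5.00")
print(result["call"], result["GL"])
```

The `normalized` list simply rescales the three likelihoods so they sum to 1, which makes it easier to see which genotype the data favour.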

The header of the VCF should tell you which sample goes with which column.
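
For instance, a minimal Python sketch that pairs the sample names on the `#CHROM` line with the genotype entries of a record. The sample names below (SAMPLE_A, SAMPLE_B) are placeholders, not real 1000 Genomes IDs, and `genotypes_by_sample` is a made-up helper.

```python
def genotypes_by_sample(header_line, record_line):
    """Map each sample name in the #CHROM line to its entry in a record."""
    header = header_line.rstrip("\n").split("\t")
    record = record_line.rstrip("\n").split("\t")
    # The first nine columns are fixed (CHROM ... INFO, FORMAT); samples follow.
    sample_names = header[9:]
    sample_values = record[9:]
    if len(sample_names) != len(sample_values):
        raise ValueError("header and record have different sample counts")
    return dict(zip(sample_names, sample_values))


# Placeholder header line; the real one lists the actual sample IDs.
header_line = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B"
record_line = (
    "11\t62379317\trs181324718\tG\tT\t100\tPASS\tVT=SNP\tGT:DS:GL\t"
    "0|0:0.000:-0.01,-1.79,-5.00\t0|0:0.000:-0.04,-1.06,-5.00"
)
for name, value in genotypes_by_sample(header_line, record_line).items():
    print(name, value.split(":")[0])
```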

See http://www.1000genomes.org/node/101 for more details.


Hi

This is a great answer. A quick note to say that the README file does explain some of these tags:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/README.phase1_integrated_release_version3_20120430

If you add the option -h to your tabix command, you will also get the header, which should contain full documentation for all the other tags.
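
As an illustration, this Python sketch pulls the ID and Description out of `##INFO` and `##FORMAT` header lines. The header lines shown are hand-written examples in the usual VCF style and may not match the wording in the real file; `tag_descriptions` is a hypothetical helper.

```python
import re

# Matches lines such as ##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
META_RE = re.compile(r'^##(INFO|FORMAT)=<ID=([^,]+),.*Description="([^"]*)"')


def tag_descriptions(lines):
    """Return {(kind, tag): description} for ##INFO and ##FORMAT lines."""
    found = {}
    for line in lines:
        match = META_RE.match(line)
        if match:
            kind, tag, description = match.groups()
            found[(kind, tag)] = description
    return found


# Hand-written example header lines (assumed wording, for illustration only).
header = [
    "##fileformat=VCFv4.1",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    '##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">',
]
for (kind, tag), text in tag_descriptions(header).items():
    print(kind, tag, text)
```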

written 6.8 years ago by Laura