Haplotype annotation in VCF files of the 1000 Genomes Project phase 3
2.5 years ago
caggtaagtat ★ 1.5k

Hi there,

I'm new to VCF file analysis and would like to download a large database of human SNPs with information about their location, their sequence variation, and whether they can occur in a homozygous state.

So far I have found this directory of files from the 1000 Genomes Project, where I think I can download the relevant data. However, I'm not sure whether I'm looking at the right columns.

The data looks like this:

22      16050654        esv3647175;esv3647176;esv3647177;esv3647178     A       <CN0>,<CN2>,<CN3>,<CN4> 100     PASS    AC=9,87,599,20;AF=0.00179712,0.0173722,0.119609,0.00399361;AN=5008;CS=DUP_gs;END=16063474;NS=2504;SVTYPE=CNV;DP=22545;EAS_AF=0.001,0.0169,0.2361,0.0099;AMR_AF=0,0.0101,0.219,0.0072;AFR_AF=0.0061,0.0363,0.0053,0;EUR_AF=0,0.007,0.0944,0.003;SAS_AF=0,0.0082,0.1094,0.002;VT=SV       GT      3|0     0|0     0|0     0|0     0|0     0|0     0|4     0|0     0|0     0|3     0|0     0|0     0|0     0|0     0|3     0|0     0|0     0|0     0|0     0|0     0|0         0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     3|0     0|0     3|0     0|0     0|0     3|0     0|0     0|0     0|0     0|0     3|0     0|0     0|0     0|0     0|3     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|3     0|0     0|4     0|0     0|0     0|0     3|0     0|0     0|0     0|0     0|3     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     0|0     3|0     0|0     0|0     0|0     0|0     3|0     0|0     0|0     0|3     3|0     0|3     2|0     0|0     0|0     ...


Other entries only show 0|0, 0|1 or 1|0, so I initially thought the numbers indicated the haplotype of the SNP in different individuals. However, in that case I don't understand the difference between 0|2 and 3|0.

Edit: I should add that these columns are not documented in the VCF file header.

SNP VCF Haplotype • 741 views
2.5 years ago

Hello,

The numbers describe which of the REF and ALT alleles are present in the sample. 0 means the REF allele, and any value greater than 0 gives the position (counting from 1) of the allele in the ALT column.

So a sample with genotype 0|0 is homozygous for the reference allele. A sample with 0|2 has one reference allele, and its second allele corresponds to the second entry in the ALT column. A sample with 0|3 has one reference allele, and its second allele corresponds to the third entry in the ALT column. In your example ALT is <CN0>,<CN2>,<CN3>,<CN4>, so 2 means <CN2> and 3 means <CN3>.

The | indicates that the genotypes are phased. Within a sample, all alleles on the same chromosome that are written in front of the | lie on one haplotype (one copy of the chromosome), and those written behind it lie on the other. So 3|0 and 0|3 carry the same two alleles, but on opposite haplotypes. If the phase is unknown, the delimiter is / instead.
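To make the indexing concrete, here is a minimal Python sketch that decodes a GT string against the REF and ALT columns. The function name `decode_genotype` is made up for this illustration; the REF and ALT values are copied from the record in the question.

```python
import re


def decode_genotype(gt, ref, alts):
    """Translate a GT string such as '0|3' or '1/1' into allele strings.

    Returns (alleles, phased). Index 0 is REF, index n > 0 is the n-th
    ALT entry, and '.' (missing call) is returned as None.
    """
    phased = "|" in gt                    # '|' = phased, '/' = unphased
    indices = re.split(r"[|/]", gt)       # split on either delimiter
    lookup = [ref] + alts                 # position 0 = REF, 1.. = ALT
    alleles = [None if i == "." else lookup[int(i)] for i in indices]
    return alleles, phased


# REF and ALT taken from the example record in the question
ref = "A"
alts = "<CN0>,<CN2>,<CN3>,<CN4>".split(",")

for gt in ["0|0", "0|2", "3|0", "0|3"]:
    print(gt, decode_genotype(gt, ref, alts))
```

With this, 3|0 and 0|3 both decode to one `A` and one `<CN3>`; only the order, i.e. the haplotype carrying each allele, differs.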

fin swimmer


Thank you very much! That helps a lot. So when I'm looking for SNPs which can occur homozygous, I would check for at least one entry with n|n or n/n with n > 0 ?


So when I'm looking for SNPs which can occur homozygous, I would check for at least one entry with n|n or n/n with n > 0

Yes.
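As a rough sketch of that check in Python, assuming tab-separated VCF data lines with a FORMAT column that contains GT: the function name `has_homozygous_alt` and the two demo records are invented for illustration and are not taken from the 1000 Genomes files.

```python
import re


def has_homozygous_alt(vcf_line):
    """Return True if any sample is n|n or n/n with n > 0.

    Expects one tab-separated VCF data line (not a '#' header line).
    Haploid or missing calls are skipped.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    gt_idx = fields[8].split(":").index("GT")    # column 9 is FORMAT
    for sample in fields[9:]:                    # samples start at column 10
        gt = sample.split(":")[gt_idx]
        alleles = re.split(r"[|/]", gt)
        if (len(alleles) == 2
                and alleles[0] == alleles[1]
                and alleles[0] not in ("0", ".")):
            return True
    return False


# Invented records, for illustration only
with_hom = "\t".join(["22", "100", "demo1", "A", "G", "100", "PASS", ".",
                      "GT", "0|0", "0|1", "1|1"])
without_hom = "\t".join(["22", "200", "demo2", "A", "G,T", "100", "PASS", ".",
                         "GT", "0|0", "1|0", "1|2"])

print(has_homozygous_alt(with_hom))
print(has_homozygous_alt(without_hom))
```

Note that a genotype such as 1|2 carries two non-reference alleles but is not homozygous, so it is deliberately not counted. For whole files, an established VCF tool or library is likely a better choice than parsing lines by hand.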