Variant Effect And The Impact Of Samples
3
0
Entering edit mode
10.4 years ago
User 1933 ▴ 340

I have extracted a region from 1KG webserver like

mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -N -e  'select concat("ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.",K.chrom,".phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"),concat(RIGHT(K.chrom,LENGTH(chrom)-3),":",MIN(K.txStart)+1,"-",MAX(K.txEnd)) from knownGene as K,kgXref as X where K.name=X.kgId and X.geneSymbol="PAX7" '   | xargs ./tabix-0.2.5/tabix -h

I extract a set of individuals,

perl /vcftools_0.1.10/perl/vcf-subset -c HG00254 /RawFiles/PAX7_raw.vcf > PAX7_sub.vcf

I run the Variant_effect_predictor (sift and polyphen). I get X amount of phathogenic variants - meaning the polyphen output is either 'possibly_damaging', 'probably_damaging', 'tolerated' or the SIFT outpout is 'tolerated'.

perl /variant_effect_predictor/variant_effect_predictor.pl -database -i PAX7_sub.vcf -sift b -polyphen p --force_overwrite -o PAX7_vep.txt

I look into the pathogenic variants

grep -E "stop_gained|missense_variant|frameshift" + maindir+geneSymbol+"_vep.txt > " + PAX7_missense_variants.txt

cat PAX7_missense_variants.txt | wc  =  `47     656    7110`

I wonder why the final number of pathogenic variants does not depend on the number of individuals. Here I only did the analysis on "HG00254" if I do it on another individuals the result would be the same ! isn't weird ? what I am doing wrong ?!

UPDATE For this specific genes, PAX7, in this vcf file, which corresponds to only 1 individual HG00254 there are 47 protein changing variants. however, if we do the same analysis on the original set, including all 1093 individuals, again, there are 47 protein changing variants.

for example I am doing the same analysisn on the original PAX7_raw.vcf

perl /variant_effect_predictor/variant_effect_predictor.pl -database -i /RawFiles/PAX7_raw.vcf -sift b -polyphen p --force_overwrite -o PAX7_raw_vep.txt

grep -E "stop_gained|missense_variant|frameshift" + maindir+geneSymbol+"_vep.txt > " + PAX7_raw_missense_variants.txt

and I count number of row - still 47 !

cat PAX7_raw_missense_variants.txt | wc  =  `47     656    7110`
• 2.5k views
ADD COMMENT
0
Entering edit mode
10.4 years ago

Please, check your genotypes GT in the VCF are not set to the null string '.'.

ADD COMMENT
0
Entering edit mode
10.4 years ago
Emily 23k

I'm a little bit confused by your question. Firstly, I think the roles of SIFT and PolyPhen need to be clarified - they do not tell you if a variant is pathogenic, they tell you if it's likely that the amino acid change will have an effect on the protein - this is subtle difference, ie fact vs probability. PolyPhen gives predictions of 'probably damaging' and 'possibly damaging', which means the aa change is likely to have an effect on the protein, or 'benign', which means it won't. SIFT gives predictions of 'tolerated' (unlikely to affect protein) or 'deleterious' (likely to affect protein). There's more information on SIFT and PolyPhen prediction here.

Also, SIFT and PolyPhen only apply to missense variants. They predict the effect of changing an amino acid. They will not give you any information on other variants that may affect protein structure such as stop_gained or frameshift. If you use these to filter your data, you will only get missense variants and will miss out other important protein-affecting variants.

If you're interested in variants that are likely to affect the protein, you should pull out missense variants with a PolyPhen prediction of 'probably damaging' or 'possibly damaging' and/or with a SIFT prediction of 'deleterious'. I think that the selection you are using will probably pull out nearly all the missense variants (as you're getting all the protein-affecting variants with your PolyPhen filter and all the non-protein-affecting variants with your SIFT filter) but miss all the stop_gained and frameshift variants.

You say "the final number of pathogenic variants does not depend on the number of individuals". Can you explain what you mean? Ie, can you give us some output files or numbers?

ADD COMMENT
0
Entering edit mode

Thanks for explaining about the protein-affecting variants. I have added some more code to explain why do I mean

ADD REPLY
0
Entering edit mode
10.4 years ago
Emily 23k

I think the issue is that the 1000 genomes VCF files contain all variants in all individuals. You then do some data slicing to get a region and a single individual. This, however, doesn't affect the number of variants you pull out: you pull them all out and your VCF file indicates what genotype your individual has, even if that is 0/0.

ADD COMMENT

Login before adding your answer.

Traffic: 1454 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6