Develop model files after segregating VCF data into different ethnicity subgroups
18 months ago
rheab1230 ▴ 140

Hello everyone, I have a genotype (VCF) file and a gene expression file. I want to split the genotype file by subpopulation and train a separate model on each subset, so that I end up with one model file per population. I do not understand how to separate the VCF according to the population each sample comes from. Also, is there a package that can learn the population structure from the data, group samples of the same population/ethnicity together, and then run elastic net training with cross-validation on each group? Thank you.
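To make the splitting step concrete, here is a minimal sketch in Python. The VCF shown in this post carries only sample IDs, so the population labels have to come from somewhere else; the `labels` dictionary, the `POP_A`/`POP_B` names, and the `<pop>.samples.txt` file names are all hypothetical placeholders, not real annotations. The sketch only groups the sample IDs and builds `bcftools view -S` commands; it does not run them.

```python
from collections import defaultdict


def parse_sample_ids(chrom_header_line):
    """Return the sample IDs from a '#CHROM ...' line (every column after FORMAT)."""
    fields = chrom_header_line.split()
    if fields[:1] != ["#CHROM"] or len(fields) < 10:
        raise ValueError("not a #CHROM header line with sample columns")
    return fields[9:]


def group_by_population(sample_ids, sample_to_pop):
    """Group sample IDs by population label, keeping VCF column order."""
    groups = defaultdict(list)
    for sample in sample_ids:
        pop = sample_to_pop.get(sample)
        if pop is not None:  # samples without a label are left out
            groups[pop].append(sample)
    return dict(groups)


def subset_commands(groups, vcf_path):
    """Build one 'bcftools view -S' command per population (not executed here)."""
    return [
        ["bcftools", "view", "-S", f"{pop}.samples.txt",
         "-Oz", "-o", f"{pop}.vcf.gz", vcf_path]
        for pop in sorted(groups)
    ]


header = ("#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT "
          "GTEX-1117F GTEX-111CU GTEX-111FC GTEX-111VG")
# Placeholder labels for illustration only, not real ancestry annotations.
labels = {"GTEX-1117F": "POP_A", "GTEX-111CU": "POP_B",
          "GTEX-111FC": "POP_B"}

groups = group_by_population(parse_sample_ids(header), labels)
commands = subset_commands(groups, "chr22_annotate_hg38.vcf.gz")
for cmd in commands:
    print(" ".join(cmd))
```

Each `<pop>.samples.txt` would be a plain text file with one sample ID per line, written from the corresponding entry in `groups` before the commands are run. Samples with no label (here `GTEX-111VG`) are simply left out of every subset.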

This is what my VCF file looks like:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FILTER=<ID=VQSRTrancheSNP99.80to99.90,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -12.2518 <= x < -2.796">
##FILTER=<ID=VQSRTrancheINDEL99.95to100.00,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -91515.6585 <= x < -32.0217">
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=InbreedingCoeff,Description="InbreedingCoeff < -0.3">
##FILTER=<ID=VQSRTrancheINDEL99.95to100.00+,Description="Truth sensitivity tranche level for INDEL model at VQS Lod < -91515.6585">
##FILTER=<ID=VQSRTrancheSNP99.95to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -292808.5957">
##FILTER=<ID=VQSRTrancheSNP99.95to100.00,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -292808.5957 <= x < -34.5312">
##FILTER=<ID=VQSRTrancheINDEL99.90to99.95,Description="Truth sensitivity tranche level for INDEL model at VQS Lod: -32.0217 <= x < -19.6278">
##FILTER=<ID=VQSRTrancheSNP99.90to99.95,Description="Truth sensitivity tranche level for SNP model at VQS Lod: -34.5312 <= x < -12.2518">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">
##INFO=<ID=ClippingRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP Membership">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">
##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">
##INFO=<ID=END,Number=1,Type=Integer,Description="Stop position of the interval">
##INFO=<ID=ExcessHet,Number=1,Type=Float,Description="Phred-scaled p-value for exact test of excess heterozygosity">
##INFO=<ID=FS,Number=1,Type=Float,Description="Phred-scaled p-value using Fisher's exact test to detect strand bias">
##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">
##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation">
##INFO=<ID=MLEAC,Number=.,Type=Integer,Description="Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed">
##INFO=<ID=MLEAF,Number=.,Type=Float,Description="Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed">
##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">
##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">
##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">
##INFO=<ID=NEGATIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the negative training set of bad variants">
##INFO=<ID=POSITIVE_TRAIN_SITE,Number=0,Type=Flag,Description="This variant was used to build the positive training set of good variants">
##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">
##INFO=<ID=RAW_MQ,Number=1,Type=Float,Description="Raw data for RMS Mapping Quality">
##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias">
##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##INFO=<ID=VQSLOD,Number=1,Type=Float,Description="Log odds of being a true variant versus being false under the trained gaussian mixture model">
##INFO=<ID=culprit,Number=1,Type=String,Description="The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out">
##INFO=<ID=wasSplit,Number=0,Type=Flag,Description="Specifies that the variant was split from a multi-allelic site">
##reference=file:///cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta
##INFO=<ID=SLO,Number=0,Type=Flag,Description="Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out">
##INFO=<ID=NSF,Number=0,Type=Flag,Description="Has non-synonymous frameshift A coding region variation where one allele in the set changes all downstream amino acids. FxnClass = 44">
##INFO=<ID=R3,Number=0,Type=Flag,Description="In 3' gene region FxnCode = 13">
##INFO=<ID=R5,Number=0,Type=Flag,Description="In 5' gene region FxnCode = 15">
##INFO=<ID=NSN,Number=0,Type=Flag,Description="Has non-synonymous nonsense A coding region variation where one allele in the set changes to STOP codon (TER). FxnClass = 41">
##INFO=<ID=NSM,Number=0,Type=Flag,Description="Has non-synonymous missense A coding region variation where one allele in the set changes protein peptide. FxnClass = 42">
##INFO=<ID=G5A,Number=0,Type=Flag,Description=">5% minor allele frequency in each and all populations">
##INFO=<ID=COMMON,Number=1,Type=Integer,Description="RS is a common SNP.  A common SNP is one that has at least one 1000Genomes population with a minor allele of frequency >= 1% and for which 2 or more founders contribute to that minor allele frequency.">
##INFO=<ID=RS,Number=1,Type=Integer,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=RV,Number=0,Type=Flag,Description="RS orientation is reversed">
##INFO=<ID=TPA,Number=0,Type=Flag,Description="Provisional Third Party Annotation(TPA) (currently rs from PHARMGKB who will give phenotype data)">
##INFO=<ID=CFL,Number=0,Type=Flag,Description="Has Assembly conflict. This is for weight 1 and 2 variant that maps to different chromosomes on different assemblies.">
##INFO=<ID=GNO,Number=0,Type=Flag,Description="Genotypes available. The variant has individual genotype (in SubInd table).">
##INFO=<ID=VLD,Number=0,Type=Flag,Description="Is Validated.  This bit is set if the variant has 2+ minor allele count based on frequency or genotype data.">
##INFO=<ID=ASP,Number=0,Type=Flag,Description="Is Assembly specific. This is set if the variant only maps to one assembly">
##INFO=<ID=ASS,Number=0,Type=Flag,Description="In acceptor splice site FxnCode = 73">
##INFO=<ID=G5,Number=0,Type=Flag,Description=">5% minor allele frequency in 1+ populations">
##INFO=<ID=OM,Number=0,Type=Flag,Description="Has OMIM/OMIA">
##INFO=<ID=PMC,Number=0,Type=Flag,Description="Links exist to PubMed Central article">
##INFO=<ID=SSR,Number=1,Type=Integer,Description="Variant Suspect Reason Codes (may be more than one value added together) 0 - unspecified, 1 - Paralog, 2 - byEST, 4 - oldAlign, 8 - Para_EST, 16 - 1kg_failed, 1024 - other">
##INFO=<ID=RSPOS,Number=1,Type=Integer,Description="Chr position reported in dbSNP">
##INFO=<ID=HD,Number=0,Type=Flag,Description="Marker is on high density genotyping kit (50K density or greater).  The variant may have phenotype associations present in dbGaP.">
##INFO=<ID=PM,Number=0,Type=Flag,Description="Variant is Precious(Clinical,Pubmed Cited)">
##bcftools_annotateCommand=annotate -x FORMAT chr22.vcf; Date=Thu Jul 21 22:27:22 2022
##bcftools_annotateCommand=annotate -x FORMAT --force -Oz chr22_annotate_hg38.vcf.gz; Date=Tue Sep  6 09:31:37 2022
##bcftools_annotateCommand=annotate -x FORMAT chr22_annotate_hg38.vcf.gz; Date=Mon Oct 17 12:20:29 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  GTEX-1117F      GTEX-111CU      GTEX-111FC      GTEX-111VG      GTEX-111YS      GTEX-1122O      GTEX-1128S      GTEX-113IC      GTEX-113JC      GTEX-117XS      GTEX-117YW      GTEX-117YX      GTEX-1192W      GTEX-1192X      GTEX-11DXX      GTEX-11DXZ      GTEX-11DYG      GTEX-11DZ1      GTEX-11EI6      GTEX-11EM3      GTEX-11EMC      GTEX-11EQ8      GTEX-11EQ9      GTEX-11GS4      GTEX-11GSO
      GTEX-11GSP      GTEX-11I78      GTEX-11ILO      GTEX-11LCK      GTEX-11NSD      GTEX-11NUK      GTEX-11NV4      GTEX-11O72      GTEX-11OF3      GTEX-11ONC      GTEX-11P7K      GTEX-11P81      GTEX-11P82      GTEX-11PRG      GTEX-11TT1      GTEX-11TTK      GTEX-11TUW      GTEX-11UD1      GTEX-11UD2      GTEX-11VI4      GTEX-11WQC      GTEX-11WQK      GTEX-11XUK      GTEX-11ZTS      GTEX-11ZTT      GTEX-11ZU8      GTEX-11ZUS      GTEX-11ZVC      GTEX-1211K      GTEX-12126      GTEX-1212Z      GTEX-12584      GTEX-12696      GTEX-1269C      GTEX-12C56      GTEX-12KS4      GTEX-12WS9      GTEX-12WSA      GTEX-12WSB
      GTEX-12WSD      GTEX-12WSE      GTEX-12WSF      GTEX-12WSG      GTEX-12WSH      GTEX-12WSI      GTEX-12WSJ      GTEX-12WSK      GTEX-12WSL      GTEX-12WSM      GTEX-12WSN      GTEX-12ZZW      GTEX-12ZZX      GTEX-12ZZY      GTEX-12ZZZ      GTEX-13111      GTEX-13112      GTEX-13113      GTEX-1313W      GTEX-1314G      GTEX-131XE      GTEX-131XF      GTEX-131XG      GTEX-131XH      GTEX-131XW      GTEX-131YS      GTEX-132AR      GTEX-132NY      GTEX-132Q8      GTEX-132QS      GTEX-1339X      GTEX-133LE      GTEX-1399Q      GTEX-1399R      GTEX-1399S      GTEX-1399T      GTEX-1399U      GTEX-139D8      GTEX-139T4
      GTEX-139T6      GTEX-139TS      GTEX-139TT      GTEX-139TU      GTEX-139UC      GTEX-139UW      GTEX-139YR      GTEX-13CF2      GTEX-13CF3      GTEX-13CIG      GTEX-13CZU      GTEX-13CZV      GTEX-13D11      GTEX-13FH7      GTEX-13FHO      GTEX-13FHP      GTEX-13FLV      GTEX-13FLW      GTEX-13FTW      GTEX-13FTX      GTEX-13FTY      GTEX-13FTZ      GTEX-13FXS      GTEX-13G51      GTEX-13IVO      GTEX-13JUV      GTEX-13JVG      GTEX-13N11      GTEX-13N1W      GTEX-13N2G      GTEX-13NYB      GTEX-13NYC      GTEX-13NZ8      GTEX-13NZ9      GTEX-13NZA      GTEX-13NZB      GTEX-13O21      GTEX-13O3O      GTEX-13O3P
      GTEX-13O3Q      GTEX-13O61      GTEX-13OVG      GTEX-13OVH      GTEX-13OVI      GTEX-13OVJ      GTEX-13OVK      GTEX-13OVL      GTEX-13OW5      GTEX-13OW6      GTEX-13OW7      GTEX-13OW8      GTEX-13PDP      GTEX-13PL6      GTEX-13PL7      GTEX-13PLJ      GTEX-13PVQ      GTEX-13PVR      GTEX-13QBU      GTEX-13QIC      GTEX-13QJ3      GTEX-13QJC      GTEX-13RTJ      GTEX-13RTK      GTEX-13RTL      GTEX-13S7M      GTEX-13S86      GTEX-13SLW      GTEX-13SLX      GTEX-13U4I      GTEX-13VXT      GTEX-13VXU      GTEX-13W3W      GTEX-13W46      GTEX-13X6H      GTEX-13X6I      GTEX-13X6J      GTEX-13X6K      GTEX-13YAN
      GTEX-1445S      GTEX-144FL      GTEX-144GL      GTEX-144GM      GTEX-145LU      GTEX-145MF      GTEX-145MG      GTEX-145MH      GTEX-145MI      GTEX-145MN      GTEX-145MO      GTEX-146FH      GTEX-146FQ      GTEX-146FR      GTEX-14753      GTEX-1477Z   
chr22   10510212        rs1452389754    A       T       407.41  VQSRTrancheSNP99.80to99.90      AC=11;AF=0.0240175;AN=458;BaseQRankSum=0.437;ClippingRankSum=0.212;DP=1221;ExcessHet=0;FS=0;InbreedingCoeff=0.1007;MLEAC=9;MLEAF=0.018;MQ=18.93;MQRankSum=-1.855;NEGATIVE_TRAIN_SITE;QD=15.67;ReadPosRankSum=0;SOR=1.697;VQSLOD=-8.428;culprit=DP;ASP;RS=1452389754;RSPOS=10510212;SAO=0;SSR=0;TOPMED=0.65805778542303771,0.34194221457696228;VC=SNV;VP=0x050000000005000002000100;WGT=1;dbSNPBuildID=151       GT      ./.     ./.     0/0     ./.     ./.     ./.     0/0     ./.     0/0     ./.     ./.     ./.     ./.     ./.     ./.     0/0
     0/0     ./.     ./.     ./.     ./.     ./.     ./.     0/0     0/0     ./.     ./.     ./.     ./.     0/0     0/0     ./.     ./.     ./.     ./.     ./.     ./.     0/0     ./.     ./.     0/0     0/0     0/0     ./.     0/0     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.
     ./.     ./.     ./.     ./.     0/0     0/0     ./.     ./.     ./.     0/0     0/0     0/0     ./.     1/1     ./.     ./.     ./.     0/0     ./.     ./.     0/0     ./.     ./.     ./.     0/0     0/0     0/0     0/0     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.
     0/0     0/0     ./.     ./.     0/0     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     ./.     0/0     ./.     0/0     ./.     ./.     ./.     ./.     0/0     ./.     ./.     0/0     0/0     ./.     ./.     ./.     0/0     ./.     0/0     ./.     ./.     0/0     ./.     ./.     ./.     ./.
     0/0     ./.     ./.     0/0     ./.     ./.     0/0     ./.     ./.     ./.     ./.     ./.     0/0     0/1     ./.     ./.     0/0     0/0     ./.     ./.     ./.     0/0     ./.     ./.     ./.     ./.     ./.     ./.     ./.     0/0     ./.     ./.     0/0     ./.     ./.     ./.     ./.     ./.     0/0
     0/0     ./.     ./.     ./.     ./.     ./.     0/0     ./.     0/0     ./.     
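Since the FORMAT column here is `GT` only, each genotype would need to be turned into a number before any elastic net fit. A minimal sketch of that encoding, assuming biallelic records and treating `./.` as missing; the function names are my own, and the demo line is the record above with its INFO field and genotype columns shortened:

```python
def gt_to_dosage(gt):
    """Map a diploid GT string to an ALT-allele count (0, 1 or 2), or None if missing."""
    # Keep only the GT subfield and treat phased ('|') like unphased ('/').
    alleles = gt.split(":")[0].replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        return None
    return sum(allele != "0" for allele in alleles)


def record_dosages(vcf_line):
    """Return (variant ID, per-sample dosages) for one VCF data line."""
    fields = vcf_line.split()
    return fields[2], [gt_to_dosage(gt) for gt in fields[9:]]


# Shortened version of the chr22 record shown above (INFO abbreviated to AC=11,
# only four genotype columns kept).
line = ("chr22 10510212 rs1452389754 A T 407.41 "
        "VQSRTrancheSNP99.80to99.90 AC=11 GT ./. 0/0 1/1 0/1")
variant_id, dosages = record_dosages(line)
print(variant_id, dosages)
```

Most genotypes in the record above are `./.`, so how the missing values are handled (dropping the variant, dropping samples, or imputing) is a separate decision that this sketch does not make.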
model machine vcf learning ethnicity • 554 views

What does your VCF look like? Could you edit your post and add an example line?


I have added my VCF file header.
