I've been working with STR data in VCF format that looks like this:
ID REF ALT INFO I1 I2 I3 I4 I5 I6
STR345 TTTTTTTTTTTT TTTTTTTTTT,TTTCTTTTTTT,TTTTTTTTTTT,TTTTTTTTTTTTT DR2=0.81,0.23,0.9,0.1 0/0 1/0 3/0 3/2 4/4 1/1
Here the alternative alleles differ in the number of T repeats. DR2 is a quality metric with one value per alternative allele, and only alleles with DR2 > 0.3 will be kept. The remaining columns are the genotypes for 6 individuals.
I would like to perform an association analysis of these variants (I have more than 100K of them) with my binary phenotype of interest. It has been suggested that I use logistic regression, modeling disease status as a function of the STR allele length relative to the reference, but I'm not sure how to transform the data to get that length. One idea I have is to transform the VCF data to this format:
STR345 0 -2 -1 NA NA -4
Here the first individual is 0 because it is homozygous for the reference (0/0). The second is -2 because it has one copy of the first alternative allele, which has 2 fewer Ts than the reference. The third is -1 because it has one copy of the third allele, which has 1 fewer T than the reference. The next two individuals are missing because they carry alleles with DR2 below the threshold. Finally, the sixth individual is -4 because it has 2 copies of the first allele, each with 2 fewer Ts than the reference.
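To make the encoding concrete, here is a rough Python sketch of the transformation I have in mind. It assumes the simplified column layout shown above (not a full VCF), that INFO contains only the DR2 field, and that the per-individual value is the sum over both alleles of (allele length minus REF length). The function names are just ones I made up for illustration.

```python
from typing import List, Optional, Tuple


def allele_length_sum(ref: str, alts: List[str], dr2: List[float],
                      genotype: str, min_dr2: float = 0.3) -> Optional[int]:
    """Sum of (allele length - REF length) over both alleles of one genotype.

    Returns None (NA) if the call is missing or if it carries any ALT allele
    whose DR2 is not above min_dr2.
    """
    ref_len = len(ref)
    total = 0
    for idx in genotype.replace("|", "/").split("/"):
        if idx == ".":
            return None              # missing call
        i = int(idx)
        if i == 0:
            continue                 # reference allele contributes 0
        if dr2[i - 1] <= min_dr2:
            return None              # low-quality allele, treat genotype as NA
        total += len(alts[i - 1]) - ref_len
    return total


def encode_record(line: str, min_dr2: float = 0.3) -> Tuple[str, List[Optional[int]]]:
    """Parse one simplified record: ID REF ALT INFO GT1 GT2 ..."""
    fields = line.split()
    str_id, ref, alt, info = fields[:4]
    alts = alt.split(",")
    # Assumption: INFO is exactly "DR2=v1,v2,..." with one value per ALT allele.
    dr2 = [float(v) for v in info.split("=", 1)[1].split(",")]
    values = [allele_length_sum(ref, alts, dr2, gt, min_dr2) for gt in fields[4:]]
    return str_id, values


record = ("STR345 TTTTTTTTTTTT "
          "TTTTTTTTTT,TTTCTTTTTTT,TTTTTTTTTTT,TTTTTTTTTTTTT "
          "DR2=0.81,0.23,0.9,0.1 0/0 1/0 3/0 3/2 4/4 1/1")
str_id, lengths = encode_record(record)
print(str_id, *("NA" if v is None else v for v in lengths))
```

With this summing rule, a 1/0 genotype gives -2 and a 1/1 genotype gives -4 (two copies, each 2 Ts shorter than the reference). Setting the whole genotype to NA when either allele fails the DR2 filter is one possible choice; I am not sure it is the best one.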
My questions are: has anyone worked with STRs in this fashion before, and does my approach make sense? I forgot to mention that I first tried running the association after splitting the STRs into biallelic variants, but that way I lose the length information of each allele in the population, so I think using the length is the better approach.
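For context, this is roughly the per-STR test I would then run on the length values. It is only a sketch under my own assumptions: a single predictor (the summed length difference), no covariates, individuals with NA dropped, and a hand-written Newton-Raphson fit so that it runs without extra packages. The lengths and case/control labels below are made-up toy values, not my real data.

```python
import math


def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson.

    Returns (b0, b1, se_b1), where se_b1 comes from the inverse of the
    2x2 information matrix evaluated in the last iteration.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p             # score for the intercept
            g1 += (yi - p) * xi      # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta += H^{-1} g for the 2x2 information matrix H
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    se_b1 = math.sqrt(h00 / det)
    return b0, b1, se_b1


# Made-up toy data: summed length difference per individual (None = NA)
# and binary disease status (1 = case, 0 = control).
lengths = [0, -2, -1, None, None, -4, 0, -2, 1, -1, 0, -3]
status = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1]

# Drop individuals whose length is NA before fitting.
pairs = [(l, s) for l, s in zip(lengths, status) if l is not None]
x = [l for l, _ in pairs]
y = [s for _, s in pairs]

b0, b1, se_b1 = fit_logistic(x, y)
z = b1 / se_b1
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided Wald p-value
print(f"beta_length={b1:.3f} se={se_b1:.3f} z={z:.2f} p={p_value:.3g}")
```

In practice I would expect to use an existing logistic regression implementation and add covariates, but the idea is the same: one coefficient for length per STR, tested against zero.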
I'll appreciate your help! Please let me know if more info is required!