I'm trying to define a new way to create a "unique_key" for each genotype in a vcf file to use as an index into an SQL database.
For example for this two lines:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003 20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,. 20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
I would get a list of unique indexes like this:
20-14370-G-G 20-14370-A-G 20-14370-A-A 20-1110696-G-T 20-1110696-T-T
Obs: For 1|2 and 2|1 I would get the same id 20-1110696-G-T
My primary reason for this would be to quickly compare and filter some genotypes from two tables, for example:
table1 has 40k rows and table2 has 100 million rows
If I had a "unique id" like that I could quickly eliminate genotypes from table1 that would be present in table2.
Does it make any sense to create a unique id for each genotype or am I trying to reinvent the wheel somehow ?
Is there any other more correct way of achieving this goal of generating a unique id for each genotype present in the VCF ?
PS: Yes I know how to intersect positions from a BED or a VCF file, but this time I need to find a way to generate a unique key for each genotype in order to do this comparison quickly using an SQL query.
Thank you for your attention! :)