I have recently been working with data from the Personal Genome Project. I downloaded the raw data for one individual's whole genome.
My main concern is the format of the data. Complete Genomics releases the individuals' genomes in its own format, called masterVar, which looks like this:
#ASSEMBLY_ID GS000014558-ASM
#COSMIC COSMIC v48
#DBSNP_BUILD dbSNP build 132
#GENOME_REFERENCE NCBI build 37
#SAMPLE GS01669-DNA_D02
#GENERATED_BY cgatools
#GENERATED_AT 2012-Sep-28 19:43:38.251270
#SOFTWARE_VERSION 18.104.22.168
#FORMAT_VERSION 2.0
#GENERATED_BY dbsnptool
#TYPE VAR-ANNOTATION

>locus ploidy allele chromosome begin end varType reference alleleSeq varScoreVAF varScoreEAF varQuality hapLink xRef
17 2 all chr1 11365 11370 ref = =
302 2 1 chr1 21579 21580 snp C T 123 123 VQHIGH dbsnp.83:rs526642
302 2 2 chr1 21579 21580 snp C T 153 153 VQHIGH dbsnp.83:rs526642
They provide some tools to work with it, and I tried converting the file to VCF with one of them, but what I get is an odd VCF with duplicated entries and inconsistent information.
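For anyone who wants to look at the rows directly rather than through the converted VCF, here is a minimal Python sketch of how a file like this could be read. It rests on assumptions I have not confirmed against the format documentation: that fields are tab-separated, that lines starting with `#` are metadata, that the single line starting with `>` names the columns, and that trailing empty fields may be missing. The function name `read_var_rows` is just an illustrative name, not part of any Complete Genomics tool.

```python
import io


def read_var_rows(handle):
    """Yield one dict per data row, keyed by the names on the '>' header line.

    Assumes tab-separated fields, '#' metadata lines, and one '>' column line.
    """
    columns = None
    for line in handle:
        line = line.rstrip("\r\n")
        # Skip blank lines and '#' metadata lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if line.startswith(">"):
            # Column header: drop the leading '>' from the first name.
            columns = [fields[0][1:]] + fields[1:]
            continue
        if columns is None:
            raise ValueError("data row found before the '>' column header")
        # Pad rows whose trailing empty fields are missing.
        fields += [""] * (len(columns) - len(fields))
        yield dict(zip(columns, fields))


# Small inline sample modelled on the excerpt above (tabs are an assumption).
sample = (
    "#FORMAT_VERSION\t2.0\n"
    "\n"
    ">locus\tploidy\tallele\tchromosome\tbegin\tend\tvarType\n"
    "17\t2\tall\tchr1\t11365\t11370\tref\n"
    "302\t2\t1\tchr1\t21579\t21580\tsnp\n"
    "302\t2\t2\tchr1\t21579\t21580\tsnp\n"
)

for row in read_var_rows(io.StringIO(sample)):
    print(row["locus"], row["allele"], row["chromosome"], row["begin"], row["varType"])
```

Note that in the excerpt the same locus (302) appears once per allele, so a row-by-row conversion would naturally emit two records for one position. That may be related to the duplicated entries I see, but I have not verified it.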
Has anyone dealt with this format before?
Thanks in advance!