Hi!
I am trying to estimate genealogical trees with RELATE, to be used later as input for PALM (Stern et al. 2020). I am using the 1000 Genomes dataset, focusing on the EUR populations.
There are two steps of RELATE's algorithm that I do not understand.
1. At the data preparation stage, 80% of all SNPs are removed whenever I subset a particular population. There is no message explaining why these SNPs are thrown out, and I cannot spot any pattern among the kept or removed SNPs.
2. What RELATE treats as the ancestral allele of a SNP does not seem to match the ancestral allele given in the INFO column of the original VCF file. I pass RELATE the ancestral FASTA files that I downloaded from 1000 Genomes, so in principle both sets of ancestral information come from the same source. Yet the allele coded as 0 in the RELATE output file does not correspond one-to-one with the ancestral allele in the INFO column.
On the latter point, I am not sure whether I am doing something wrong. I simply downloaded the zip file with the ancestral alleles, unzipped it, and gave RELATE the path to the .fa file for the corresponding chromosome.
I also don't understand the FASTA files. For example, chromosome 1 in the original VCF file has a little more than 6 million SNPs, while the FASTA file for chromosome 1 has more than 200 million characters. I have no idea how to relate 200 million characters to 6 million SNPs.
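My working guess, which I would be glad to have confirmed or corrected, is that the FASTA file holds one character per base of the whole chromosome rather than one per SNP, so that a SNP at VCF position POS would correspond to the POS-th character of the sequence (counting from 1, ignoring the header line and line breaks). Below is a minimal Python sketch of the lookup I have in mind. The toy sequence, the positions, and the helper names are made up for illustration; they are not part of RELATE or of the 1000 Genomes files.

```python
import io


def read_fasta_sequence(handle):
    """Join all sequence lines of a single-record FASTA, skipping the '>' header."""
    parts = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        parts.append(line)
    return "".join(parts)


def ancestral_allele(sequence, pos):
    """Return the character at 1-based position `pos` (VCF POS is 1-based)."""
    return sequence[pos - 1]


# Toy stand-in for an ancestral FASTA (hypothetical content, two sequence lines).
toy_fasta = io.StringIO(">toy_chr\nACGTn\n.-Tca\n")
seq = read_fasta_sequence(toy_fasta)

# Hypothetical SNP positions, as they would appear in the POS column of a VCF.
snp_positions = [2, 5, 8]
alleles = {pos: ancestral_allele(seq, pos) for pos in snp_positions}
print(alleles)
```

If this is right, I could run the same lookup on the real chromosome 1 file and compare the result, SNP by SNP, with the ancestral allele in the INFO column and with the allele RELATE codes as 0.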
If someone could clarify any of the above confusions/questions, I'd be very grateful!
Best, Fatima