Question

SNPs used to create tree using RELATE

3.5 years ago

njandaro ▴ 10

Hi!

I am trying to estimate genealogical trees using RELATE, which I will later use as input for PALM (Stern et al 2020). I am using the 1000 Genomes dataset, focusing on the EUR populations.

There are two steps of RELATE's algorithm that I do not understand:

1. At the data preparation stage, about 80% of all SNPs are removed whenever I subset to a particular population. There is no message explaining why these SNPs are dropped, and I cannot spot any pattern among the kept or removed SNPs.
2. What RELATE treats as the ancestral allele of a SNP does not seem to match the ancestral allele given in the INFO column of the original VCF file. I am passing RELATE the ancestral FASTA files that I downloaded from 1000 Genomes, so in principle both pieces of ancestral information come from the same source. Yet the allele coded as 0 in the RELATE output file does not correspond one-to-one with the ancestral allele in the INFO column.

Regarding the second point, I am not sure whether I am doing something wrong. I simply downloaded the zip file with the ancestral alleles, unzipped it, and gave RELATE the path to the .fa file for the corresponding chromosome.

I also don't understand the FASTA files. For example, chromosome 1 in the original VCF file has a little over 6 million SNPs, while the FASTA file for chromosome 1 has more than 200 million characters. I have no idea how to map 200 million characters onto 6 million SNPs.
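One way the two files could line up is sketched below, under the assumption that the ancestral FASTA holds one character per base of the chromosome, so that the character at 1-based index POS belongs to the VCF record at that POS. The function names and toy data are hypothetical, treating lower-case letters as valid (possibly lower-confidence) calls is also an assumption, and this is not RELATE's own code.

```python
import io


def read_fasta_sequence(handle):
    """Concatenate the sequence lines of a single-record FASTA into one string."""
    parts = []
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and the header line
        parts.append(line)
    return "".join(parts)


def ancestral_base(sequence, pos):
    """Return the character at 1-based position pos, or None if out of range."""
    if pos < 1 or pos > len(sequence):
        return None
    return sequence[pos - 1]


def classify_snp(sequence, pos, ref, alt):
    """Say which allele (if any) matches the ancestral character at pos."""
    base = ancestral_base(sequence, pos)
    if base is None:
        return "unknown"
    base = base.upper()  # assumption: case only encodes confidence
    if base == ref.upper():
        return "ref_is_ancestral"
    if base == alt.upper():
        return "alt_is_ancestral"
    return "unknown"  # e.g. N, '.', '-' or a third base


# Toy stand-in for an ancestral .fa file: 10 positions, wrapped over two lines.
toy_fasta = io.StringIO(">ANCESTOR_toy\nACGTN\nacg.-\n")
seq = read_fasta_sequence(toy_fasta)

# Toy stand-in for VCF records: (POS, REF, ALT).
toy_snps = [(2, "C", "T"), (3, "A", "G"), (5, "A", "C"), (6, "A", "G")]
for pos, ref, alt in toy_snps:
    print(pos, ref, alt, classify_snp(seq, pos, ref, alt))
```

If that assumption holds, most of the 200 million characters simply sit at positions where the VCF has no record, and only the characters at SNP positions matter. Comparing the result of such a lookup with the INFO column, SNP by SNP, might show where the two sources disagree.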

If someone could clarify any of the above confusions/questions, I'd be very grateful!

Best, Fatima

RELATE 1000 genomes SNP • 556 views
