Question: Extracting SNP data for specific rs#s from published genome sequences
6.5 years ago by
Stony Brook
devenvyas650 wrote:

I have a question about a project I am conceptualizing. I have no experience yet dealing with nuclear DNA, so I am unsure how to approach it.

I have SNP data on 64 samples from my population of interest (~330,000 SNPs per sample using the HumanCNV370-Quad).

I will likely be SNP typing some more samples in the near future, but I wanted to see what I can do with the existing SNP data with regard to estimating archaic introgression. I know Sánchez-Quinto et al. (2012) and Reich et al. (2011) used f4 statistics (described in depth by Patterson et al. (2012)) to estimate Neanderthal and Denisovan ancestry, respectively, using SNP data.

Basically, the ratio f4(A,O;X,C) / f4(A,O;B,C) is the estimator of Neanderthal ancestry when A=Denisovan, B=Neanderthal, C=YRI, O=Pan troglodytes or Pan paniscus, and X=my data and other comparative populations.
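To make the ratio above concrete, here is a minimal Python sketch that computes it from per-SNP allele frequencies. It assumes f4(A,O;X,C) is taken as the average over SNPs of (a - o)(x - c), which is how Patterson et al. (2012) define the statistic. The function and argument names are made up for illustration, and the sketch gives only a point estimate: it does not compute standard errors (published analyses use a block jackknife) and it is not a substitute for ADMIXTOOLS.

```python
from typing import Sequence


def f4(p_a: Sequence[float], p_o: Sequence[float],
       p_x: Sequence[float], p_c: Sequence[float]) -> float:
    """Average over SNPs of (a - o) * (x - c), given allele frequencies
    of the same allele at the same SNPs in four populations."""
    n = len(p_a)
    if n == 0 or not (len(p_o) == len(p_x) == len(p_c) == n):
        raise ValueError("all four frequency lists must be non-empty and equal in length")
    return sum((a - o) * (x - c)
               for a, o, x, c in zip(p_a, p_o, p_x, p_c)) / n


def f4_ratio(denisovan: Sequence[float], chimp: Sequence[float],
             test: Sequence[float], neanderthal: Sequence[float],
             yri: Sequence[float]) -> float:
    """f4(A,O;X,C) / f4(A,O;B,C) with A=Denisovan, O=chimp, X=test,
    B=Neanderthal, C=YRI (argument names are illustrative only)."""
    numerator = f4(denisovan, chimp, test, yri)
    denominator = f4(denisovan, chimp, neanderthal, yri)
    if denominator == 0:
        raise ZeroDivisionError("f4(A,O;B,C) is zero; the ratio is undefined")
    return numerator / denominator
```

For single high-coverage genomes such as the archaic ones, the "frequency" at each SNP would simply be 0, 0.5, or 1 depending on the genotype.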

I need to be able to align a Pan genome to the high-coverage Altai Neanderthal and Denisovan genomes and the YRI genomes to extract polymorphism data for the ~330,000 rs numbers the array typed, and then filter out C-T/G-A (Modern-Archaic) sites. I have no idea how to start on this. Does anyone here have an idea of where I should begin? Thanks!
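The C-T/G-A filtering step mentioned above could look roughly like this in Python. This is only a sketch: it assumes you already have, for each SNP, the allele seen in modern humans and the allele seen in the archaic genome, and it drops sites where the modern allele is C and the archaic one is T, or the modern allele is G and the archaic one is A. The tuple layout and function names are hypothetical. Some studies drop all transitions regardless of direction, which would need a slightly broader check.

```python
# (modern allele, archaic allele) pairs to remove, as described in the question.
DAMAGE_PAIRS = {("C", "T"), ("G", "A")}


def is_damage_like(modern_allele: str, archaic_allele: str) -> bool:
    """True if the site is a C-T or G-A (modern-archaic) difference."""
    return (modern_allele.upper(), archaic_allele.upper()) in DAMAGE_PAIRS


def drop_damage_like(sites):
    """Keep only sites that are not C-T/G-A (modern-archaic).

    `sites` is an iterable of (rsid, modern_allele, archaic_allele)
    tuples; this layout is an assumption made for the example.
    """
    return [site for site in sites if not is_damage_like(site[1], site[2])]
```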

Just guessing here, but how about extracting a short stretch of sequence around each of your SNPs (say 150 bp, with the SNP falling at a random position within it), aligning those sequences to the other genomes, and calling SNPs on them?
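As a rough illustration of the windowing idea in this comment, here is a small Python sketch that cuts a 150 bp window out of a chromosome sequence so that the SNP lands at a random position inside it. It assumes the chromosome is already loaded as a plain string and that the SNP position is 1-based; the function name is made up, and the alignment and SNP-calling steps are not covered.

```python
import random


def window_around(seq: str, pos: int, length: int = 150, rng=None):
    """Return (start, subsequence): a window of `length` bases that
    contains the 1-based position `pos` at a random offset.

    `start` is the 1-based coordinate of the first base in the window.
    Windows are shifted inward near the ends of the sequence.
    """
    rng = rng or random.Random()
    if not 1 <= pos <= len(seq):
        raise ValueError("pos must lie within the sequence")
    length = min(length, len(seq))
    idx = pos - 1                          # 0-based index of the SNP
    lowest = max(0, idx - length + 1)      # earliest start still covering the SNP
    highest = min(idx, len(seq) - length)  # latest start that fits in the sequence
    start = rng.randint(lowest, highest)
    return start + 1, seq[start:start + length]
```

Passing a seeded `random.Random` makes the windows reproducible between runs.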

I unfortunately do not know how to do that. Beyond assembling mitogenomes or getting BEAST to run, I'm still very new to computational stuff.

I have managed to find the Altai Neanderthal VCF files and have started downloading them to the cluster. I am trying to filter the SNPs by these rs numbers, but I keep getting error messages like the one shown below:

VCFtools - v0.1.11
(C) Adam Auton 2009

Parameters as interpreted:
        --gzvcf AltaiNea.hg19_1000g.1.mod.vcf.gz
        --out filtered_AltaiNea.hg19_1000g.1_
        --snps 330k.txt

Using zlib version: 1.2.3
Versions of zlib >= 1.2.4 will be *much* faster when reading zipped VCF files.
Reading Index file.
Building new index file.
        Scanning Chromosome: 1
        Warning - file contains entries with the same position. These entries will be processed separately.

        Scanning Chromosome: ;GAnc=C;OAnc=C;bSC=640;mSC=0.001;pSC=0.138;GRP=0.28;Map20=1
        Scanning Chromosome: 1
Error: VCF file is not sorted at position 1:3.
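One possible reading of the log above is that the file contains a broken record: the "chromosome" reported as `;GAnc=C;OAnc=C;...` looks like the tail of an INFO field sitting at the start of a line, which could then also trigger the sorting error. That is only a guess from the output. A small Python sketch like the following could help check it by listing data lines that do not have the eight tab-separated fixed VCF columns, or whose position goes backwards within a chromosome. The function names are made up for this example.

```python
import gzip


def find_suspect_lines(lines, min_fields=8):
    """Return (line_number, kind, preview) for suspicious VCF data lines.

    kind is "malformed" (too few tab-separated fields or a non-numeric
    POS) or "unsorted" (POS smaller than the previous POS on the same
    chromosome). Header lines starting with '#' are skipped.
    """
    problems = []
    last_chrom, last_pos = None, -1
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < min_fields or not fields[1].isdigit():
            problems.append((number, "malformed", line[:60]))
            continue
        chrom, pos = fields[0], int(fields[1])
        if chrom != last_chrom:
            last_chrom, last_pos = chrom, pos
        elif pos < last_pos:
            problems.append((number, "unsorted", line[:60]))
        else:
            last_pos = pos
    return problems


def scan_gzipped_vcf(path):
    """Run the check on a gzip-compressed VCF file."""
    with gzip.open(path, "rt") as handle:
        return find_suspect_lines(handle)
```

For example, `scan_gzipped_vcf("AltaiNea.hg19_1000g.1.mod.vcf.gz")` would return a list of flagged lines. If it reports a malformed line, an incomplete or corrupted download is one thing worth ruling out.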
