Question

Finding SNPs in 1000genomes populations

6.5 years ago

andrew.ghaiyed • 0

Hi All,

I would like to identify SNPs that help distinguish Japanese and Chinese populations using 1000 Genomes data. I am a little lost among all the programs and software packages available for download, and was wondering if anyone could point me to a straightforward pipeline. I need to be able to rank the candidate SNPs in some way, so that I can prioritise them by how well each one discriminates between the two populations.

Thanks in advance!

1000genomes GWAS • 1.8k views

ADD COMMENT • link 6.5 years ago by andrew.ghaiyed • 0

Thank you very much Kevin,

I was able to download plink and set up a directory without problems, but when I try to download the VCF.gz files, every wget command fails with the same error: -bash: wget: command not found

I have tried both copying and pasting each line and typing the commands manually, but the result is the same.

Thanks again,

Andrew

ADD REPLY • link 6.5 years ago by andrew.ghaiyed • 0

It's possible that wget is not installed on your computer. You can try sudo apt install wget (you will require administrator rights, and will have to provide a password).

Otherwise, try curl -O instead of wget
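A minimal sketch of that fallback, assuming a bash-like shell. The function names and the `$VCF_URL` variable are made up for illustration; substitute the real download address from the tutorial.

```shell
# Report which downloader is available; prefer wget, fall back to curl.
pick_downloader() {
    if command -v wget >/dev/null 2>&1; then
        echo "wget"
    elif command -v curl >/dev/null 2>&1; then
        echo "curl"
    else
        echo "neither wget nor curl found" >&2
        return 1
    fi
}

# Download one URL into the current directory, keeping the remote file name.
fetch() {
    local url="$1" tool
    tool="$(pick_downloader)" || return 1
    case "$tool" in
        wget) wget "$url" ;;
        curl) curl -L -O "$url" ;;   # -O keeps the remote name, -L follows redirects
    esac
}

# Example (hypothetical variable holding the file's address):
# fetch "$VCF_URL"
```

With this in place, each wget line in a download script can be swapped for a call to fetch, and the same script runs on machines that only have curl.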

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

Thanks Kevin, you have been an incredible help!

I was able to get wget on my computer and have moved a bit further along. I have managed to download the PED file and the reference files, but I am struggling to convert the 1000 Genomes files to BCF. My input code is as follows (/Volumes/Seagate is the directory on my external hard drive):

for chr in {1..22} X; do
    bcftools norm -Ou -m-any /Volumes/Seagate/1000Genomes/chr$chr.1kg.phase3.v5.vcf.gz |
        bcftools norm -Ou -f /Volumes/Seagate/ReferenceMaterial/1000Genomes/human_g1k_v37.fasta |
        bcftools annotate -Ob -I +'%CHROM:%POS:%POS:%REF:%ALT' > /Volumes/Seagate/1000Genomes/chr$chr.1kg.phase3.v5.bcf ;

    bcftools index /Volumes/Seagate/1000Genomes/chr$chr.1kg.phase3.v5.bcf ;
done

but plink returns:

[main] Unrecognized command.

Do you know what the most likely cause is?

Thanks again!

ADD REPLY • link 6.4 years ago by andrew.ghaiyed • 0

Hey Andrew, are you specifying the --bcf command line parameter when trying to read into PLINK?

plink --noweb --bcf chr$chr.1kg.phase3.v5.bcf --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed --out /chr$chr.1kg.phase3.v5 ;
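Before rerunning, a quick check that both tools are on the PATH, and which versions they report, may help narrow down which program printed the error. This is only a sketch; check_tool is a made-up helper name.

```shell
# Report whether a command is available on the PATH, and where it lives.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
        return 1
    fi
}

# Example usage (left commented out so the snippet has no side effects):
# check_tool bcftools && bcftools --version | head -n 1
# check_tool plink && plink --version
```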

ADD REPLY • link 6.4 years ago by Kevin Blighe 87k

score 0 · Answer 1 · 2017-10-31

Hey Andrew,

You could follow my tutorial here: Produce PCA for 1000 Genomes Phase III in VCF format

You will have to adapt it to your own needs; however, even by following this simple example, you will be able to identify SNPs in the 1000 Genomes Phase III populations that distinguish the major population groups. From there, you should easily be able to narrow the analysis down to just the Japanese and Chinese groups.
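One possible way to rank SNPs between the two groups, which is not part of the tutorial itself, is per-SNP Fst in PLINK 1.9. The sketch below assumes a PLINK binary fileset and a hypothetical cluster file (columns FID, IID, population label) in which Han Chinese in Beijing and Japanese in Tokyo samples carry the 1000 Genomes labels CHB and JPT. The function names and file names are illustrative only.

```shell
# Per-SNP Fst between two population labels.
# Assumes PLINK 1.9 and a cluster file with columns: FID IID POP.
run_fst() {
    local bfile="$1" clusters="$2" out="$3"
    plink --bfile "$bfile" \
        --within "$clusters" \
        --keep-cluster-names CHB JPT \
        --fst \
        --out "$out"
}

# Rank the resulting .fst table (columns: CHR SNP POS NMISS FST) by Fst,
# highest first, skipping the header line and SNPs whose Fst is "nan".
rank_fst() {
    awk 'NR > 1 && $5 != "nan" { print $2, $5 }' "$1" | sort -k2,2gr
}

# Example usage (hypothetical file names):
# run_fst merged_1kg populations.txt chb_vs_jpt
# rank_fst chb_vs_jpt.fst | head -n 50
```

SNPs at the top of the ranked list show the largest allele-frequency differentiation between the two groups. Whether Fst is the right ranking criterion depends on the downstream use, so treat it as a starting point.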

I have already built an ethnicity predictive model from the 1000 Genomes data that has >99.999% sensitivity (non-commercial). I have been using it in a private project in order to help identify samples with unreported ethnicity. It was built in R using glm() and has been cross-validated.

Good luck, Kevin