Question

How To Get A SNP Matrix For Population Genetic Analysis From SNP Variant Files

4

13.4 years ago

Jianfengmao ▴ 310

Dear seqanswers,

I am new to genomics and bioinformatics. In our current study, we have sequenced the genomes of tens of accessions of a plant using an Illumina next-generation sequencer. The short reads of each accession have been aligned to the reference, and SNPs and short indels have been predicted for each accession genome against the reference. We now have a separate SNP file for each accession in the following format (plain text; the columns below are listed for each accession, and the accession name stays the same throughout a given accession's file):

<accession names> <chromosome><position><reference base><cons base><quality><support><concordance><avg_hits>

But usually, we need to align all the accessions in the following format for classical population genetic analysis:

<accessions><SNP_1><SNP_2><SNP_3><SNP_...>
accession_1, a,t,g,,,
accession_2, a,t,c,,,
accession_3, t,a,c,,,
accession_,,,,,,,,,,,,,

I would appreciate help or suggestions on how to do this format conversion, or on any alternative approaches, using R and Bioconductor or other tools. If it requires database operations, how would I do that?

Thanks in advance.

snp matrix population genetics • 4.5k views

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 13.4 years ago by Jianfengmao ▴ 310

2

This is the Biostar group, not seqanswers.

ADD REPLY • link 13.2 years ago by Thaman ★ 3.3k

1

Relevant Bioconductor thread

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 13.4 years ago by Brad Chapman 9.7k

score 2 · Answer 1 · 2010-12-13

I know that very similar work was done a couple-few years ago by Joe Ecker's group at the Salk Institute with something like 98 accessions of Arabidopsis thaliana. You should get their published article on this and then find out who in their group manipulated the data in a manner similar to the objectives you have.

One version of that work is here.

Update for 30 Aug 2011: a Nature article on multiple reference genomes and transcriptomes for Arabidopsis thaliana and a Nature Genetics article on whole-genome sequencing of 80 A. thaliana populations are now both available.

Ram · Answer 2 · 2011-02-02

You could try the R function reshape(). For example, if you have a tab-delimited file called genotypes.tab:

acc1    snp1    a
acc1    snp2    g
acc2    snp1    t
acc2    snp2    g

Import the file as a data frame and use reshape() in direction "wide":

> genotypes<-read.table("genotypes.tab",sep="\t",header=FALSE,col.names=c("acc","snp","genotype"))
> genotypes
   acc  snp genotype
1 acc1 snp1        a
2 acc1 snp2        g
3 acc2 snp1        t
4 acc2 snp2        g
> genotype.table<-reshape(genotypes, idvar="snp", timevar="acc", direction="wide")
> genotype.table
   snp genotype.acc1 genotype.acc2
1 snp1             a             t
2 snp2             g             g

Done.
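
If you want accessions in rows and SNPs in columns, as in the question, a base R sketch along the following lines might help. It is only an illustration under stated assumptions: the data frame `calls` and its column names (`acc`, `chr`, `pos`, `ref`, `cons`) are hypothetical stand-ins for the per-accession SNP files after they have been read with read.table() and stacked with rbind(). It also assumes that an accession with no call at a site carries the reference base, which will not be true where a site simply had no coverage.

```r
# Hypothetical long-format SNP calls; in practice these rows would come from
# the per-accession files, read with read.table() and combined with rbind().
calls <- data.frame(
  acc  = c("acc1", "acc1", "acc2", "acc3"),
  chr  = c("chr1", "chr1", "chr1", "chr1"),
  pos  = c(100L, 250L, 100L, 250L),
  ref  = c("a", "g", "a", "g"),
  cons = c("t", "c", "t", "a"),
  stringsAsFactors = FALSE
)

# One identifier per site, e.g. "chr1:100".
calls$snp <- paste(calls$chr, calls$pos, sep = ":")

# Empty accession x SNP matrix with accessions as rows.
accs <- sort(unique(calls$acc))
snps <- unique(calls$snp)
snp.matrix <- matrix(NA_character_, nrow = length(accs), ncol = length(snps),
                     dimnames = list(accs, snps))

# Fill in the called bases, indexing by (row name, column name) pairs.
snp.matrix[cbind(calls$acc, calls$snp)] <- calls$cons

# ASSUMPTION: no call at a site means the accession has the reference base.
ref.by.snp <- tapply(calls$ref, calls$snp, function(x) x[1])
missing <- which(is.na(snp.matrix), arr.ind = TRUE)
snp.matrix[missing] <- ref.by.snp[colnames(snp.matrix)[missing[, "col"]]]

print(snp.matrix)
```

If you would rather keep missing calls as NA, drop the last three statements. The result is a character matrix that can be written out with write.table() or converted with as.data.frame().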

score 1 · Answer 3 · 2011-07-07

1

12.8 years ago

Vitis ★ 2.5k

I'd get the BAM alignment files, call pileups with consensus, and reconstruct the query sequences. This way, you'll have better control over the quality of, and confidence in, the SNPs. The reconstructed sequences are also very useful for any population or evolutionary studies.