Question: How To Get A SNP Matrix For Population Genetic Analysis From SNP Variant Files
9.8 years ago by
Jianfengmao310 wrote:

Dear seqanswers,

I am new to genomics and bioinformatics. In my current study, we have sequenced the genomes of tens of accessions of a plant using an Illumina next-generation sequencer. The short reads of each accession have been aligned to the reference, and SNPs and short indels have been predicted for each accession genome relative to the reference. We now have a separate SNP file for each accession in the following format (a text file with the columns listed below; the accession name is the same on every line of a given accession's file):

<accession names> <chromosome><position><reference base><cons

But usually, we need to align all the accessions in the following format for classical population genetic analysis:

accession_1, a,t,g,,,
accession_2, a,t,c,,,
accession_3, t,a,c,,,

I would appreciate help or suggestions on how to do this format conversion, or on any alternative approaches, using R and Bioconductor or other tools. If it requires database operations, how should I do that?

Thanks in advance.


This is the Biostar group, not SEQanswers.

written 9.6 years ago by Thaman

Relevant Bioconductor thread

written 9.8 years ago by Brad Chapman
9.8 years ago by
Boston, MA USA
Larry_Parnell16k wrote:

I know that very similar work was done a few years ago by Joe Ecker's group at the Salk Institute with something like 98 accessions of Arabidopsis thaliana. You should get their published article on this and then find out who in their group processed the data in a way that matches your objectives.

One version of that work is here.

Update for 30 Aug 2011: Nature article on multiple reference genomes and transcriptomes for Arabidopsis thaliana and one from Nature Genetics on whole-genome sequencing of 80 A. thaliana populations are now both available.

9.6 years ago by
Hong Kong
Tommy Au10 wrote:

You could try the R function reshape(). For example, if you have a tab-delimited file like this:

acc1    snp1    a
acc1    snp2    g
acc2    snp1    t
acc2    snp2    g

Import the file as a data frame and use reshape() with direction "wide":

> genotypes<-read.table("",sep="\t",header=FALSE,col.names=c("acc","snp","genotype"))
> genotypes
   acc  snp genotype
1 acc1 snp1        a
2 acc1 snp2        g
3 acc2 snp1        t
4 acc2 snp2        g
> genotype.table<-reshape(genotypes, idvar="snp", timevar="acc", direction="wide")
> genotype.table
   snp genotype.acc1 genotype.acc2
1 snp1             a             t
2 snp2             g             g
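
The table above has one row per SNP, whereas the question asks for one row per accession. Swapping idvar and timevar gives that orientation. The following is only a minimal sketch on the same toy data; the data frame is built inline instead of being read from a file, and the object names wide and snp.matrix are made up for illustration:

```r
# Toy long-format data, same values as in the example above
# (built inline here; in practice it would come from read.table()).
genotypes <- data.frame(
  acc      = c("acc1", "acc1", "acc2", "acc2"),
  snp      = c("snp1", "snp2", "snp1", "snp2"),
  genotype = c("a", "g", "t", "g"),
  stringsAsFactors = FALSE
)

# One row per accession, one column per SNP.
wide <- reshape(genotypes, idvar = "acc", timevar = "snp", direction = "wide")

# Drop the id column and keep the genotypes as a character matrix,
# with accessions as row names and SNP ids as column names.
snp.matrix <- as.matrix(wide[, -1, drop = FALSE])
rownames(snp.matrix) <- wide$acc
colnames(snp.matrix) <- sub("^genotype\\.", "", colnames(snp.matrix))

# Print as comma-separated text (file = "" is the default, i.e. the console).
write.table(snp.matrix, sep = ",", quote = FALSE, col.names = NA)
```

Note that if an accession has no line for a given SNP, reshape() leaves NA in that cell. Whether that should be read as the reference base or as missing data depends on how the SNP files were produced, so the sketch does not fill it in.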


9.2 years ago by
New York
Vitis2.4k wrote:

I'd get the BAM alignment files, call pileups with consensus, and reconstruct the query sequences. This way, you'll have better control over the quality of, and confidence in, the SNPs. The reconstructed sequences are also very useful for any population or evolutionary studies.
