How To Get A SNP Matrix For Population Genetic Analysis From SNP Variant Files
12.3 years ago
Jianfengmao ▴ 310

I am new to genomics and bioinformatics. In my current study, we have sequenced the genomes of tens of accessions of a plant using an Illumina next-generation sequencer. The short reads of each accession have been aligned to the reference, and SNPs and short indels have been called for each accession against the reference. We now have a separate SNP file per accession in the following format (plain text; the columns are listed below, and the accession name is constant within a given accession's file):

<accession names> <chromosome><position><reference base><cons base><quality><support><concordance><avg_hits>


For classical population genetic analysis, however, we usually need all the accessions combined in a matrix like the following:

<accessions><SNP_1><SNP_2><SNP_3><SNP_...>
accession_1, a,t,g,,,
accession_2, a,t,c,,,
accession_3, t,a,c,,,
accession_,,,,,,,,,,,,,


I would appreciate help or suggestions on how to do this format conversion, or on any alternative approaches, using R and Bioconductor or other tools. If it requires database operations, how would I do that?

snp matrix population genetics • 4.1k views

This is the Biostar group, not SEQanswers.

12.3 years ago

I know that very similar work was done a few years ago by Joe Ecker's group at the Salk Institute with something like 98 accessions of Arabidopsis thaliana. You should get their published article on this and then find out who in their group handled the data in a way similar to what you are trying to do.

One version of that work is here.

Update for 30 Aug 2011: Nature article on multiple reference genomes and transcriptomes for Arabidopsis thaliana and one from Nature Genetics on whole-genome sequencing of 80 A. thaliana populations are now both available.

12.2 years ago
Tommy Au ▴ 10

You could try the R function reshape(). For example, suppose you have a tab-delimited file called genotypes.tab:

acc1    snp1    a
acc1    snp2    g
acc2    snp1    t
acc2    snp2    g


Import the file as a data frame and use reshape() in direction "wide":

> genotypes<-read.table("genotypes.tab",sep="\t",header=FALSE,col.names=c("acc","snp","genotype"))
> genotypes
   acc  snp genotype
1 acc1 snp1        a
2 acc1 snp2        g
3 acc2 snp1        t
4 acc2 snp2        g
> genotype.table<-reshape(genotypes, idvar="snp", timevar="acc", direction="wide")
> genotype.table
   snp genotype.acc1 genotype.acc2
1 snp1             a             t
2 snp2             g             g


Done.
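
If you want the layout from the question (accessions as rows, SNPs as columns), swap idvar and timevar. Below is a rough sketch, assuming the per-accession SNP files have already been pooled into one long table. The column names, the chromosome_position SNP naming and the toy values are made up for illustration. Filling missing calls with the reference base is also an assumption: it treats an accession with no call at a site as matching the reference, which will not hold for positions that were not covered.

```r
# Hypothetical long table pooled from the per-accession SNP files.
# Column names loosely follow the question; the values are invented.
calls <- data.frame(
  accession  = c("acc1", "acc1", "acc2"),
  chromosome = c("chr1", "chr1", "chr1"),
  position   = c(100L, 250L, 100L),
  ref        = c("a", "g", "a"),
  cons       = c("t", "c", "t"),
  stringsAsFactors = FALSE
)

# Give every site one identifier (naming scheme is an assumption)
calls$snp <- paste(calls$chromosome, calls$position, sep = "_")

# One row per accession, one column per SNP
wide <- reshape(calls[, c("accession", "snp", "cons")],
                idvar = "accession", timevar = "snp", direction = "wide")

# Drop the "cons." prefix that reshape() adds to the new columns
names(wide) <- sub("^cons\\.", "", names(wide))
rownames(wide) <- NULL

# ASSUMPTION: no call at a site means the accession carries the reference base
ref.base <- tapply(calls$ref, calls$snp, function(x) x[1])
for (s in setdiff(names(wide), "accession")) {
  wide[[s]][is.na(wide[[s]])] <- ref.base[[s]]
}

wide
```

Here acc2 has no call at chr1_250, so that cell is filled with the reference base g. If you have coverage information, it would be safer to leave uncovered sites as NA instead.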

11.7 years ago
Vitis ★ 2.5k

I'd take the BAM alignment files, generate pileups with consensus calls, and reconstruct the query sequences. That way you have better control over the quality of the SNPs and your confidence in them. The reconstructed sequences are also very useful for any population or evolutionary study.