Question: Parsing Variants From 1000 Genomes Data
7.9 years ago by
Halit90 wrote:

Hey guys,

I need to extract polymorphism data from 1000 genomes data for about 85 coding genes.

What I need in particular for each gene is (1) silent polymorphisms, (2) amino acid changing polymorphisms, (3) stop-inducing polymorphisms, and (4) global allele frequencies (I guess, for n = 1029).

I know I can get this information through the official 1000 Genomes browser or the Ensembl website. But is there any way that I can automate this process and do it in one go?

I thought of the following strategy, but perhaps you can suggest a cleaner and faster way.

  1. Get chromosomal position data for each gene (i.e., exon start / end, +/- strand) [from UCSC perhaps?!]
  2. Download genotype files (vcf) for each chromosome for 1000 genomes [phase 1, release v3, March 2011 calls or should I just stick to high coverage data?]
  3. Take the vcf for Chr 1 and check whether each SNP falls within an exon; if it does, note it down.
  4. Among the noted SNPs, parse the variant allele and allele frequency (AF)
  5. Determine the amino acid position of the corresponding SNP
  6. Check the resulting amino acid state when the variant is introduced (be careful about +/- strand)
  7. Classify the polymorphisms and report the corresponding AF
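
Steps 3 to 7 can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not a tested pipeline: the helper names (parse_af, cds_offset, classify_snp, annotate_vcf_line) and the toy gene are hypothetical, the coding sequence is assumed to be supplied in transcript orientation, exon intervals are assumed to be 1-based inclusive coding intervals sorted by genomic position, and only biallelic SNPs are handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_af(info):
    """Return the AF values of a VCF INFO string as floats ([] if AF is absent)."""
    for item in info.split(";"):
        if item.startswith("AF="):
            return [float(x) for x in item[3:].split(",")]
    return []


def cds_offset(pos, exons, strand):
    """Map a 1-based genomic position to a 0-based CDS offset (None if not coding)."""
    ordered = exons if strand == "+" else exons[::-1]
    offset = 0
    for start, end in ordered:
        if start <= pos <= end:
            return offset + (pos - start if strand == "+" else end - pos)
        offset += end - start + 1
    return None


def classify_snp(cds, pos, alt, exons, strand):
    """Classify a SNP; alt is given on the forward genomic strand, as in a VCF."""
    off = cds_offset(pos, exons, strand)
    if off is None:
        return None
    if strand == "-":
        alt = alt.translate(COMPLEMENT)  # express the allele on the coding strand
    codon_start = off - off % 3
    ref_codon = cds[codon_start:codon_start + 3]
    alt_codon = ref_codon[:off % 3] + alt + ref_codon[off % 3 + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        kind = "silent"
    elif alt_aa == "*":
        kind = "stop-gained"
    else:
        kind = "missense"
    return kind, codon_start // 3 + 1, ref_aa, alt_aa


def annotate_vcf_line(line, cds, exons, strand):
    """Return (class, aa_position, ref_aa, alt_aa, AF) for one VCF data line."""
    _chrom, pos, _id, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
    if len(ref) != 1 or len(alt) != 1:
        return None  # indels and multi-allelic sites are skipped in this sketch
    result = classify_snp(cds, int(pos), alt, exons, strand)
    if result is None:
        return None
    afs = parse_af(info)
    return result + (afs[0] if afs else None,)


# Toy gene (hypothetical): two coding exons on the + strand encoding M-W-K.
toy_cds = "ATGTGGAAA"
toy_exons = [(101, 106), (201, 203)]
toy_line = "1\t106\trs_demo\tG\tA\t100\tPASS\tAA=G;AN=2184;AF=0.02"
print(annotate_vcf_line(toy_line, toy_cds, toy_exons, "+"))
```

The sketch does not verify that the VCF reference allele matches the coding sequence; a real run would want that check, since it catches strand and coordinate mistakes early.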

Can you update us on how you eventually performed this analysis?

written 6.4 years ago by User 1933340
7.9 years ago by
Martin Morgan1.6k
United States
Martin Morgan1.6k wrote:

The locateVariants and predictCoding functions in the VariantAnnotation package perform these operations in R / Bioconductor; there are some issues with strand handling that are likely to be addressed in the next day or so (e.g., by April 25, 2012). Data come from (user-supplied) VCF files, with things like the genome sequence and UCSC known genes provided by annotation packages, e.g., BSgenome.Hsapiens.UCSC.hg19 and TxDb.Hsapiens.UCSC.hg19.knownGene. Both genome and transcript databases can be customized for non-model organisms. See the VariantAnnotation vignette (pdf) for details.


Thanks Martin. This looks pretty useful. I wanted to try "finding all coding SNPs in chr22", but I failed at the first step. I (1) downloaded the vcf file for chr22 from the 1KG FTP site, (2) loaded the relevant package by calling library(VariantAnnotation), (3) pointed to the file by specifying inputFile <- system.file("extdata", "Chromosome22.vcf.gz", package="VariantAnnotation"), and then (4) read the file content with vcf <- readVcf(inputFile, "hg19"). At this step, I get the following error: Error: scanVcf: record 28059 INFO '0|0:0.000:-0.05,-0.96,-5.00' not found path: C:\Users\PC101517\Documents\R\win-library\2.15\VariantAnnotation\extdata\Chromosome22.vcf.gz /// Any thoughts would be very much appreciated.

written 7.9 years ago by Halit90

VariantAnnotation is complaining about a VCF record (28059) that it cannot parse; it looks like a genotype field is being parsed as an INFO field. I'd suggest posting to the Bioconductor mailing list, where you can provide sessionInfo and perhaps the portion of the file that is causing problems.
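
One way to pull out the suspect record before posting is a small stand-alone check like the following. It is only a sketch: the function names (info_looks_valid, inspect_record) are hypothetical, the INFO check is a rough heuristic rather than a full VCF validator, and it assumes the record number counts data lines after the header, which may not be how scanVcf counts.

```python
import re

# INFO entries are ';'-separated keys or key=value pairs; a key should look like an identifier.
INFO_KEY = re.compile(r"^[A-Za-z_][0-9A-Za-z_.]*$")


def info_looks_valid(info):
    """Heuristic: every ';'-separated INFO entry must start with a plausible key."""
    if info == ".":
        return True
    return all(INFO_KEY.match(item.split("=", 1)[0]) for item in info.split(";"))


def inspect_record(lines, record_number):
    """Describe the record_number-th data line (1-based, header lines excluded)."""
    expected = None
    seen = 0
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#"):
            expected = len(fields)  # the #CHROM header fixes the column count
            continue
        seen += 1
        if seen == record_number:
            info = fields[7] if len(fields) > 7 else None
            return {
                "n_fields": len(fields),
                "expected_fields": expected,
                "info": info,
                "info_ok": info is not None and info_looks_valid(info),
            }
    return None
```

For a compressed file, the lines can come from gzip.open(path, "rt"), e.g. inspect_record(gzip.open("Chromosome22.vcf.gz", "rt"), 28059). A field count that differs from the header, or an INFO column that looks like '0|0:0.000:...', would point to a shifted or truncated line.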

written 7.9 years ago by Martin Morgan1.6k
7.9 years ago by
Cambridge UK
Laura1.7k wrote:

You could find your gene in the 1000 genomes browser

Get a vcf file for it using our most recent release

and the data slicer

Then, if it is small enough (<750 variants), you can use the web interface to the Variant Effect Predictor

Alternatively you can use the script

Please use the most recent vcf files; they will be much more accurate than any older data set


Thanks for the reply, Laura. While this seems the most straightforward approach, the number of genes I want to analyze has been increasing, so I would like to find an automated way to efficiently process hundreds of genes across all chromosomes.

written 7.9 years ago by Halit90
7.9 years ago by
Sean Davis26k
National Institutes of Health, Bethesda, MD
Sean Davis26k wrote:

You could download the VCF from 1000G and then run ANNOVAR, the Ensembl Variant Effect Predictor, or snpEff.


Thanks, Sean. I downloaded ANNOVAR and will check it out in an hour or so. Let's see if it's any good for my problem.

written 7.9 years ago by Halit90