Question: Determining which new SNPs in 1000G data result in coding changes
Asked 8.2 years ago by Ryan D (USA):

As you all know the 1000G sites are a sheer freaking delight to navigate.

But I understand from this announcement (http://www.1000genomes.org/page.php?page=announcements) that the exonic sequences are available, but only cover 8,140 exons in 906 "randomly selected" genes: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/technical/working/20100511_snp_annotation/pilot3/annotated.CHB.final.P3.

The just-released paper has info on the 1000G pilot: http://www.1000genomes.org/bcms/1000_genomes/Documents/nature09534.pdf

In addition to mining the variants in LD with our GWAS-significant hits using haploxt (http://www.sph.umich.edu/csg/abecasis/GOLD/docs/haploxt.html), what is the best way to determine whether any variants in LD map to exons and cause coding changes? Is there a tool like PolyPhen through which they could be run in batches?

Comment from Ryan D:
In the 1000G paper it says: “In total, we found 68,300 non-synonymous SNPs, 34,161 of which were novel.”

Presumably a table with genome positions, rs numbers, and predictions for which variants are functional must exist in order to arrive at such a number.

If anyone could find that table or that data on the 1000G website, that would solve this problem.

Answer by Louis Letourneau (Montreal):

There are also snpEff (http://snpeff.sourceforge.net/) and ANNOVAR (http://www.openbioinformatics.org/annovar/), which can help with SNP effect prediction. They are both along the same lines as Pierre's consequence program; they just output more information, I think.

Both also have the capability of incorporating external GFF annotations into the result.
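For example, minimal invocations might look like the following (a sketch only: the hg19 database name and the file names are assumptions, so check each tool's documentation for what you actually have installed):

# snpEff: annotate a VCF against a pre-downloaded genome database (name assumed)
java -jar snpEff.jar hg19 variants.vcf > variants.snpeff.vcf

# ANNOVAR: convert the VCF to ANNOVAR's input format, then run gene-based annotation
perl convert2annovar.pl -format vcf4 variants.vcf > variants.avinput
perl annotate_variation.pl -geneanno -buildver hg19 variants.avinput humandb/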

Comment from Pierre Lindenbaum:
It's interesting, because the first tool (snpEff) has been integrated into Galaxy.

Answer by Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087):

PolyPhen has a 'batch' mode: http://genetics.bwh.harvard.edu/pph2/bgi.shtml
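The batch interface takes one variant per line in genomic coordinates, along the lines of chromosome:position followed by ref/alt alleles; a one-line sketch using the position from the comment below (the alleles are invented for illustration, and the exact format should be checked against the PolyPhen-2 documentation):

chr1:1158631 A/G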

There is an API developed by/for Ensembl:

Bioinformatics. 2010 Aug 15;26(16):2069-70. Epub 2010 Jun 18. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor.

On my side, last year, I wrote a simple program using the UCSC 'knownGene' table to predict the consequences of mutations. An early version was described here: http://plindenbaum.blogspot.com/2009/04/consequences-snp-cdna-proteins-etc.html
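As an illustration of the knownGene approach, UCSC's public MySQL server can be queried for transcripts overlapping a variant position (a sketch using the chr1:1158631 example from the comment below; the hg19 assembly and server hostname are assumptions to adapt to your setup):

# list knownGene transcripts whose span covers the position of interest
mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'SELECT name, strand, cdsStart, cdsEnd FROM knownGene WHERE chrom = "chr1" AND txStart <= 1158631 AND txEnd >= 1158631;'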

Comment from Ryan D:

This is really good, Pierre. It appears that it takes input like the kind I could provide (e.g. chr1:1158631, rs11689281). Now I just need someone to hook me up with the elusive tables the authors do not seem to have provided.

The paper supplementary material says: "The coordinates and predicted functional consequences of all of the LOF variants identified in the project are available on the 1000 Genomes FTP site."

Answer by Laura (Cambridge UK):

You can also use the Ensembl SNP Effect Predictor:

http://www.ensembl.org/Homo_sapiens/UserData/UploadVariations

The loss-of-function variants are all annotated as part of the standard paper data set, which can be found on the 1000 Genomes FTP site:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/

There are a lot of files to navigate, but as this represents a lot of data, it was felt this was the best way to distribute it.

As the code in a comment on another answer seems to have been a little mangled, here it is again: a quick one-liner with a bash for loop and our current.tree file, which gets you all the files based on a simple grep:

for file in `grep LOF current.tree | cut -f 1  | grep a_map_of_human_variation`; do wget "ftp://ftp.1000genomes.ebi.ac.uk/vol1/"$file; done
Answer by Neilfws (Sydney, Australia):

Regarding the sentence:

The coordinates and predicted functional consequences of all of the LOF variants identified in the project are available on the 1000 Genomes FTP site.

I suspect that these are the files named .LOF.txt.gz (or just .LOF.txt). They seem to be scattered through various directories at the FTP site.

For example, this FTP directory contains "README.2010_07.lof_variants", with LOF files in the exon/, low_coverage/ and trio/ sub-directories (and in fact more sub-directories therein, e.g. exon/snps, exon/indels). The directory for data from the paper seems to have a similar structure.

You may just have to navigate through the FTP site, taking notes and reading README files until you find what you want. Or I guess, email the authors and ask for a direct link to the LOF data.
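Alternatively, instead of navigating by hand, the site-wide current.tree index (the same file used in Laura's answer above) can be fetched and searched; a minimal sketch:

# fetch the index of the whole FTP site, then list paths that mention LOF
wget -q ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree
grep LOF current.tree | cut -f 1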

Answer by Deniz (Cambridge):

A good way to do it would be to parse the VCF files provided on the 1000 Genomes website and use the fields inside the files to filter according to your needs.
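A minimal sketch of such a filter (the NONSYN tag grepped for below is an assumption; check the ##INFO header lines of your particular file for the annotation keys it actually carries):

# print chrom, pos, id and INFO for records whose INFO field mentions the tag of interest
zcat variants.vcf.gz | awk -F'\t' '!/^#/ && $8 ~ /NONSYN/ {print $1, $2, $3, $8}'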

Answer by Larry_Parnell (Boston, MA USA):

You wrote, "In the 1000G paper it says: “In total, we found 68,300 non-synonymous SNPs, 34,161 of which were novel.” Presumably a table must exist to get such a number which has genome positions, rs#s, and predictions for which variants are functional. If anyone could find that table or that data on the 1000G website, that would solve this problem."

I would write to Daniel MacArthur and ask him for that table. He is the one who worked on the loss-of-function variants identified in the 1000G data and presented this at ASHG last week. He is on Twitter as @dgmacarthur.

Comment from Laura:
No need to write to Daniel; the data is all on the FTP site.

If you look at the current.tree file at the root of the FTP site (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/current.tree), you can quickly find the FTP path of all the appropriate files:

for file in `grep LOF current.tree | cut -f 1 | grep a_map_of_human_variation`; do echo "ftp://ftp.1000genomes.ebi.ac.uk/vol1/"$file; done

I suspect you could add ncftpget or something similar to the end of this and fetch them all, too.
