Determining Which New SNPs In 1000G Data Result In Coding Changes
Ryan D ★ 3.4k · 13.5 years ago

As you all know, the 1000G sites are a sheer freaking delight to navigate.

I understand from this announcement, http://www.1000genomes.org/page.php?page=announcements, that the exonic sequences are available, but they cover only 8,140 exons in 906 "randomly selected" genes: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/technical/working/20100511_snp_annotation/pilot3/annotated.CHB.final.P3.

The paper just released has info on the 1000G pilot: http://www.1000genomes.org/bcms/1000_genomes/Documents/nature09534.pdf

In addition to mining the variants in LD with those significant in our GWAS using haploxt (http://www.sph.umich.edu/csg/abecasis/GOLD/docs/haploxt.html), what is the best way to determine whether any variants in LD map to exons and cause coding changes? Is there a tool like PolyPhen through which they could be run in batches?

genome linkage alignment protein prediction

In the 1000G paper it says: “In total, we found 68,300 non-synonymous SNPs, 34,161 of which were novel.”

Presumably a table with genome positions, rs numbers, and predictions for which variants are functional must exist in order to arrive at such a number.

If anyone could find that table or that data on the 1000G website, that would solve this problem.

13.5 years ago

There are also snpEff (http://snpeff.sourceforge.net/) and ANNOVAR (http://www.openbioinformatics.org/annovar/), which can help with SNP effect prediction. They're both along the same lines as Pierre's consequence program; I think they just output more information.

Both can also incorporate external GFF annotations into their results.
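Neither tool's exact invocation is shown in this thread, but once a VCF has been annotated, pulling out the coding changes is a one-liner. A minimal sketch against a mock file, assuming snpEff's classic `EFF=` INFO key with effect names like NON_SYNONYMOUS_CODING (an assumption based on older snpEff releases; the annotation layout has changed between versions, so check your own output):

```shell
# Mock snpEff-style output; the EFF= key and effect names are assumptions
# based on older snpEff releases -- verify against your annotated file.
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > annotated.vcf
printf '1\t1158631\trs11689281\tA\tG\t.\tPASS\tEFF=NON_SYNONYMOUS_CODING(...)\n' >> annotated.vcf
printf '1\t2200000\trs1111111\tC\tT\t.\tPASS\tEFF=SYNONYMOUS_CODING(...)\n' >> annotated.vcf

# Keep only the records predicted to change the protein.
grep NON_SYNONYMOUS_CODING annotated.vcf
```

The same grep-based approach should work on ANNOVAR's exonic variant output if you match on "nonsynonymous" instead.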


Interestingly, the first tool (snpEff) has also been integrated into Galaxy.

13.5 years ago

PolyPhen has a 'batch' mode: http://genetics.bwh.harvard.edu/pph2/bgi.shtml

There is an API developed by/for Ensembl:

Bioinformatics. 2010 Aug 15;26(16):2069-70. Epub 2010 Jun 18. Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor.

On my side, last year I wrote a simple program using the UCSC 'knownGene' table to predict the consequences of mutations. An early version was described here.
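For the batch mode, each variant has to be reformatted into PolyPhen-2's genomic input style, roughly one `chr1:1158631 A/G` per line (that format is an assumption from memory; verify it on the submission page). A quick awk sketch converting tab-separated chr/pos/ref/alt records:

```shell
# Hypothetical tab-separated input: chromosome, position, ref allele, alt allele.
printf '1\t1158631\tA\tG\n2\t500000\tC\tT\n' > sites.txt

# Emit one variant per line in the assumed chrN:pos ref/alt batch format.
awk -F'\t' '{print "chr"$1":"$2" "$3"/"$4}' sites.txt
# -> chr1:1158631 A/G
# -> chr2:500000 C/T
```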


This is really good, Pierre. It appears that it takes input like the kind I could provide (e.g. chr1:1158631, rs11689281). Now I just need someone to hook me up with the elusive tables the authors do not seem to have provided.

The paper supplementary material says: "The coordinates and predicted functional consequences of all of the LOF variants identified in the project are available on the 1000 Genomes FTP site."

Laura ★ 1.8k · 13.5 years ago

You can also use the Ensembl SNP Effect Predictor.

The loss-of-function variants are all annotated as part of the standard paper data set, which can be found on the 1000genomes FTP site.

There are a lot of files to navigate, but as this represents a lot of data, it was felt this was the best way to distribute it.

As the code in a comment on another answer seems to have been a little mangled, here it is again: a quick one-liner with a bash for loop and our current.tree file, which gets you all the files based on a simple grep:

for file in `grep LOF current.tree | cut -f 1  | grep a_map_of_human_variation`; do wget "ftp://ftp.1000genomes.ebi.ac.uk/vol1/"$file; done
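To see what that loop is doing without hitting the network, the URL-construction step can be exercised against a mock current.tree (the tab-separated, path-in-first-column layout used here is an assumption about the real file):

```shell
# Build a small stand-in for current.tree.
printf 'pilot_data/other/file.txt\tfile\n' > current.tree
printf 'release/2010_07/a_map_of_human_variation/exon.LOF.txt.gz\tfile\n' >> current.tree
printf 'release/2010_07/a_map_of_human_variation/low_coverage.LOF.txt.gz\tfile\n' >> current.tree

# Same pipeline as the answer above, echoing URLs instead of running wget.
for file in $(grep LOF current.tree | cut -f 1 | grep a_map_of_human_variation); do
  echo "ftp://ftp.1000genomes.ebi.ac.uk/vol1/$file"
done
```

Once the printed URLs look right against the real current.tree, swap echo back for wget.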
Neilfws 49k · 13.5 years ago

Regarding the sentence:

The coordinates and predicted functional consequences of all of the LOF variants identified in the project are available on the 1000 Genomes FTP site.

I suspect that these are the files named .LOF.txt.gz (or just .LOF.txt). They seem to be scattered through various directories at the FTP site.

For example, this FTP directory contains "README.2010_07.lof_variants", with LOF files in the exon/, low_coverage/ and trio/ sub-directories (and in fact more sub-directories therein, e.g. exon/snps, exon/indels). The directory for data from the paper seems to have a similar structure.

You may just have to navigate through the FTP site, taking notes and reading README files until you find what you want. Or I guess, email the authors and ask for a direct link to the LOF data.

Deniz ▴ 140 · 13.5 years ago

A good way to do it would be to parse the VCF files provided on the 1000 Genomes website, using the fields inside the files to filter according to your needs.
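As a concrete sketch of that approach: the release VCFs carry key=value pairs in the INFO column (AF, the alternate allele frequency, is one such key), and awk can split them out for filtering. The rows and values below are invented for illustration:

```shell
# Mock rows shaped like 1000G release VCF records; AF values are invented.
printf '##fileformat=VCFv4.0\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > variants.vcf
printf '1\t1158631\trs11689281\tA\tG\t.\tPASS\tAF=0.12\n' >> variants.vcf
printf '1\t2200000\trs1111111\tC\tT\t.\tPASS\tAF=0.01\n' >> variants.vcf
printf '2\t500000\trs2222222\tG\tA\t.\tPASS\tAF=0.30\n' >> variants.vcf

# Split INFO on ';', pull out AF, and keep the commoner variants (AF > 0.05),
# printing chr:pos plus the rsID.
awk -F'\t' '!/^#/ {
  af = 0
  n = split($8, kv, ";")
  for (i = 1; i <= n; i++)
    if (kv[i] ~ /^AF=/) af = substr(kv[i], 4) + 0
  if (af > 0.05) print $1 ":" $2 "\t" $3
}' variants.vcf
```

The same split-and-test pattern extends to whatever INFO keys your chosen files actually carry; check the README shipped alongside each release.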

13.5 years ago

You wrote, "In the 1000G paper it says: “In total, we found 68,300 non-synonymous SNPs, 34,161 of which were novel.” Presumably a table must exist to get such a number which has genome positions, rs#s, and predictions for which variants are functional. If anyone could find that table or that data on the 1000G website, that would solve this problem."

I would write to Daniel MacArthur and ask him for that table. He is the one who worked on the loss of function variants identified in the 1000G data and presented this at ASHG last week. He is on Twitter as dgmacarthur.


No need to write to Daniel; the data is all on the FTP site.

If you look at the current.tree file at the root of the FTP site, you can quickly find the FTP paths of all the appropriate files:

for file in `grep LOF current.tree | cut -f 1  | grep a_map_of_human_variation`; do echo "ftp://ftp.1000genomes.ebi.ac.uk/vol1/"$file; done

I suspect you could add ncftpget or something similar to the end of this and fetch them all too.
