Question

How To Find The List Of All Coding SNPs / Exomic Variants In A Given Gene?

13.2 years ago

Khader Shameer 18k

I have a list of genes (for example: SPTA1, THRB, PDGFRA, KIT, LRRC16A, SCGN). I am looking for a resource or method to list all coding SNPs (synonymous, non-synonymous, nonsense, missense, frameshift, or any other relevant class of SNP in a coding region) in the exome regions of these genes.

EDIT: I am not looking for a way to filter dbSNP/Ensembl variants in a gene using func/consequencetype. The idea is to get the list of SNPs that can have a coding-related func/consequencetype, and then map their locations to available exome data (CDS) to verify whether each location is actually in the exome region.

variant snp annotation • 7.5k views

Updated 13.2 years ago by Ryan D ★ 3.4k • written 13.2 years ago by Khader Shameer 18k

The answers below give solutions using UCSC. However, UCSC uses dbSNP131 only, while the 1000 Genomes Project has added many more SNPs to dbSNP132. If it were me, I would try to use dbSNP132.

13.2 years ago by lh3 33k

Thanks a lot @lh3.

12.9 years ago by Khader Shameer 18k

Ram · Answer 1 · 2011-02-16

13.2 years ago

Pierre Lindenbaum 161k

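Try the UCSC MySQL server. Save the following as query.sql:

-- SNPs with a coding function class that fall within the transcript span of each gene.
-- Note: txStart..txEnd also covers introns and UTRs; only the S.func filter restricts to coding classes.
select distinct
    S.name,
    S.chromStart,
    S.chromEnd,
    S.func,
    X.geneSymbol
from
    snp130 as S,
    knownGene as K,
    kgXref as X
where
    X.kgId = K.name and
    K.chrom = S.chrom and
    K.txStart <= S.chromStart and
    S.chromEnd <= K.txEnd and
    X.geneSymbol in ('SPTA1', 'THRB', 'PDGFRA', 'KIT', 'LRRC16A', 'SCGN') and
    S.func in ('coding-synon', 'nonsense', 'missense', 'frameshift');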

Run it with:

mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg18 < query.sql

Updated 4.6 years ago by Ram 43k • written 13.2 years ago by Pierre Lindenbaum 161k

AFAIK, there is no table that stores only the exon data. You'll have to extract the exons from knownGene.exonStarts and knownGene.exonEnds.

Updated 4.6 years ago by Ram 43k • written 13.2 years ago by Pierre Lindenbaum 161k
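
Editor's sketch (not part of the original reply): the suggestion to extract exons from knownGene.exonStarts and knownGene.exonEnds can be combined with knownGene.cdsStart and knownGene.cdsEnd to check whether a SNP position really falls in coding sequence. The Python below is a minimal illustration under stated assumptions: the function names are hypothetical, coordinates are 0-based and half-open as in UCSC tables, and the exon fields have already been decoded to text (the MySQL columns are blobs holding comma-separated lists).

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 0-based, half-open [start, end)


def parse_coords(field: str) -> List[int]:
    """Parse a comma-separated coordinate list such as '100,300,500,'."""
    return [int(x) for x in field.split(",") if x.strip()]


def coding_exons(exon_starts: str, exon_ends: str,
                 cds_start: int, cds_end: int) -> List[Interval]:
    """Clip every exon to [cds_start, cds_end) and keep the non-empty pieces."""
    if cds_start >= cds_end:
        # Assumption: an empty CDS range marks a non-coding transcript.
        return []
    pieces = []
    for start, end in zip(parse_coords(exon_starts), parse_coords(exon_ends)):
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo < hi:
            pieces.append((lo, hi))
    return pieces


def snp_in_cds(chrom_start: int, chrom_end: int, cds: List[Interval]) -> bool:
    """True if the SNP interval overlaps at least one coding piece."""
    # Zero-length records (chromStart == chromEnd) are treated as one base here.
    end = max(chrom_end, chrom_start + 1)
    return any(s < end and chrom_start < e for s, e in cds)
```

With such helpers, each row returned by the SNP query could be tested against the coding pieces of every transcript of the gene. SNPs in UTRs or introns return False even though they lie between txStart and txEnd.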

Thanks Pierre, I tried something like this by reverse engineering your earlier answers. My concern is how to verify that these positions are actually in coding regions. Do you know of any table that stores the coding (for example NCBI CDS) or exomic regions of the human genome, so that I can map these positions?

13.2 years ago by Khader Shameer 18k

Thanks Pierre, I will try that option.

13.2 years ago by Khader Shameer 18k

-1. This query is very inefficient. If everyone queries the database like this, it will be a disaster.

13.2 years ago by lh3 33k

-1. This query is very inefficient. We should stop propagating inefficient queries, for the benefit of other UCSC MySQL users.

13.2 years ago by lh3 33k

-1. Another inefficient UCSC query. Note that Start and End are not indexed. I do not know how UCSC executes such a query, but I would use several separate SQL queries instead of a table join.

13.2 years ago by lh3 33k
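
Editor's sketch (not part of the original comment): one way to read the "several SQLs instead of table joining" suggestion is to run three single-table queries in sequence: gene symbol to kgId in kgXref, kgId to coordinates in knownGene, then one range query per transcript against the SNP table. The Python below only builds the query strings and parameter tuples. The helper names are hypothetical, and the `%s` placeholders assume a DB-API driver that uses the "format" parameter style. Whether this is actually faster on the UCSC server is not something the thread establishes.

```python
import re
from typing import Sequence, Tuple

Query = Tuple[str, tuple]  # (SQL text with %s placeholders, parameters)

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _ident(name: str) -> str:
    """Allow only plain identifiers, since table/column names cannot be bound."""
    if not _IDENT.match(name):
        raise ValueError(f"unsafe SQL identifier: {name!r}")
    return name


def lookup_query(table: str, columns: Sequence[str],
                 key: str, values: Sequence[str]) -> Query:
    """Single-table lookup: SELECT columns FROM table WHERE key IN (...)."""
    if not values:
        raise ValueError("values must not be empty")
    cols = ", ".join(_ident(c) for c in columns)
    marks = ", ".join(["%s"] * len(values))
    sql = f"SELECT {cols} FROM {_ident(table)} WHERE {_ident(key)} IN ({marks})"
    return sql, tuple(values)


def snp_range_query(table: str, chrom: str, start: int, end: int) -> Query:
    """Single-table range query for SNPs inside [start, end] on one chromosome."""
    sql = (f"SELECT name, chromStart, chromEnd FROM {_ident(table)} "
           "WHERE chrom = %s AND chromStart >= %s AND chromEnd <= %s")
    return sql, (chrom, start, end)


# Intended order of use (each result feeds the next step):
#   1. lookup_query("kgXref", ["kgId", "geneSymbol"], "geneSymbol", genes)
#   2. lookup_query("knownGene", ["name", "chrom", "txStart", "txEnd"], "name", kg_ids)
#   3. snp_range_query("snp130", chrom, tx_start, tx_end) for each transcript
```

Each returned pair can be passed to a cursor as `cursor.execute(sql, params)`, so gene symbols and coordinates are bound as parameters rather than pasted into the SQL text.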

Ram · Answer 2 · 2011-02-16

You can do this with Ensembl BioMart (http://www.ensembl.org/biomart/martview); the Ensembl API is another option. Here is the XML representation of the query:

<Dataset name = "hsapiens_snp" interface = "default" >
    <Filter name = "consequence_type" value = "3PRIME_UTR,3PRIME_UTR&amp;NMD_TRANSCRIPT,5PRIME_UTR,5PRIME_UTR&amp;NMD_TRANSCRIPT,COMPLEX_INDEL,COMPLEX_INDEL&amp;NMD_TRANSCRIPT,COMPLEX_INDEL&amp;SPLICE_SITE,ESSENTIAL_SPLICE_SITE&amp;INTRONIC,ESSENTIAL_SPLICE_SITE&amp;INTRONIC&amp;NMD_TRANSCRIPT,FRAMESHIFT_CODING,FRAMESHIFT_CODING&amp;NMD_TRANSCRIPT,FRAMESHIFT_CODING&amp;SPLICE_SITE,FRAMESHIFT_CODING&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,HGMD_MUTATION,INTRONIC&amp;NMD_TRANSCRIPT,NON_SYNONYMOUS_CODING,NON_SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,NON_SYNONYMOUS_CODING&amp;SPLICE_SITE,NON_SYNONYMOUS_CODING&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,PARTIAL_CODON,SPLICE_SITE&amp;3PRIME_UTR,SPLICE_SITE&amp;3PRIME_UTR&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;5PRIME_UTR,SPLICE_SITE&amp;5PRIME_UTR&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;INTRONIC,SPLICE_SITE&amp;INTRONIC&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;SYNONYMOUS_CODING,SPLICE_SITE&amp;SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,STOP_GAINED,STOP_GAINED&amp;FRAMESHIFT_CODING,STOP_GAINED&amp;FRAMESHIFT_CODING&amp;NMD_TRANSCRIPT,STOP_GAINED&amp;NMD_TRANSCRIPT,STOP_GAINED&amp;SPLICE_SITE,STOP_GAINED&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,STOP_LOST,STOP_LOST&amp;NMD_TRANSCRIPT,STOP_LOST&amp;SPLICE_SITE,STOP_LOST&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,SYNONYMOUS_CODING,SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,WITHIN_MATURE_miRNA,WITHIN_NON_CODING_GENE"/>
    <Attribute name = "refsnp_id" />
    <Attribute name = "chr_name" />
    <Attribute name = "chrom_start" />
    <Attribute name = "consequence_type_tv" />
    <Attribute name = "ensembl_transcript_stable_id" />
</Dataset>

<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
    <Filter name = "hgnc_symbol" value = "SPTA1,THRB,PDGFRA,KIT,LRRC16A,SCGN"/>
    <Attribute name = "hgnc_symbol" />
</Dataset>

Ram · Answer 3 · 2011-02-16

13.2 years ago

Ryan D ★ 3.4k

I think Pierre is on the right track above. I would do it his way, but pull from the table snp131CodingDbSnp instead.

Database: hg19    Primary Table: snp131CodingDbSnp    Row Count: 443,544
Format description: Annotations of the effects of SNPs on translated protein sequence.

Give it a try. And thanks again, Pierre, for another great answer.

Explicitly:

mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg19 < ~/scripts/query2.sql

Contents of query2.sql:

-- Same join as the snp130 query, but against snp131CodingDbSnp; no func filter is applied.
select distinct
    S.name,
    S.chromStart,
    S.chromEnd,
    S.funcCodes,
    X.geneSymbol
from
    snp131CodingDbSnp as S,
    knownGene as K,
    kgXref as X
where
    X.kgId = K.name and
    K.chrom = S.chrom and
    K.txStart <= S.chromStart and
    S.chromEnd <= K.txEnd and
    X.geneSymbol in ('SPTA1', 'THRB', 'PDGFRA', 'KIT', 'LRRC16A', 'SCGN');

Updated 4.6 years ago by Ram 43k • written 13.2 years ago by Ryan D ★ 3.4k

Incidentally, this table gives about the same results as Pierre's method. That concordance is probably a good thing.