How To Find The List Of All Coding SNPs / Exomic Variants In A Given Gene?
13.2 years ago

I have a list of genes (for example: SPTA1, THRB, PDGFRA, KIT, LRRC16A, SCGN). I am looking for a resource or method to list all coding SNPs (synonymous, non-synonymous, nonsense, missense, frameshift, or any other relevant class of SNP in a coding region) in the exonic regions of these genes.

EDIT: I am not looking for a way to filter dbSNP/Ensembl variants in a gene by func/consequence type. The idea is to get the list of SNPs that can have a coding-related func/consequence type, and then map their locations to available exome data (CDS) to verify whether each location is in the exome region or not.

variant snp annotation • 7.5k views

The answers below give solutions using UCSC. However, UCSC only has dbSNP131, while the 1000 Genomes Project has added many more variants to dbSNP132. If it were me, I would try to use dbSNP132.


Thanks a lot @lh3.

13.2 years ago

Try to use the UCSC mysql server:

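-- Coding SNPs (snp130) that fall within the transcript span of each gene.
select distinct
    S.name,
    S.chromStart,
    S.chromEnd,
    S.func,
    X.geneSymbol
from
    snp130 as S,
    knownGene as K,
    kgXref as X
where
    X.kgId = K.name and              -- link each transcript to its gene symbol
    K.chrom = S.chrom and            -- same chromosome
    K.txStart <= S.chromStart and    -- SNP lies inside the transcript span
    S.chromEnd <= K.txEnd and
    X.geneSymbol in ('SPTA1', 'THRB', 'PDGFRA', 'KIT', 'LRRC16A', 'SCGN') and
    S.func in ('coding-synon', 'nonsense', 'missense', 'frameshift');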

Save the query as query.sql and run it:

mysql  -h  genome-mysql.cse.ucsc.edu -A -u genome -D hg18 < query.sql


AFAIK, there is no table only storing the exons data. You'll have to extract them from knownGene.exonStarts and knownGene.exonEnds
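
To make the suggestion above concrete, here is a minimal Python sketch of pulling coding intervals out of one knownGene row and testing a SNP against them. It assumes the usual UCSC conventions: exonStarts and exonEnds are comma-separated lists with a trailing comma, coordinates are 0-based and half-open, and cdsStart/cdsEnd mark the coding span. The function names and the example numbers are made up for illustration.

```python
def parse_blob(blob):
    """Turn a UCSC comma-separated blob such as '100,300,' into [100, 300]."""
    return [int(x) for x in blob.split(",") if x]


def coding_exons(exon_starts, exon_ends, cds_start, cds_end):
    """Clip each exon to the CDS; return 0-based half-open coding intervals."""
    blocks = []
    for start, end in zip(parse_blob(exon_starts), parse_blob(exon_ends)):
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:  # drop exons (or exon parts) that are entirely UTR
            blocks.append((s, e))
    return blocks


def overlaps_coding(snp_start, snp_end, blocks):
    """True if the half-open SNP interval overlaps any coding block."""
    return any(snp_start < e and snp_end > s for s, e in blocks)


# Invented transcript: three exons, CDS starting and ending mid-exon.
blocks = coding_exons("100,300,500,", "200,400,600,", 150, 550)
print(blocks)
print(overlaps_coding(310, 311, blocks))  # SNP inside the second exon
print(overlaps_coding(120, 121, blocks))  # SNP in the 5' UTR part of exon 1
```

A gene with several transcripts would need this run once per knownGene row, since each row carries its own exon list.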


Thanks Pierre, I tried something like this by reverse engineering your earlier answers. My concern is how to verify that these are actually in coding regions. Do you know of any table that stores the coding (for example NCBI CDS) or exomic regions of the human genome, so I can map these positions?


Thanks Pierre, I will try that option.


-1. This query is very inefficient. If everyone queries the database like this, it will be a disaster.


-1. This query is very inefficient. We should stop spreading bad queries, for the benefit of other UCSC MySQL users.


-1. Another inefficient UCSC query. Note that Start and End are not indexed. I do not know how UCSC executes such a query, but I would use several separate SQL statements instead of a table join.
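
One possible reading of "several SQLs" is sketched below in Python: look up the kgId values first, then the transcript coordinates, then run one SNP query per region with the chromosome fixed. The helper names are hypothetical, the %s placeholders assume a DB-API driver with the "format" paramstyle (e.g. MySQLdb or PyMySQL), and whether this is actually faster on the UCSC server has not been measured here.

```python
import re


def _placeholders(n):
    """Return n comma-separated %s placeholders for a DB-API query."""
    return ", ".join(["%s"] * n)


def gene_id_query(symbols):
    """Step 1: gene symbols -> kgId values (kgXref only, no join)."""
    symbols = tuple(symbols)
    sql = ("SELECT kgId, geneSymbol FROM kgXref "
           f"WHERE geneSymbol IN ({_placeholders(len(symbols))})")
    return sql, symbols


def transcript_query(kg_ids):
    """Step 2: kgId values -> transcript coordinates (knownGene only)."""
    kg_ids = tuple(kg_ids)
    sql = ("SELECT name, chrom, txStart, txEnd FROM knownGene "
           f"WHERE name IN ({_placeholders(len(kg_ids))})")
    return sql, kg_ids


def snp_query(chrom, start, end, table="snp130"):
    """Step 3: SNPs inside one region of one chromosome."""
    # Table names cannot be bound as parameters, so validate before formatting.
    if not re.fullmatch(r"[A-Za-z0-9_]+", table):
        raise ValueError(f"suspicious table name: {table!r}")
    sql = (f"SELECT name, chromStart, chromEnd, func FROM {table} "
           "WHERE chrom = %s AND chromStart >= %s AND chromEnd <= %s")
    return sql, (chrom, start, end)


sql, params = gene_id_query(["KIT", "SCGN"])
print(sql, params)
```

Each (sql, params) pair can be handed to cursor.execute(sql, params); the results of one step feed the next in client code rather than in a join.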

13.2 years ago

You can do this with Ensembl BioMart at http://www.ensembl.org/biomart/martview (the Ensembl API is another option). Here is the XML representation of the query:

<Dataset name = "hsapiens_snp" interface = "default" >
    <Filter name = "consequence_type" value = "3PRIME_UTR,3PRIME_UTR&amp;NMD_TRANSCRIPT,5PRIME_UTR,5PRIME_UTR&amp;NMD_TRANSCRIPT,COMPLEX_INDEL,COMPLEX_INDEL&amp;NMD_TRANSCRIPT,COMPLEX_INDEL&amp;SPLICE_SITE,ESSENTIAL_SPLICE_SITE&amp;INTRONIC,ESSENTIAL_SPLICE_SITE&amp;INTRONIC&amp;NMD_TRANSCRIPT,FRAMESHIFT_CODING,FRAMESHIFT_CODING&amp;NMD_TRANSCRIPT,FRAMESHIFT_CODING&amp;SPLICE_SITE,FRAMESHIFT_CODING&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,HGMD_MUTATION,INTRONIC&amp;NMD_TRANSCRIPT,NON_SYNONYMOUS_CODING,NON_SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,NON_SYNONYMOUS_CODING&amp;SPLICE_SITE,NON_SYNONYMOUS_CODING&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,PARTIAL_CODON,SPLICE_SITE&amp;3PRIME_UTR,SPLICE_SITE&amp;3PRIME_UTR&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;5PRIME_UTR,SPLICE_SITE&amp;5PRIME_UTR&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;INTRONIC,SPLICE_SITE&amp;INTRONIC&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;SYNONYMOUS_CODING,SPLICE_SITE&amp;SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,STOP_GAINED,STOP_GAINED&amp;FRAMESHIFT_CODING,STOP_GAINED&amp;FRAMESHIFT_CODING&amp;NMD_TRANSCRIPT,STOP_GAINED&amp;NMD_TRANSCRIPT,STOP_GAINED&amp;SPLICE_SITE,STOP_GAINED&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,STOP_LOST,STOP_LOST&amp;NMD_TRANSCRIPT,STOP_LOST&amp;SPLICE_SITE,STOP_LOST&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,SYNONYMOUS_CODING,SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,WITHIN_MATURE_miRNA,WITHIN_NON_CODING_GENE"/>
    <Attribute name = "refsnp_id" />
    <Attribute name = "chr_name" />
    <Attribute name = "chrom_start" />
    <Attribute name = "consequence_type_tv" />
    <Attribute name = "ensembl_transcript_stable_id" />
</Dataset>

<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
    <Filter name = "hgnc_symbol" value = "SPTA1,THRB,PDGFRA,KIT,LRRC16A,SCGN"/>
    <Attribute name = "hgnc_symbol" />
</Dataset>
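
For scripted use, Dataset blocks like the ones above are normally wrapped in a Query element and sent to the BioMart martservice endpoint. The Python sketch below only builds the request URL and does not contact the server. The endpoint URL, the Query attributes (formatter, uniqueRows and so on) and the function names are assumptions based on the usual BioMart XML export, not something stated in this answer.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint: the martservice URL that sits next to martview.
MART_SERVICE = "http://www.ensembl.org/biomart/martservice"


def wrap_query(dataset_xml):
    """Wrap one or more <Dataset> blocks in a BioMart <Query> element."""
    header = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE Query>\n'
        '<Query virtualSchemaName="default" formatter="TSV" header="0" '
        'uniqueRows="1" count="" datasetConfigVersion="0.6">\n'
    )
    return header + dataset_xml.strip() + "\n</Query>"


def query_url(dataset_xml):
    """Return a GET URL carrying the wrapped query in the 'query' parameter."""
    return MART_SERVICE + "?" + urlencode({"query": wrap_query(dataset_xml)})


# Small stand-in for the Dataset blocks shown above.
example = (
    '<Dataset name = "hsapiens_gene_ensembl" interface = "default" >\n'
    '    <Filter name = "hgnc_symbol" value = "SPTA1,THRB,PDGFRA,KIT,LRRC16A,SCGN"/>\n'
    '    <Attribute name = "hgnc_symbol" />\n'
    '</Dataset>'
)
print(query_url(example))
```

The long consequence_type filter makes the real query large, so a POST with the same query parameter may be more practical than a GET.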

You may then want to build a query to retrieve the chromosome positions of all exons in your genes, and then query for SNPs in those regions. Modifying the code above to do so should not be complicated.
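
A rough Python sketch of the first half of that idea: collapse the exons of all transcripts into non-overlapping regions per chromosome, so each region only has to be queried once. The input format ((chrom, start, end) tuples, 0-based half-open) and the function name are assumptions for illustration, and the coordinates are invented.

```python
from collections import defaultdict


def merge_exons(exons):
    """Merge overlapping or touching exons; return {chrom: [(start, end), ...]}."""
    by_chrom = defaultdict(list)
    for chrom, start, end in exons:
        by_chrom[chrom].append((start, end))

    merged = {}
    for chrom, intervals in by_chrom.items():
        regions = []
        for start, end in sorted(intervals):
            if regions and start <= regions[-1][1]:
                # Overlaps (or abuts) the previous region: extend it.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
        merged[chrom] = regions
    return merged


# Invented exons from two transcripts of one gene plus one exon elsewhere.
exons = [
    ("chr4", 100, 200),
    ("chr4", 150, 250),
    ("chr4", 300, 400),
    ("chr1", 5, 10),
]
print(merge_exons(exons))
```

Each merged region can then be used as the chromosome, start and end of a SNP lookup.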


Thanks William. I am not looking for a way to filter SNPs based on consequence type. If we filter on a pre-defined consequence type, there is a chance that some of these SNPs fall in regulatory or intronic regions. I am trying to find a way to filter SNPs and then check whether they are in the exome region or not. Sorry that this was not clear in my question.


Yes Jorge, but my gut feeling was that the coding SNPs must already exist in another table/db. And they are available as a separate table, snp131CodingDbSnp. Please see Ryan's answer.

13.2 years ago
Ryan D ★ 3.4k

I think Pierre is on the right track above. I would do it his way, but just pull from the table: snp131CodingDbSnp.

Database: hg19    Primary Table: snp131CodingDbSnp    Row Count: 443,544
Format description: Annotations of the effects of SNPs on translated protein sequence.

Give it a try. And thanks again, Pierre, for another great answer.

Explicitly:

mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg19 < ~/scripts/query2.sql

query2.sql below

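-- Same join as Pierre's query, but reading from the coding-only SNP table.
select distinct
    S.name,
    S.chromStart,
    S.chromEnd,
    S.funcCodes,
    X.geneSymbol
from
    snp131CodingDbSnp as S,
    knownGene as K,
    kgXref as X
where
    X.kgId = K.name and              -- link each transcript to its gene symbol
    K.chrom = S.chrom and            -- same chromosome
    K.txStart <= S.chromStart and    -- SNP lies inside the transcript span
    S.chromEnd <= K.txEnd and
    X.geneSymbol in ('SPTA1', 'THRB', 'PDGFRA', 'KIT', 'LRRC16A', 'SCGN');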

Incidentally, this table gives about the same results as Pierre's method. That concordance is probably a good thing.


Thanks a lot Ryan. snp131CodingDbSnp is the one I was looking for.
