Question: How To Find The List Of All Coding SNPs / Exomic Variants In A Given Gene?
Khader Shameer (18k), Manhattan, NY, wrote 9.4 years ago:

I have a list of genes (for example: SPTA1, THRB, PDGFRA, KIT, LRRC16A, SCGN). I am looking for a resource or a way to find a list of all coding SNPs (synonymous, non-synonymous, nonsense, missense, frameshift or any other relevant class of SNPs in coding regions) in the exome region of these genes.

EDIT: I am not looking for a way to filter dbSNP/Ensembl variants in a gene using func/consequencetype. The idea is to get a list of SNPs that can have a coding-related func/consequencetype and then map their locations to available exome data (CDS) to verify whether each location is in the exome region or not.

annotation variant snp • 5.9k views
modified 9.4 years ago by Ryan D (3.3k) • written 9.4 years ago by Khader Shameer (18k)

The answers below give solutions using UCSC. However, UCSC uses dbSNP131 only, while the 1000g has put many more in dbSNP132. If it were me, I would try to use dbSNP132.

written 9.4 years ago by lh3 (32k)

Thanks a lot @lh3.

written 9.2 years ago by Khader Shameer (18k)
Pierre Lindenbaum (129k), France/Nantes/Institut du Thorax - INSERM UMR1087, wrote 9.4 years ago:

Try to use the UCSC mysql server:

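-- Coding SNPs (dbSNP 130) that lie within the transcript bounds of the listed genes
select distinct
    S.name,
    S.chromStart,
    S.chromEnd,
    S.func,
    X.geneSymbol
from
    snp130 as S,
    knownGene as K,
    kgXref as X
where
    X.kgId = K.name and            -- link each transcript to its gene symbol
    K.chrom = S.chrom and
    K.txStart <= S.chromStart and  -- SNP lies within the transcript, introns included
    S.chromEnd <= K.txEnd and
    X.geneSymbol in ('SPTA1', 'THRB', 'PDGFRA', 'KIT', 'LRRC16A', 'SCGN') and
    S.func in ('coding-synon', 'nonsense', 'missense', 'frameshift');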

Result:

mysql  -h  genome-mysql.cse.ucsc.edu -A -u genome -D hg18 < query.sql

modified 10 months ago by RamRS (27k) • written 9.4 years ago by Pierre Lindenbaum (129k)

AFAIK, there is no table that stores only the exon data. You'll have to extract the exons from knownGene.exonStarts and knownGene.exonEnds.

modified 10 months ago by RamRS (27k) • written 9.4 years ago by Pierre Lindenbaum (129k)
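
A minimal Python sketch of the extraction Pierre describes, assuming the usual knownGene layout: exonStarts and exonEnds are comma-separated lists with a trailing comma, and all coordinates (including cdsStart and cdsEnd) are 0-based and half-open. The function names and the transcript values are hypothetical, chosen only for illustration.

```python
def parse_blob(blob):
    """Turn a comma-separated list such as '100,300,' into a list of ints."""
    return [int(x) for x in blob.strip(",").split(",") if x]


def coding_blocks(exon_starts, exon_ends, cds_start, cds_end):
    """Clip every exon to [cds_start, cds_end) and keep the non-empty pieces."""
    blocks = []
    for start, end in zip(parse_blob(exon_starts), parse_blob(exon_ends)):
        s, e = max(start, cds_start), min(end, cds_end)
        if s < e:
            blocks.append((s, e))
    return blocks


def in_coding_region(snp_start, snp_end, blocks):
    """True if the half-open SNP interval overlaps any coding block."""
    return any(snp_start < e and snp_end > s for s, e in blocks)


# Hypothetical transcript: three exons, CDS running from 150 to 520.
blocks = coding_blocks("100,300,500,", "200,400,600,", 150, 520)
print(blocks)
print(in_coding_region(350, 351, blocks))  # inside the second exon
print(in_coding_region(250, 251, blocks))  # inside the first intron
print(in_coding_region(120, 121, blocks))  # exonic, but in the 5' UTR
```

Clipping to the CDS bounds matters because the first and last exons usually contain UTR sequence, which is exonic but not coding.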

Thanks Pierre, I tried something like this by reverse engineering your earlier answers. My concern is how I can verify that these are actually coding regions. Do you know of any table that stores information about the coding (for example NCBI CDS) or exomic regions of the human genome, so that I can map these positions?

written 9.4 years ago by Khader Shameer (18k)

Thanks Pierre, I will try that option.

written 9.4 years ago by Khader Shameer (18k)

-1. This query is very inefficient. If everyone queries the database like this, it will be a disaster.

written 9.4 years ago by lh3 (32k)

-1. This query is very inefficient. We should stop populating wrong queries to benefit other UCSC MySQL users.

written 9.4 years ago by lh3 (32k)

-1. Another inefficient UCSC query. Note that Start and End are not indexed. I do not know how UCSC performs such a query, but I would use several SQL statements instead of table joins.

written 9.4 years ago by lh3 (32k)
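
One way to read lh3's suggestion is sketched below in Python: first look up the gene coordinates with a small query, then issue one SNP query per gene that restricts on chrom and on the UCSC bin column instead of joining on unindexed start/end columns. This is only a sketch under assumptions: that the SNP table carries the standard bin column and is indexed on chrom and bin (worth confirming with SHOW INDEX), and that the standard binning constants apply. The function names are hypothetical, and the first step still uses a small key-based join between kgXref and knownGene.

```python
import re

# Constants of the standard UCSC binning scheme (positions below 512 Mb).
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
BIN_FIRST_SHIFT = 17
BIN_NEXT_SHIFT = 3
NAME_RE = re.compile(r"[A-Za-z0-9_.-]+")


def overlapping_bins(start, end):
    """All bins that can hold a feature overlapping the half-open range."""
    bins = []
    start_bin = start >> BIN_FIRST_SHIFT
    end_bin = (end - 1) >> BIN_FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= BIN_NEXT_SHIFT
        end_bin >>= BIN_NEXT_SHIFT
    return bins


def check_name(value):
    """Reject anything that is not a plain identifier before it enters SQL."""
    if not NAME_RE.fullmatch(value):
        raise ValueError(f"unexpected identifier: {value!r}")
    return value


def gene_query(symbols):
    """Step 1: fetch transcript coordinates for the gene symbols."""
    quoted = ", ".join(f"'{check_name(s)}'" for s in symbols)
    return (
        "SELECT X.geneSymbol, K.chrom, K.txStart, K.txEnd "
        "FROM kgXref AS X JOIN knownGene AS K ON X.kgId = K.name "
        f"WHERE X.geneSymbol IN ({quoted});"
    )


def snp_query(table, chrom, start, end):
    """Step 2: one query per region, narrowed by chrom and bin."""
    bins = ", ".join(str(b) for b in overlapping_bins(start, end))
    return (
        f"SELECT name, chromStart, chromEnd, func FROM {check_name(table)} "
        f"WHERE chrom = '{check_name(chrom)}' AND bin IN ({bins}) "
        f"AND chromStart >= {int(start)} AND chromEnd <= {int(end)};"
    )


print(gene_query(["KIT", "SCGN"]))
# Hypothetical region; real coordinates come from the step 1 result.
print(snp_query("snp130", "chr4", 1000000, 1080000))
```

The queries are only printed here; each one could be saved to a file and sent to the server the same way as query.sql above.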
William Spooner (300), Cambridge, wrote 9.4 years ago:

You can do this with Ensembl BioMart at http://www.ensembl.org/biomart/martview (the Ensembl API is another option). Here is the XML representation of the query:

<Dataset name = "hsapiens_snp" interface = "default" >
    <Filter name = "consequence_type" value = "3PRIME_UTR,3PRIME_UTR&amp;NMD_TRANSCRIPT,5PRIME_UTR,5PRIME_UTR&amp;NMD_TRANSCRIPT,COMPLEX_INDEL,COMPLEX_INDEL&amp;NMD_TRANSCRIPT,COMPLEX_INDEL&amp;SPLICE_SITE,ESSENTIAL_SPLICE_SITE&amp;INTRONIC,ESSENTIAL_SPLICE_SITE&amp;INTRONIC&amp;NMD_TRANSCRIPT,FRAMESHIFT_CODING,FRAMESHIFT_CODING&amp;NMD_TRANSCRIPT,FRAMESHIFT_CODING&amp;SPLICE_SITE,FRAMESHIFT_CODING&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,HGMD_MUTATION,INTRONIC&amp;NMD_TRANSCRIPT,NON_SYNONYMOUS_CODING,NON_SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,NON_SYNONYMOUS_CODING&amp;SPLICE_SITE,NON_SYNONYMOUS_CODING&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,PARTIAL_CODON,SPLICE_SITE&amp;3PRIME_UTR,SPLICE_SITE&amp;3PRIME_UTR&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;5PRIME_UTR,SPLICE_SITE&amp;5PRIME_UTR&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;INTRONIC,SPLICE_SITE&amp;INTRONIC&amp;NMD_TRANSCRIPT,SPLICE_SITE&amp;SYNONYMOUS_CODING,SPLICE_SITE&amp;SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,STOP_GAINED,STOP_GAINED&amp;FRAMESHIFT_CODING,STOP_GAINED&amp;FRAMESHIFT_CODING&amp;NMD_TRANSCRIPT,STOP_GAINED&amp;NMD_TRANSCRIPT,STOP_GAINED&amp;SPLICE_SITE,STOP_GAINED&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,STOP_LOST,STOP_LOST&amp;NMD_TRANSCRIPT,STOP_LOST&amp;SPLICE_SITE,STOP_LOST&amp;SPLICE_SITE&amp;NMD_TRANSCRIPT,SYNONYMOUS_CODING,SYNONYMOUS_CODING&amp;NMD_TRANSCRIPT,WITHIN_MATURE_miRNA,WITHIN_NON_CODING_GENE"/>
    <Attribute name = "refsnp_id" />
    <Attribute name = "chr_name" />
    <Attribute name = "chrom_start" />
    <Attribute name = "consequence_type_tv" />
    <Attribute name = "ensembl_transcript_stable_id" />
</Dataset>

<Dataset name = "hsapiens_gene_ensembl" interface = "default" >
    <Filter name = "hgnc_symbol" value = "SPTA1,THRB,PDGFRA,KIT,LRRC16A,SCGN"/>
    <Attribute name = "hgnc_symbol" />
</Dataset>
modified 10 months ago by RamRS (27k) • written 9.4 years ago by William Spooner (300)
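
The two Dataset elements above are a fragment. To submit them programmatically they need a surrounding Query element. The Python sketch below builds such a document and the matching martservice URL without sending anything. The Query attributes follow the form BioMart commonly exports and should be treated as assumptions, as should the shortened consequence list (a subset of the one above). Dataset, filter and attribute names are copied from this answer and may differ in current Ensembl releases. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

MARTSERVICE = "http://www.ensembl.org/biomart/martservice"


def build_query(consequences, symbols):
    """Wrap the SNP and gene datasets in a single BioMart Query document."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "0",
        "uniqueRows": "0",
        "datasetConfigVersion": "0.6",
    })
    snp = ET.SubElement(query, "Dataset", name="hsapiens_snp", interface="default")
    ET.SubElement(snp, "Filter", name="consequence_type",
                  value=",".join(consequences))
    for attr in ("refsnp_id", "chr_name", "chrom_start",
                 "consequence_type_tv", "ensembl_transcript_stable_id"):
        ET.SubElement(snp, "Attribute", name=attr)
    gene = ET.SubElement(query, "Dataset", name="hsapiens_gene_ensembl",
                         interface="default")
    ET.SubElement(gene, "Filter", name="hgnc_symbol", value=",".join(symbols))
    ET.SubElement(gene, "Attribute", name="hgnc_symbol")
    body = ET.tostring(query, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


def build_url(xml_text):
    """URL-encode the query document for the martservice endpoint."""
    return MARTSERVICE + "?" + urlencode({"query": xml_text})


xml_text = build_query(
    ["SYNONYMOUS_CODING", "NON_SYNONYMOUS_CODING", "STOP_GAINED",
     "STOP_LOST", "FRAMESHIFT_CODING"],
    ["SPTA1", "THRB", "PDGFRA", "KIT", "LRRC16A", "SCGN"],
)
print(xml_text)
print(build_url(xml_text))
```

Building the document with ElementTree rather than string concatenation means that characters such as & in filter values are escaped automatically.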

You may then want to build a query to retrieve the chromosome positions of all exons in your genes, and then query for SNPs in those regions. Modifying the code above to do so should not be complicated.

written 9.4 years ago by Jorge Amigo (11k)
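
Once the exon positions and the SNP positions have been retrieved, the second half of Jorge's suggestion can also be done locally. The Python sketch below merges exon intervals that overlap across transcripts and then looks up each SNP by binary search. It assumes 0-based half-open intervals on a single chromosome; the function names, SNP labels and coordinates are hypothetical.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def snps_in_exons(snps, exons):
    """Return the names of SNPs whose position falls inside any exon."""
    merged = merge_intervals(exons)
    starts = [s for s, _ in merged]
    hits = []
    for name, pos in snps:
        # Rightmost merged interval that starts at or before pos.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < merged[i][1]:
            hits.append(name)
    return hits


# Hypothetical exons from two transcripts of the same gene.
exons = [(100, 200), (150, 250), (400, 500)]
snps = [("snp_a", 120), ("snp_b", 300), ("snp_c", 499), ("snp_d", 500)]
print(merge_intervals(exons))
print(snps_in_exons(snps, exons))
```

Merging first keeps the lookup correct when several transcripts of one gene share or partially share exons.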

Thanks William. I am not looking for a way to filter SNPs based on consequence type. If we filter on a pre-defined consequence type, there is a chance that some of these SNPs fall in regulatory or intronic regions. I am trying to find a way to filter SNPs and then check whether they are in the exome region or not. Sorry that this was not clear in my question.

written 9.4 years ago by Khader Shameer (18k)

Yes Jorge, but my gut feeling was that the coding SNPs must already be available in another table/db. And they are available as a separate table, snp131CodingDbSnp. Please see Ryan's answer.

written 9.4 years ago by Khader Shameer (18k)
Ryan D (3.3k), USA, wrote 9.4 years ago:

I think Pierre is on the right track above. I would do it his way, but just pull from the table: snp131CodingDbSnp.

Database: hg19    Primary Table: snp131CodingDbSnp    Row Count: 443,544
Format description: Annotations of the effects of SNPs on translated protein sequence.

Give it a try. And thanks again, Pierre, for another great answer.

Explicitly:

mysql -h genome-mysql.cse.ucsc.edu -A -u genome -D hg19 < ~/scripts/query2.sql

Contents of query2.sql:

-- Same join as Pierre's query, but reading from the coding-only SNP table
select distinct
    S.name,
    S.chromStart,
    S.chromEnd,
    S.funcCodes,
    X.geneSymbol
from
    snp131CodingDbSnp as S,
    knownGene as K,
    kgXref as X
where
    X.kgId = K.name and
    K.chrom = S.chrom and
    K.txStart <= S.chromStart and
    S.chromEnd <= K.txEnd and
    X.geneSymbol in ('SPTA1', 'THRB', 'PDGFRA', 'KIT', 'LRRC16A', 'SCGN');
modified 10 months ago by RamRS (27k) • written 9.4 years ago by Ryan D (3.3k)

Incidentally, this table gives about the same results as Pierre's method. That concordance is probably a good thing.

written 9.4 years ago by Ryan D (3.3k)
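
The concordance Ryan mentions can be quantified with a few lines of Python. Because Pierre's query runs against hg18 and Ryan's against hg19, the sketch compares SNP names rather than coordinates. The function name and the example identifiers are hypothetical placeholders, not results from either query.

```python
def concordance(ids_a, ids_b):
    """Summarise the overlap between two collections of SNP names."""
    a, b = set(ids_a), set(ids_b)
    shared, union = a & b, a | b
    return {
        "shared": len(shared),
        "only_a": len(a - b),
        "only_b": len(b - a),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Placeholder names standing in for the first column of each result set.
from_snp130 = ["snp_x", "snp_y", "snp_z"]
from_coding_table = ["snp_y", "snp_z", "snp_w"]
print(concordance(from_snp130, from_coding_table))
```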

Thanks a lot Ryan. snp131CodingDbSnp is the one I was looking for.

written 9.4 years ago by Khader Shameer (18k)