Question: How To Detect And Query Poly-Allelic SNPs?
8
9.4 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Today I had a lunchtime discussion with some colleagues about the existence of poly-allelic Single Nucleotide Polymorphisms (SNPs). Quoting Wikipedia to explain what I mean:

For example, two sequenced DNA fragments from different individuals, AAGCCTA to AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles : C and T. Almost all common SNPs have only two alleles.

My point was that some might exist, but I don't know of any SNPs that have, e.g., three alleles (such as A/G/T) or all four bases in a population. If I wish to query a database, say BioMart, UCSC, or HapMap, for these, how would I do it?

If I wanted to detect poly-allelic SNPs from sequencing data (de novo) using a reference sequence, which method or program could be used?

Edit: So far this seems to be really great stuff; I got at least two different methods of getting the query right! But what about de novo detection (sorry, of course a reference sequence can be used, so that's not really de novo)? Would Maq, for example, detect tri-allelic SNPs in short reads?

snp dbsnp allele biomart • 7.8k views
modified 11 months ago by RamRS23k • written 9.4 years ago by Michael Dondrup46k
1

That depends on your ploidy level. Cancer cell lines can be highly polyploid for certain regions/chromosomes. So it's possible that a given locus carries 3 or more alleles. That's common in cultivated plants, too. Bread wheat cultivars are typically hexaploid. So your question is relevant, but not easy.

written 9.4 years ago by Jarretinha3.3k

When you say de novo, does it mean no reference sequence/population? Or something like genotyping an intra-patient HIV population?

written 9.4 years ago by Jarretinha3.3k

I meant using a reference sequence; maybe the use of 'de novo' is not very good here, I admit. As far as I know, MAQ can, for example, call SNPs from sequencing reads. But would it also detect tri-allelic SNPs?

written 9.4 years ago by Michael Dondrup46k

To correct myself, the 'de-novo' part of my question was stupid: of course, from a single individual at most two different alleles can be discovered. See my answer below.

written 9.4 years ago by Michael Dondrup46k
6
9.4 years ago by
Yuri1.5k
Bethesda, MD
Yuri1.5k wrote:

You can query the UCSC snp130 table. The fields of interest are observed and class. You can use their public MySQL server and run this SQL query:

SELECT name,chrom,chromStart,chromEnd,observed,class,avHet
  FROM snp130
  WHERE length(observed)>3 and class="single" and avHet>0

I've added a condition to find only SNPs with average heterozygosity above zero. The query finds 4949 SNPs in hg19, of which 490 are 4-allelic.
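The length(observed)>3 test works because a bi-allelic single-base entry such as A/G is exactly three characters long, so anything longer must list more alleles. The same idea can be sketched in Python for rows that have already been fetched. This is only an illustration: it assumes the observed field is a slash-separated allele list (e.g. A/C/G), and the row names below are made up.

```python
BASES = {"A", "C", "G", "T"}


def split_alleles(observed):
    """Split a slash-separated 'observed' string into its distinct alleles."""
    return {allele.upper() for allele in observed.split("/") if allele}


def is_poly_allelic_single_base(observed):
    """True if there are three or more alleles and each one is a single base."""
    alleles = split_alleles(observed)
    return len(alleles) >= 3 and alleles <= BASES


# Made-up rows standing in for (name, observed) pairs returned by the query.
rows = [
    ("snp_1", "A/G"),
    ("snp_2", "A/C/G"),
    ("snp_3", "A/C/G/T"),
    ("snp_4", "-/A/T"),
]

for name, observed in rows:
    if is_poly_allelic_single_base(observed):
        print(name, observed, len(split_alleles(observed)))
```

Counting alleles explicitly, rather than relying on string length, also makes it easy to separate 3-allelic from 4-allelic entries and to exclude entries that contain a deletion marker.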

modified 11 months ago by RamRS23k • written 9.4 years ago by Yuri1.5k

Yeah! This works pretty well! But if you search for both single and in-del, you will find SNPs with indels. I don't know why they're not in the single class. And it is rather strange to define poly-allelism with indels.

written 9.4 years ago by Jarretinha3.3k

Yes, confirmed, I got exactly the same number! You just need to call USE hg19; beforehand. This method is equally valid.

written 9.4 years ago by Michael Dondrup46k
3
9.4 years ago by
Jarretinha3.3k
São Paulo, Brazil
Jarretinha3.3k wrote:

This situation is rather cumbersome but quite common. Small deletions must be considered, too. There's a growing body of research on triallelic sites. You can check it yourself.

The only methodology I'm aware of is TriTyper, described here. I've checked HapMart. They don't have a simple way to look for triallelic sites! There are more than a thousand of these already detected in humans.

It's not hard to get the HapMap data and search for them. But, it's strange that it's not a default filter/attribute combination possibility.

Maybe Pierre could give you a tip?

--Edit--

After a somewhat long search for genotype calling, genotype imputation and poly-allelism detection, I can conclude with a high degree of confidence: there is no such software. Yes, I was unable to find any software that explicitly treats the problem. Actual detection relies on brute force (i.e., experimental verification), for which there are a lot of well-developed methodologies. But there is evidence that a large number of SNPs are miscalled right now. So I recommend a paper and two reviews about this:

Genotyping Technologies for Genetic Research - Review

Genotype imputation - Review

Human Triallelic Sites: Evidence for a New Mutational Mechanism? - Interesting Paper

At the beginning I thought that locating and naming a SNP was a simple task. Now I can see that it's much more subtle. I checked the SNP definition. It includes indels too! So genotype calling is a rather unexplored area. Most works just deal with biallelic sites for a given base.

-- Edit --

After researching a little bit more, I've stumbled upon more software problems. All programs/algorithms/equations to estimate population genetics parameters (N_e, LD, recombination rate, etc.) from SNP data also assume biallelic loci with no indels.

-- Edit --

A paper with a partial solution.

Accurate detection and genotyping of SNPs utilizing population sequencing data

But, there is still more to it. I'm thinking about the problem.

modified 9.4 years ago • written 9.4 years ago by Jarretinha3.3k
3
9.4 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum122k wrote:

Hmm, not sure I understood your question. Do you just want a sub-list of all the non-classical (i.e. not di-allelic) SNPs? In a previous post I described a small piece of code to run a SAX parser with a JavaScript analyser. I'll use it here, but it could be any SAX implementation. The following script scans the (big) XML files from dbSNP. It records the rs## and the 'non-di-allelic' observed mutations.

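// State shared between the SAX callbacks
var rs=null;        // identifier of the current <Rs> element, e.g. "rs242"
var observed=null;  // first <Observed> string seen inside the current <Rs>
var content=null;   // text buffer, non-null only while inside <Observed>
// Matches a base/base pair. The pattern is not anchored, so strings such as
// "A/C/G" match as well and are therefore not printed.
var pat=/[ATGC]\/[ATGC]/i;
function startElement(uri,localName,name,atts)
        {
        if(localName=="Rs" && observed==null)
                {
                rs="rs"+atts.getValue("rsId");
                }
        else if(localName=="Observed" && observed==null)
                {
                // start collecting the text of the first <Observed> element
                content="";
                }
        }
function characters(s)
        {
        if(content!=null) content+=s;
        }

function endElement(uri,localName,name)
        {
        if(localName=="Rs")
                {
                // print the record when the observed string has no base/base pair;
                // the null check avoids an error if no <Observed> was seen
                if(observed!=null && !observed.match(pat))
                        {
                        println(rs+"\t"+observed);
                        }
                rs=null;
                observed=null;
                }
        else if(observed==null && localName=="Observed")
                {
                observed=content;
                }
        content=null;
        }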

invoking the SAX/js script:

java -jar saxscript.jar  -f scansnps.js  ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch1.xml.gz

result:

rs242   -/T
rs1433  -/A
rs3483  -/ATTT
rs4179  -/CAGA
rs6485  -/AAT
rs7563  -/AA
rs16354 -/AACCCA
rs16356 -/ATT
rs16357 -/TGTGAA
rs16358 -/ATAA
rs16359 -/GA
rs16360 -/TA
rs16361 -/CT
rs16439 -/CAGA
rs16626 -/CGTGAAGTCC
rs16631 -/TA
rs16644 -/AAC
rs16670 -/TTTAA
rs16679 -/CT
rs16684 -/CTCTGGC
rs16686 -/AA
rs16725 -/ACAAA
rs16729 -/TCAAT
rs20418 -/TAG
rs20421 -/CGG
rs20431 -/T
rs25569 -/AC
rs25571 -/TT
rs140711        -/AG
rs140758        -/GAAA
rs140798        -/TTGT
rs140838        -/TCGT
rs140840        -/TT
rs140841        -/TAACTA
rs140849        -/GT
rs140858        -/CT
rs140861        -/CACT
rs140862        -/TAAG
rs140864        -/GAA
rs140865        -/ACTACATGA
rs171241        -/C
rs209574        -/GC
rs239965        -/T
rs284043        -/G
rs301740        -/TCTC
rs316275        -/T
rs319679        -/GTAG
rs332778        -/T
rs350183        -/A
rs365861        -/A
rs366664        -/A
rs390580        -/CTCTCT
rs391526        -/T
rs393900        -/T
rs398196        -/A
rs410422        -/TT
rs420240        -/T
rs431835        -/AGATAT
rs435486        -/AAAAA
rs446074        -/A
rs446693        -/GA
rs480306        -/TGG
rs481407        -/TATT
rs485475        -/CAACAACAC
rs486207        -/G
(....)
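
All of the entries listed above are indels. That follows from the pattern: an unanchored base/base pattern also matches inside a tri-allelic string such as A/C/G, so only strings without any base/base pair get printed. If the goal is to keep everything that is not a plain two-base SNP, including A/C/G, an anchored pattern is one option. A small Python sketch of the difference (not the original script; the example strings are illustrative):

```python
import re

# Unanchored: finds a base/base pair anywhere in the string.
UNANCHORED = re.compile(r"[ACGT]/[ACGT]", re.IGNORECASE)
# Anchored: the whole string must be exactly one base, a slash, one base.
BIALLELIC = re.compile(r"^[ACGT]/[ACGT]$", re.IGNORECASE)


def is_simple_biallelic(observed):
    """True only for strings such as 'A/G' (two single-base alleles)."""
    return BIALLELIC.match(observed) is not None


def non_biallelic(observed_values):
    """Keep every observed string that is not a plain two-base SNP."""
    return [obs for obs in observed_values if not is_simple_biallelic(obs)]


examples = ["A/G", "A/C/G", "-/T", "-/ATTT"]
for obs in examples:
    print(obs, bool(UNANCHORED.search(obs)), is_simple_biallelic(obs))
```

With the anchored pattern, A/C/G is kept alongside the indels, while A/G is still filtered out.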

hope it helps

modified 11 months ago by RamRS23k • written 9.4 years ago by Pierre Lindenbaum122k

You can get similar results with UCSC by exchanging class="single" for class="in-del". Of course you could use class="single" OR class="in-del". But indels aren't treated like SNPs, despite some being SNPs in the HapMap sense. You could have, for instance, A/T/G/AT. A -/A/T would make sense, but I haven't found any.

written 9.4 years ago by Jarretinha3.3k

This SAX script can be modified. For example, with the very same script you can find the position of each SNP on any assembly by looking at the MapLoc tag. And this information is always available; no need to wait for UCSC to upload and digest it.

written 9.4 years ago by Pierre Lindenbaum122k

That's great! I didn't consider that point. Saw your tweet on dbSNP131, too. But somehow your scansnps.js isn't working here; it complains about the encoding. I tested the query on PubMed and that still works fine.

written 9.4 years ago by Jarretinha3.3k

ah, the jar on google code is old ( not gzip). Let me update it...

written 9.4 years ago by Pierre Lindenbaum122k

ok, I updated the jar file http://lindenb.googlecode.com/files/saxscript.jar

written 9.4 years ago by Pierre Lindenbaum122k

I don't think it's your code. It's something local. I tested the updated version and it still issues the same error. All encodings are UTF-8. I tested 3 Java versions/flavors and nothing. :(

written 9.4 years ago by Jarretinha3.3k

Yes, that works fine! I tried ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz (about 200MB). There is a little typo in your ftp address under "invoking the SAX/js script": ftp:/tp.ncbi.nih.gov/... should be ftp://ftp.ncbi.nih.gov. Maybe that was Jarretinha's problem?

written 9.4 years ago by Michael Dondrup46k

thanks, I fixed the typo

written 9.4 years ago by Pierre Lindenbaum122k

I also fixed the script. (added condition 'observed==null') : the genotypes are now the first (digested) version of the possible mutations.

written 9.4 years ago by Pierre Lindenbaum122k

No, I'm using a local version of the XML files. I really don't understand what's wrong. It's probably something I missed in my Java installation.

written 9.4 years ago by Jarretinha3.3k
3
9.4 years ago by
Bergen, Norway
Michael Dondrup46k wrote:

Regarding the SNP detection part of my question, I just read through the Maq FAQ. If there are poly-allelic SNPs in the databases, they must have been detected somehow, and yes, Maq can be used to detect poly-allelic SNPs. At least according to the user manual, the call is:

maq cns2snp consensus.cns >cns.snp

Extract list of SNPs

Then following the FAQ:

Consensus Calling

  • What do those "S", "M" and so on mean in the cns2snp output?

They are IUB codes for heterozygotes. Briefly:

M=A/C, K=G/T, Y=C/T, R=A/G, W=A/T, S=G/C,
D=A/G/T, B=C/G/T, H=A/C/T, V=A/C/G,
N=A/C/G/T
  

I still have to test this on real data but at least in theory this should work and all types of SNPs can be predicted. This does not tell you anything about the validity of the called SNPs though. Should have a look at the papers mentioned by Jarretinha.
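
To make the FAQ table easier to apply, here is a minimal Python sketch that maps each IUB code to its set of bases and flags codes that imply three or more alleles. The ambiguity codes are taken from the table quoted above; the four single-base entries and the function names are additions made for this illustration.

```python
# IUB/IUPAC codes from the Maq FAQ table, plus the four unambiguous bases.
IUB_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "K": "GT", "Y": "CT", "R": "AG", "W": "AT", "S": "GC",
    "D": "AGT", "B": "CGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def alleles_for(code):
    """Return the set of bases represented by a single IUB code."""
    return set(IUB_CODES[code.upper()])


def implies_three_or_more(code):
    """True for codes that stand for three or four different bases."""
    return len(alleles_for(code)) >= 3


for code in "YDN":
    print(code, sorted(alleles_for(code)), implies_three_or_more(code))
```

Filtering the consensus column of the cns2snp output with a check like this would single out positions where the consensus suggests more than two bases. As noted above, that says nothing about whether the call is valid.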

Edit: SNP-detection

Did some more reading/thinking to get this right.

In principle, polymorphisms are studied by taking samples from members of (possibly multiple) populations (see e.g.: NCBI SNP primer, HapMap, Nature (2005)). If a second allele for a genomic position is prevalent in a significant part of the population (however this is defined, e.g. there was a >=1% criterion), then it becomes a SNP. If a third or fourth allele is discovered at the same locus, meets the detection criteria and is submitted, then this becomes what we find in the databases with the searches above.
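
As a rough illustration of that population-level view, the sketch below counts allele frequencies at one locus across a sample of diploid genotypes and keeps the alleles at or above a threshold. The genotype data are made up, and applying the 1% cut-off to every allele is an assumption for the example, not a formal definition.

```python
from collections import Counter


def alleles_above_threshold(genotypes, min_freq=0.01):
    """Return {allele: frequency} for alleles with frequency >= min_freq.

    genotypes is an iterable of per-individual allele tuples, e.g. ("A", "G").
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {
        allele: count / total
        for allele, count in counts.items()
        if count / total >= min_freq
    }


# Made-up sample of 100 diploid individuals at a single position.
sample = [("A", "A")] * 90 + [("A", "G")] * 6 + [("A", "T")] * 4

frequencies = alleles_above_threshold(sample)
print(sorted(frequencies), len(frequencies) >= 3)
```

In this toy sample all three alleles pass the 1% threshold, so the site would count as tri-allelic. With a stricter threshold such as 2.5%, the T allele drops out.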

To detect point mutations from a single sample by high-throughput sequencing is rather new and something very different.

Can this be called a SNP? Not immediately, because the prevalence in a population is not assessed. As Jarretinha stated, the number of point mutations that can be found for a single position depends on the ploidy of the organism. For human somatic cells (diploid) there are at most two different alleles possible in the consensus sequence (found by Maq) if the marker is heterozygous (e.g. A/C). If the reference is different at that position, this can mean that three alleles are known for the site in total (e.g. T/A/C).

If the sample has higher ploidy, or of course due to sequencing/alignment errors, more than two alleles can appear in the consensus, and that is why aligners support this.
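
The counting argument can be written down in a few lines of Python. This is only a sketch with hypothetical function names: it combines the reference base with the alleles called in one sample and checks whether the sample calls are plausible for a given ploidy.

```python
def alleles_at_site(reference_base, sample_alleles):
    """All distinct alleles known for a site: the reference plus the sample calls."""
    return {reference_base.upper()} | {allele.upper() for allele in sample_alleles}


def plausible_for_ploidy(sample_alleles, ploidy=2):
    """A single sample cannot carry more distinct alleles than its ploidy."""
    return len({allele.upper() for allele in sample_alleles}) <= ploidy


# Diploid heterozygote A/C against a reference T: three alleles in total.
print(sorted(alleles_at_site("T", ["A", "C"])))
# Three distinct bases in one diploid sample points to an error or higher ploidy.
print(plausible_for_ploidy(["A", "C", "G"]))
```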

modified 11 months ago by RamRS23k • written 9.4 years ago by Michael Dondrup46k

I know that MAQ can call a given SNP against a reference. But detecting a poly-allelic site is way more complicated. Consider the HIV case. Now you have a collection of reads from different subjects (i.e. a population sample) and need to decide whether the variation at a given locus is poly-allelic. How does MAQ do that?

written 9.4 years ago by Jarretinha3.3k

It does not. It is not possible to detect more than two alleles in one individual! I was misunderstanding the problem of SNP detection.

written 9.4 years ago by Michael Dondrup46k

From the population genetics point of view, SNP detection is quite complicated. It depends on sample size, effective population size and other hard-to-measure parameters. Using very conservative parameters to estimate the expected number of SNPs in humans gives us around 182000. So there is a lot of room in this direction. And no software at hand. This will be FUN!!!

written 9.4 years ago by Jarretinha3.3k

Do you see? Population genetics and bioinformatics were made for each other :)

written 9.4 years ago by Jarretinha3.3k
1
9.0 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

when talking about tri-allelic or even tetra-allelic SNPs one should carefully consider which resources to use. the largest SNP repositories around, like dbSNP or Ensembl, have some population data, although it is quite limited. although most of the currently known SNPs are considered to be bi-allelic, the only way of being sure that there isn't a third or fourth allele around is to genotype the site in more and more samples. even though this is the only thing we can do right now, trying to find poly-allelic sites using the currently established repositories is not the best idea, since the genotyping biases that all these repositories have had since they were created limit their use for certain queries, poly-allelic sites in particular.

although HapMap is a great population genetics resource, its bias towards detecting bi-allelic SNPs has led it to miss many sites that are now known to be tri-allelic, as mentioned in other answers here. the 1000 Genomes project aims to reveal such data, as it aims to detect variants with very low frequency (below 1%), so it will be IMHO the best place to look for such information. my group is currently working with 1000 Genomes pilot data, and although it is not final data we have detected over 10K tri-allelic sites in the human genome. so although it hasn't been published yet, I guess you can keep 1 tri-allelic site per 1000 SNPs as a valid rough figure in mind when you think about poly-allelic SNPs.

when doing de novo sequencing you cannot detect more than 2 possibilities for each SNP in a single sequence, so unless I have misunderstood your point you will never be able to detect a third allele from de novo sequencing unless you compare your sequence against a reference (sequence, sequences, variant list, ...) or against other sequences.

modified 9.0 years ago • written 9.0 years ago by Jorge Amigo11k
1
9.0 years ago by
Larry_Parnell16k
Boston, MA USA
Larry_Parnell16k wrote:

This is a response on why you want to find such tri- and multi-allelic SNPs. The m301 SNP (301 bp upstream of start site of transcription) of the CRP gene is tri-allelic. Not only does the SNP associate with CRP levels in plasma - CRP is a measure of inflammation - it also shows allele-specific interactions with fenofibrate, a drug taken to lower blood triglyceride levels, as pertaining to the reduction of CRP after a 3-week intervention of fenofibrate.

written 9.0 years ago by Larry_Parnell16k
1
6.5 years ago by
Erik Garrison2.2k
Somerville, MA
Erik Garrison2.2k wrote:

Although most variant detectors try to force everything into a biallelic model, polyallelic loci are surprisingly common in the genome, particularly as you consider haplotypes of increasing length. The accurate detection of these has been a major point of development in FreeBayes, and I would suggest its use in this context.
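
Once a caller has produced its output, polyallelic sites can be picked out with a short filter. The Python sketch below assumes the calls are in VCF, where alternative alleles are comma-separated in the ALT column. The VCF lines are made up for illustration, and nothing here is specific to FreeBayes.

```python
def multiallelic_records(vcf_lines):
    """Yield (chrom, pos, ref, alts) for records with two or more ALT alleles."""
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        alts = [allele for allele in alt.split(",") if allele != "."]
        if len(alts) >= 2:  # REF plus two ALTs means at least three alleles
            yield chrom, int(pos), ref, alts


# Made-up example records (tab-separated, first eight VCF columns).
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50\tPASS\t.",
    "chr1\t2000\t.\tC\tG,T\t60\tPASS\t.",
    "chr2\t3000\t.\tT\tA,C,G\t70\tPASS\t.",
]

for record in multiallelic_records(example_vcf):
    print(record)
```

A plain text filter like this only counts alleles; it does not assess call quality, so any thresholds on QUAL or depth would have to be added separately.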

written 6.5 years ago by Erik Garrison2.2k