Question

How To Detect And Query Poly-Allelic Snps?

8

15.3 years ago

Michael 55k

Today I had a lunchtime discussion with some colleagues about the existence of poly-allelic Single Nucleotide Polymorphisms (SNPs). Quoting Wikipedia to explain what I mean:

For example, two sequenced DNA fragments from different individuals, AAGCCTA to AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles : C and T. Almost all common SNPs have only two alleles.

My point was that there could be some, but I don't know of any SNPs that have, e.g., 3 alleles (such as A/G/T) or all 4 bases in a population. If I wanted to query a database such as BioMart, UCSC, or HapMap for this, how would I do it?

If I wanted to detect poly-allelic SNPs from sequencing data (de novo) using a reference sequence, which method/program could be used?

Edit: So far this seems to be really great stuff; I got at least two different methods of getting the query right! But what about de novo detection (sorry, of course a reference sequence can be used, so that's not really de novo)? Would Maq, for example, detect tri-allelic SNPs in short reads?

snp allele biomart dbsnp • 12k views

ADD COMMENT • link updated 20 months ago by Ram 45k • written 15.3 years ago by Michael 55k

1

That depends on your ploidy level. Cancer cell lines can be highly polyploid for certain regions/chromosomes, so it's possible for a given locus to carry 3 or more alleles. That's common in cultivated plants, too: bread wheat cultivars are typically hexaploid. So, your question is relevant, but not easy.

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

When you say de novo, does it mean no reference sequence/population? Or something like genotyping an intra-patient HIV population?

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

I meant using a reference sequence; maybe 'de novo' is not a good term here, I admit. As far as I know, MAQ can call SNPs from sequencing reads, for example. But would it also detect tri-allelic SNPs?

ADD REPLY • link 15.3 years ago by Michael 55k

0

To correct myself, the 'de-novo' part of my question was stupid: of course, from a single individual at most two different alleles can be discovered. See my answer below.

ADD REPLY • link 15.3 years ago by Michael 55k

3

15.3 years ago

Jarretinha 3.5k

This situation is rather cumbersome but quite common. Small deletions must be considered, too. There's a growing body of research on triallelic sites. You can check it yourself.

The only methodology I'm aware of is TriTyper, described here. I've checked HapMart. They don't have a simple way to look for triallelic sites!!! There are more than a thousand of these already detected in humans.

It's not hard to get the HapMap data and search for them. But, it's strange that it's not a default filter/attribute combination possibility.

Maybe Pierre could give you a tip?

--Edit--

After a somewhat long search on genotype calling, genotype imputation and poly-allelism detection, I can conclude with a high degree of confidence: there is no such software. Yes, I was unable to find any software that explicitly treats the problem. Actual detection relies on brute force (i.e., experimental verification), for which there are a lot of well-developed methodologies. But there is evidence that a large number of SNPs are miscalled right now. So, I recommend a paper and two reviews about this:

Genotyping Technologies for Genetic Research - Review

Genotype imputation - Review

Human Triallelic Sites: Evidence for a New Mutational Mechanism? - Interesting Paper

At the beginning I thought that locating and naming a SNP was a simple task. Now I can see that it's much more subtle. I checked the SNP definition. It includes indels too!!! So, genotype calling is a rather unexplored area. Most work deals only with biallelic sites for a given base.

-- Edit --

After researching a little bit more I've stumbled upon more software problems. All programs/algorithms/equations to estimate population genetics parameters (N_e, LD, recombination rate, etc.) from SNP data also assume biallelic loci with no indels.

-- Edit --

A paper with a partial solution.

Accurate detection and genotyping of SNPs utilizing population sequencing data

But, there is still more to it. I'm thinking about the problem.

ADD COMMENT • link updated 20 months ago by Ram 45k • written 15.3 years ago by Jarretinha 3.5k

3

15.3 years ago

Pierre Lindenbaum 166k

Hum, not sure I understood your question. Do you just want a sub-list of all the SNPs that are not classical (di-allelic) SNPs? In a previous post I described a small piece of code to run a SAX parser with a JavaScript analyser. I'll use it here, but it could be any SAX implementation. The following script is used to scan the (big) XML files from dbSNP. It records the rs## and the 'non-di-allelic' observed mutations.

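var rs=null;        // current rs identifier, e.g. "rs242"
var observed=null;  // first <Observed> string seen inside the current <Rs>
var content=null;   // text buffer, non-null only while inside <Observed>
// Matches a substitution between two single bases. NOTE: the pattern is not
// anchored, so it also matches inside strings such as A/G/T, which are then
// skipped; anchor it as /^[ATGC]\/[ATGC]$/i to report those as well.
var pat=/[ATGC]\/[ATGC]/i;
function startElement(uri,localName,name,atts)
        {
        if(localName=="Rs" && observed==null)
                {
                rs="rs"+atts.getValue("rsId");
                }
        else if(localName=="Observed" && observed==null)
                {
                // start buffering the text of the first <Observed>
                content="";
                }
        }
function characters(s)
        {
        if(content!=null) content+=s;
        }

function endElement(uri,localName,name)
        {
        if(localName=="Rs")
                {
                // print the record unless it looks like a base substitution;
                // the null check skips records without any <Observed>
                if(observed!=null && !observed.match(pat))
                        {
                        println(rs+"\t"+observed);
                        }
                rs=null;
                observed=null;
                }
        else if(observed==null && localName=="Observed")
                {
                observed=content;
                }
        content=null;
        }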

invoking the SAX/js script:

java -jar saxscript.jar  -f scansnps.js  ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch1.xml.gz

result:

rs242   -/T
rs1433  -/A
rs3483  -/ATTT
rs4179  -/CAGA
rs6485  -/AAT
rs7563  -/AA
rs16354 -/AACCCA
rs16356 -/ATT
rs16357 -/TGTGAA
rs16358 -/ATAA
rs16359 -/GA
rs16360 -/TA
rs16361 -/CT
rs16439 -/CAGA
rs16626 -/CGTGAAGTCC
rs16631 -/TA
rs16644 -/AAC
rs16670 -/TTTAA
rs16679 -/CT
rs16684 -/CTCTGGC
rs16686 -/AA
rs16725 -/ACAAA
rs16729 -/TCAAT
rs20418 -/TAG
rs20421 -/CGG
rs20431 -/T
rs25569 -/AC
rs25571 -/TT
rs140711        -/AG
rs140758        -/GAAA
rs140798        -/TTGT
rs140838        -/TCGT
rs140840        -/TT
rs140841        -/TAACTA
rs140849        -/GT
rs140858        -/CT
rs140861        -/CACT
rs140862        -/TAAG
rs140864        -/GAA
rs140865        -/ACTACATGA
rs171241        -/C
rs209574        -/GC
rs239965        -/T
rs284043        -/G
rs301740        -/TCTC
rs316275        -/T
rs319679        -/GTAG
rs332778        -/T
rs350183        -/A
rs365861        -/A
rs366664        -/A
rs390580        -/CTCTCT
rs391526        -/T
rs393900        -/T
rs398196        -/A
rs410422        -/TT
rs420240        -/T
rs431835        -/AGATAT
rs435486        -/AAAAA
rs446074        -/A
rs446693        -/GA
rs480306        -/TGG
rs481407        -/TATT
rs485475        -/CAACAACAC
rs486207        -/G
(....)

hope it helps

ADD COMMENT • link updated 6.8 years ago by Ram 45k • written 15.3 years ago by Pierre Lindenbaum 166k

0

You can get similar results with UCSC by replacing class="single" with class="in-del". Of course, you could use class="single" OR class="in-del". But indels aren't treated like SNPs, despite some being SNPs in the HapMap sense. You could have, for instance, A/T/G/AT. A -/A/T would make sense, but I haven't found any.

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

This SAX script can be modified. For example, with the very same script you can find the position of each SNP on any assembly by looking at the MapLoc tag. And this information is always available; no need to wait for UCSC to upload and digest it.

ADD REPLY • link 15.3 years ago by Pierre Lindenbaum 166k

0

That's great! I hadn't considered that point. Saw your tweet on dbSNP131, too. But somehow your scansnps.js isn't working here: it complains about the encoding. I tested the query on PubMed and it still works fine.

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

ah, the jar on google code is old ( not gzip). Let me update it...

ADD REPLY • link 15.3 years ago by Pierre Lindenbaum 166k

0

ok, I updated the jar file http://lindenb.googlecode.com/files/saxscript.jar

ADD REPLY • link 15.3 years ago by Pierre Lindenbaum 166k

0

I don't think it's your code. Something local. I tested the updated version and it still gives the same error. All encodings are UTF-8. Tested 3 Java versions/flavors and nothing. :(

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

Yes, that works fine! Tried ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz (about 200MB). There is a little typo in your ftp address under "invoking the SAX/js script": ftp:/tp.ncbi.nih.gov/... should be ftp://ftp.ncbi.nih.gov. Maybe that was Jarretinha's problem?

ADD REPLY • link 15.3 years ago by Michael 55k

0

thanks, I fixed the typo

ADD REPLY • link 15.3 years ago by Pierre Lindenbaum 166k

0

I also fixed the script. (added condition 'observed==null') : the genotypes are now the first (digested) version of the possible mutations.

ADD REPLY • link 15.3 years ago by Pierre Lindenbaum 166k

0

No, I'm using a local version of the XML files. I really don't understand what's wrong. It's probably something I missed in my Java installation.

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

3

15.3 years ago

Michael 55k

Regarding the SNP detection part of my question I just read through the Maq-FAQ.

If there are poly-allelic SNPs in the databases, they must have been detected somehow. And yes, Maq can be used to detect poly-allelic SNPs, at least according to the user manual. The call is:

maq cns2snp consensus.cns >cns.snp

Extract list of SNPs

Then following the FAQ:

Consensus Calling

What do those "S", "M" and so on mean in the cns2snp output?

They are IUB codes for heterozygotes. Briefly:
M=A/C, K=G/T, Y=C/T, R=A/G, W=A/T, S=G/C,
D=A/G/T, B=C/G/T, H=A/C/T, V=A/C/G,
N=A/C/G/T
  

I still have to test this on real data but at least in theory this should work and all types of SNPs can be predicted. This does not tell you anything about the validity of the called SNPs though. Should have a look at the papers mentioned by Jarretinha.
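To make the FAQ table concrete, here is a minimal sketch that expands one of these ambiguity codes into its bases and combines it with the reference base to list the alleles implied at a site. The function name allelesAtSite and the idea of passing the reference base separately are assumptions made for illustration; this is not part of Maq or of its output format.

```javascript
// Ambiguity codes as listed in the Maq FAQ quoted above.
const IUPAC = {
  A: "A", C: "C", G: "G", T: "T",
  M: "AC", K: "GT", Y: "CT", R: "AG", W: "AT", S: "CG",
  D: "AGT", B: "CGT", H: "ACT", V: "ACG",
  N: "ACGT"
};

// Hypothetical helper: distinct alleles implied at one site by the
// consensus code of the sample together with the reference base.
function allelesAtSite(refBase, consensusCode) {
  const bases = IUPAC[consensusCode.toUpperCase()];
  if (bases === undefined) {
    throw new Error("unknown code: " + consensusCode);
  }
  const alleles = new Set(bases.split(""));
  alleles.add(refBase.toUpperCase());
  return Array.from(alleles).sort();
}

// Reference T, sample called as M (A/C): three alleles at the site.
console.log(allelesAtSite("T", "M").join("/"));
```

A three- or four-base code in a diploid sample is more likely to reflect sequencing or alignment errors than a real genotype, so a result like this only flags a site for closer inspection.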

Edit: SNP-detection

Did some more reading/thinking to get this right.

In principle, polymorphisms are studied by taking samples from members of (possibly multiple) populations (see e.g.: NCBI SNP primer, HapMap, Nature (2005)). If a second allele for a genomic position is prevalent in a significant part of the population (however this is defined; e.g. there was a >=1% criterion), then it becomes a SNP. If a third or fourth allele is discovered at the same locus, meets the detection criteria and is submitted, then this becomes what we find in the databases with the searches above.
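The population criterion can be sketched as a simple allele count over a sample of genotypes. This is only an illustration under stated assumptions: the helper name commonAlleles, the genotype encoding (one string of bases per individual) and the example data are all made up, and real SNP discovery also has to account for sampling and genotyping error.

```javascript
// Hypothetical helper: given one genotype string per sampled individual
// (e.g. "AG" for a heterozygous diploid), return the alleles whose
// frequency in the sample reaches minFreq.
function commonAlleles(genotypes, minFreq = 0.01) {
  const counts = new Map();
  let total = 0;
  for (const genotype of genotypes) {
    for (const base of genotype.toUpperCase()) {
      counts.set(base, (counts.get(base) || 0) + 1);
      total += 1;
    }
  }
  const kept = [];
  for (const [base, n] of counts) {
    if (n / total >= minFreq) kept.push(base);
  }
  return kept.sort();
}

// Made-up sample of six diploid individuals at one position.
const sample = ["AA", "AG", "AG", "GG", "AT", "AA"];
console.log(commonAlleles(sample, 0.05).join("/")); // T passes at 1/12
console.log(commonAlleles(sample, 0.10).join("/")); // T is filtered out
```

The site counts as tri-allelic or bi-allelic depending only on where the threshold is set, which is why the frequency criterion matters for what ends up in the databases.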

To detect point mutations from a single sample by high-throughput sequencing is rather new and something very different.

Can this be called a SNP? Not immediately, because the prevalence in a population is not assessed. As Jarretinha stated, the number of point mutations that can be found for a single position depends on the ploidy of the organism. For human somatic cells (diploid) there are at most two different alleles possible in the consensus sequence (found by Maq), if the marker is heterozygous (e.g. A/C). If the reference differs from both at that position, this might indicate that more than two alleles exist there (e.g. T/A/C).

If the sample has higher ploidy, or of course due to sequencing/alignment errors, more than two alleles can appear in the consensus, and that is why aligners support this.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 15.3 years ago by Michael 55k

0

I know that MAQ can call a given SNP against a reference. But detecting a poly-allelic site is way more complicated. Consider the HIV case. Now you have a collection of reads from different subjects (i.e. a population sample) and need to decide if the variation at a given locus is poly-allelic. How does MAQ do that?

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

It does not. It is not possible to detect more than two alleles in one individual! I was misunderstanding the problem of SNP detection.

ADD REPLY • link 15.3 years ago by Michael 55k

0

From the population genetics point of view, SNP detection is quite complicated. It depends on sample size, effective population size and other hard-to-measure parameters. Using very conservative parameters to estimate the expected number of SNPs in humans gives us around 182000. So, there is a lot of room in this direction. And no software at hand. This will be FUN!!!

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

Do you see? Population genetics and bioinformatics were made for each other :)

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

1

14.9 years ago

Jorge Amigo 14k

when talking about tri-allelic or even tetra-allelic SNPs, one should carefully consider which resources to use. the largest SNP repositories, like dbSNP or Ensembl, have some population data, although it is quite limited. although most of the currently known SNPs are considered to be bi-allelic, the only way of being sure that there isn't a third or fourth allele around is to genotype the site in more and more samples. even though this is the only thing we can do right now, trying to find poly-allelic sites using the currently established repositories is not the best idea: the genotyping biases these repositories have had since they were created limit their usefulness for certain queries, poly-allelic sites in particular.

although HapMap is a great population genetics resource, its bias towards detecting bi-allelic SNPs has led it to miss many sites that are now known to be tri-allelic, as mentioned in other answers here. the 1000 Genomes project aims to reveal such data, since it is designed to detect variants with very low frequency (below 1%), so it will be IMHO the best place to look for such information. my group is currently working with 1000 Genomes pilot data, and although it is not final data we have detected over 10K tri-allelic sites in the human genome. it hasn't been published yet, but I guess you can keep 1 tri-allelic site per 1000 SNPs as a valid rough figure in mind when you think about poly-allelic SNPs.

when doing de novo sequencing you cannot detect more than 2 possibilities for each SNP in a single sequence, so unless I have misunderstood your point, you will never be able to detect a third allele from de novo sequencing unless you compare your sequence against a reference (sequence, sequences, variant list, ...) or against other sequences.

ADD COMMENT • link 14.9 years ago by Jorge Amigo 14k

1

14.9 years ago

Larry_Parnell 16k

This is a response on why you want to find such tri- and multi-allelic SNPs. The m301 SNP (301 bp upstream of start site of transcription) of the CRP gene is tri-allelic. Not only does the SNP associate with CRP levels in plasma - CRP is a measure of inflammation - it also shows allele-specific interactions with fenofibrate, a drug taken to lower blood triglyceride levels, as pertaining to the reduction of CRP after a 3-week intervention of fenofibrate.

ADD COMMENT • link 14.9 years ago by Larry_Parnell 16k

1

12.4 years ago

Erik Garrison ★ 2.4k

Although most variant detectors try to force everything into a biallelic model, polyallelic loci are surprisingly common in the genome, particularly as you consider haplotypes of increasing length. The accurate detection of these has been a major point of development in FreeBayes, and I would suggest its use in this context.

ADD COMMENT • link 12.4 years ago by Erik Garrison ★ 2.4k

Ram · Accepted Answer · 2010-03-26

6

15.3 years ago

Yuri ★ 1.7k

You can query the UCSC snp130 table. The fields of interest are observed and class. You can use their public MySQL server and run this SQL query:

SELECT name,chrom,chromStart,chromEnd,observed,class,avHet
  FROM snp130
  WHERE length(observed)>3 and class="single" and avHet>0

I've added a condition to find only SNPs with average heterozygosity over zero. It finds 4949 SNPs in hg19, 490 are 4-allelic.
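The query uses length(observed)>3 as a proxy for "more than two alleles", which works because class="single" restricts the rows to single-base alleles. As a hedged sketch, the same classification can be done explicitly on the observed strings once they are exported. The helper names and the example values below are assumptions for illustration and are not taken from the table.

```javascript
// Hypothetical helper: count the alleles in a UCSC-style `observed`
// string such as "A/G" or "A/C/G/T".
function countObservedAlleles(observed) {
  return observed.split("/").filter(a => a.length > 0).length;
}

// True when every allele is a single base, i.e. a pure substitution
// with no "-" (deletion) and no multi-base allele.
function isSingleBaseOnly(observed) {
  return observed.split("/").every(a => /^[ACGT]$/i.test(a));
}

// Made-up example values, not rows from snp130.
const examples = ["A/G", "A/C/T", "A/C/G/T", "-/AT"];
for (const obs of examples) {
  console.log(obs + "\t" + countObservedAlleles(obs) + "\t" + isSingleBaseOnly(obs));
}
```

Counting alleles this way also separates tri-allelic from tetra-allelic sites directly, instead of relying on string length.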

ADD COMMENT • link updated 6.8 years ago by Ram 45k • written 15.3 years ago by Yuri ★ 1.7k

0

Yeah! This works pretty well! But if you search both single and in-del, you will find SNPs with indels. I don't know why they're not in single. And it is rather strange to define poly-allelism with indels.

ADD REPLY • link 15.3 years ago by Jarretinha 3.5k

0

Yes, confirmed, got exactly the same number! Just need to call USE hg19; before. This method is equally valid.

ADD REPLY • link 15.3 years ago by Michael 55k