How To Detect And Query Poly-Allelic SNPs?
11.6 years ago

Today I had a lunchtime discussion with some colleagues about the existence of poly-allelic Single Nucleotide Polymorphisms (SNPs). Quoting Wikipedia to explain what I mean:

For example, two sequenced DNA fragments from different individuals, AAGCCTA to AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles : C and T. Almost all common SNPs have only two alleles.

My point was that such SNPs could exist, but I don't know of any that have e.g. 3 alleles (such as A/G/T) or all 4 bases in a population. If I want to query a database such as BioMart, UCSC or HapMap for this, how would I do it?

If I wanted to detect poly-allelic SNPs from sequencing data (de novo) using a reference sequence, which method or program could be used?

Edit: So far this is really great stuff; I've got at least two different methods for getting the query right! But what about de novo detection (sorry, of course a reference sequence can be used, so it's not really de novo)? Would Maq, for example, detect tri-allelic SNPs in short reads?

snp allele biomart dbsnp • 8.8k views

That depends on your ploidy level. Cancer cell lines can be highly polyploid for certain regions/chromosomes. So it's possible that a given locus carries 3 or more alleles. That's common in cultivated plants, too. Bread wheat cultivars are hexaploid. So your question is relevant, but not easy.

When you say de novo, do you mean no reference sequence/population? Or something like genotyping an intra-patient HIV population?

I meant using a reference sequence; maybe "de novo" is not a good term here, I admit. As far as I know, MAQ can, for example, call SNPs from sequencing reads. But would it also detect tri-allelic SNPs?

To correct myself, the 'de-novo' part of my question was stupid: of course, from a single individual at most two different alleles can be discovered. See my answer below.

11.6 years ago
Yuri ★ 1.6k

You can query the UCSC snp130 table. The fields of interest are observed and class. You can use their public MySQL server and run this SQL query:

SELECT name,chrom,chromStart,chromEnd,observed,class,avHet
FROM snp130
WHERE length(observed)>3 and class="single" and avHet>0


I've added a condition to find only SNPs with average heterozygosity over zero. It finds 4949 SNPs in hg19, 490 are 4-allelic.
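
To see why `length(observed)>3` selects poly-allelic entries, here is a minimal Python sketch that runs the same filter against an in-memory SQLite table. The table only mimics the snp130 columns used in the query above; every row and every `rs_toy` name is invented for illustration and does not come from UCSC.

```python
import sqlite3

# Toy stand-in for the UCSC snp130 table; all rows are made up for illustration.
rows = [
    ("rs_toy1", "chr1", 100, 101, "A/G", "single", 0.30),
    ("rs_toy2", "chr1", 200, 201, "A/G/T", "single", 0.42),
    ("rs_toy3", "chr2", 300, 301, "A/C/G/T", "single", 0.10),
    ("rs_toy4", "chr2", 400, 401, "-/AT", "in-del", 0.25),
    ("rs_toy5", "chr3", 500, 501, "C/G/T", "single", 0.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE snp130 (name TEXT, chrom TEXT, chromStart INTEGER, "
    "chromEnd INTEGER, observed TEXT, class TEXT, avHet REAL)"
)
conn.executemany("INSERT INTO snp130 VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# Same filter as the query above: "A/G" has length 3, so length > 3 keeps
# only single-base records that list three or more alleles.
hits = conn.execute(
    "SELECT name, observed FROM snp130 "
    "WHERE length(observed) > 3 AND class = 'single' AND avHet > 0 "
    "ORDER BY name"
).fetchall()

# Count the alleles of each hit by splitting the observed string on '/'.
allele_counts = {name: len(observed.split("/")) for name, observed in hits}
four_allelic = [name for name, n in allele_counts.items() if n == 4]

print(allele_counts)
print(four_allelic)
```

Splitting `observed` on `/` is also one way to separate the 3-allelic from the 4-allelic hits once the real query results are downloaded.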

Yeah! This works pretty well! But if you search both single and in-del you will find SNPs with indels. I don't know why they're not under single. And it is rather strange to define poly-allelism with indels.

Yes, confirmed, I got exactly the same number! You just need to call USE hg19; first. This method is equally valid.

11.6 years ago

This situation is rather cumbersome but quite common. Small deletions must be considered, too. There's a growing body of research on triallelic sites. You can check it yourself.

The only methodology I'm aware of is TriTyper, described here. I've checked HapMart. They don't have a simple way to look for triallelic sites! There are more than a thousand of these already detected in humans.

It's not hard to get the HapMap data and search for them. But it's strange that this isn't available as a default filter/attribute combination.

Maybe Pierre could give you a tip?

--Edit--

After a somewhat long search on genotype calling, genotype imputation and poly-allelism detection, I can conclude with a high degree of confidence: there is no such software. Yes, I was unable to find any software that explicitly treats the problem. Actual detection relies on brute force (i.e., experimental verification), for which there are a lot of well-developed methodologies. But there is evidence that a large number of SNPs are miscalled right now. So I do recommend a paper and two reviews about this:

Genotype imputation - Review

Human Triallelic Sites: Evidence for a New Mutational Mechanism? - Interesting Paper

At the beginning I thought that locating and naming a SNP was a simple task. Now I can see that it's much more subtle. I checked the SNP definition. It includes indels too! So genotype calling is a rather unexplored area. Most works just deal with biallelic sites for a given base.

-- Edit --

After researching a little bit more I've stumbled upon more software problems. All programs/algorithms/equations for estimating population genetics parameters (N_e, LD, recombination rate, etc.) from SNP data also assume biallelic loci with no indels.

-- Edit --

A paper with a partial solution.

Accurate detection and genotyping of SNPs utilizing population sequencing data

But, there is still more to it. I'm thinking about the problem.

11.6 years ago

Hum, not sure I understood your question. Do you just want a sub-list of all the non-classical (i.e. not di-allelic) SNPs? In a previous post I described a small piece of code that runs a SAX parser with a JavaScript analyser. I'll use it here, but any SAX implementation would do. The following script scans the (big) XML files from dbSNP. It records the rs## and the 'non-di-allelic' observed mutations.

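// State shared between the SAX callbacks.
var rs=null;        // current rs identifier, e.g. "rs242"
var observed=null;  // first <Observed> string seen inside the current <Rs>
var content=null;   // text buffer, non-null only while inside <Observed>
// Note: pat is not anchored, so any string that contains base/base
// (including e.g. A/G/T) matches and is therefore not printed.
var pat=/[ATGC]\/[ATGC]/i;

function startElement(uri,localName,name,atts)
    {
    if(localName=="Rs" && observed==null)
        {
        rs="rs"+atts.getValue("rsId");
        }
    else if(localName=="Observed" && observed==null)
        {
        // start collecting the text of the first <Observed> element
        content="";
        }
    }

function characters(s)
    {
    if(content!=null) content+=s;
    }

function endElement(uri,localName,name)
    {
    if(localName=="Rs")
        {
        // guard against records without an <Observed> element
        if(observed!=null && !observed.match(pat))
            {
            println(rs+"\t"+observed);
            }
        rs=null;
        observed=null;
        }
    else if(observed==null && localName=="Observed")
        {
        observed=content;
        }
    content=null;
    }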


invoking the SAX/js script:

java -jar saxscript.jar  -f scansnps.js  ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch1.xml.gz


result:

rs242   -/T
rs1433  -/A
rs3483  -/ATTT
rs4179  -/CAGA
rs6485  -/AAT
rs7563  -/AA
rs16354 -/AACCCA
rs16356 -/ATT
rs16357 -/TGTGAA
rs16358 -/ATAA
rs16359 -/GA
rs16360 -/TA
rs16361 -/CT
rs16439 -/CAGA
rs16626 -/CGTGAAGTCC
rs16631 -/TA
rs16644 -/AAC
rs16670 -/TTTAA
rs16679 -/CT
rs16684 -/CTCTGGC
rs16686 -/AA
rs16725 -/ACAAA
rs16729 -/TCAAT
rs20418 -/TAG
rs20421 -/CGG
rs20431 -/T
rs25569 -/AC
rs25571 -/TT
rs140711        -/AG
rs140758        -/GAAA
rs140798        -/TTGT
rs140838        -/TCGT
rs140840        -/TT
rs140841        -/TAACTA
rs140849        -/GT
rs140858        -/CT
rs140861        -/CACT
rs140862        -/TAAG
rs140864        -/GAA
rs140865        -/ACTACATGA
rs171241        -/C
rs209574        -/GC
rs239965        -/T
rs284043        -/G
rs301740        -/TCTC
rs316275        -/T
rs319679        -/GTAG
rs332778        -/T
rs350183        -/A
rs365861        -/A
rs366664        -/A
rs390580        -/CTCTCT
rs391526        -/T
rs393900        -/T
rs398196        -/A
rs410422        -/TT
rs420240        -/T
rs431835        -/AGATAT
rs435486        -/AAAAA
rs446074        -/A
rs446693        -/GA
rs480306        -/TGG
rs481407        -/TATT
rs485475        -/CAACAACAC
rs486207        -/G
(....)


hope it helps
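
The output above contains only indels, and as far as I can tell that is because the pattern in the script is not anchored: a string such as A/G/T contains A/G, so it matches and is skipped. If the goal is to list tri- and tetra-allelic substitutions, the check needs to look at the whole string. Below is a minimal Python sketch of such a classification. The function name `classify_observed` and the category labels are hypothetical, and it only assumes the slash-separated format seen in the output above.

```python
import re

# Anchored pattern: the whole string must be single bases separated by '/'.
SUBSTITUTION = re.compile(r"^[ACGT](/[ACGT])+$", re.IGNORECASE)


def classify_observed(observed):
    """Classify a dbSNP-style 'observed' string such as 'A/G' or '-/AT'."""
    if not SUBSTITUTION.match(observed):
        # indels, multi-base alleles and anything else that is not base/base/...
        return "other"
    n_alleles = len(set(observed.upper().split("/")))
    return "poly-allelic" if n_alleles >= 3 else "di-allelic"


for obs in ["A/G", "A/G/T", "A/C/G/T", "-/T", "-/CAGA"]:
    print(obs, classify_observed(obs))
```

The same idea could be ported back into the SAX script by anchoring the JavaScript pattern with `^` and `$`, at the cost of the output no longer matching the list shown above.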

You can get similar results from UCSC by replacing class="single" with class="in-del". Of course you could use class="single" OR class="in-del". But indels aren't treated like SNPs, despite some being SNPs in the HapMap sense. You could have, for instance, an A/T/G/AT. A -/A/T makes sense, but I haven't found any.

This SAX script can be modified. For example, with the very same script you can find the positions of each SNP on any assembly by looking at the MapLoc tag. And this information is always available; no need to wait for UCSC to upload and digest it.

That's great! I hadn't considered that point. Saw your tweet on dbSNP131, too. But somehow your scansnps.js isn't working here: it complains about encoding. I tested the query on PubMed and it still works fine.

ah, the jar on google code is old ( not gzip). Let me update it...

I don't think it's your code. It's something local. I tested the updated version and it still gives the same error. All encodings are UTF-8. I tested 3 Java versions/flavors and nothing worked. :(

Yes, that works fine! I tried ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/XML/ds_ch22.xml.gz (about 200MB). There is a little typo in your ftp address under "invoking the SAX/js script": ftp:/tp.ncbi.nih.gov/... should be ftp://ftp.ncbi.nih.gov. Maybe that was Jarretinha's problem?

thanks, I fixed the typo

I also fixed the script. (added condition 'observed==null') : the genotypes are now the first (digested) version of the possible mutations.

No, I'm using a local version of the XML files. I really don't understand what's wrong. It's probably something I missed in my Java installation.

11.6 years ago

Regarding the SNP detection part of my question, I just read through the Maq FAQ.

If there are poly-allelic SNPs in the databases, they must have been detected somehow, and yes, Maq can be used to detect poly-allelic SNPs. At least according to the user manual, the call is:

maq cns2snp consensus.cns >cns.snp


Extract list of SNPs

Then following the FAQ:

Consensus Calling

• What do those "S", "M" and so on mean in the cns2snp output?

They are IUB codes for heterozygotes. Briefly:

M=A/C, K=G/T, Y=C/T, R=A/G, W=A/T, S=G/C,
D=A/G/T, B=C/G/T, H=A/C/T, V=A/C/G,
N=A/C/G/T
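
As a rough illustration of how these codes could be used to flag candidate poly-allelic calls, here is a small Python sketch. The table is copied from the FAQ excerpt above; the helper names `alleles_for` and `is_poly_allelic_call` are hypothetical and are not part of Maq.

```python
# Ambiguity codes as listed in the Maq FAQ excerpt above.
IUB_CODES = {
    "M": "AC", "K": "GT", "Y": "CT", "R": "AG", "W": "AT", "S": "CG",
    "D": "AGT", "B": "CGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def alleles_for(code):
    """Return the set of bases implied by one consensus character."""
    code = code.upper()
    if code in ("A", "C", "G", "T"):
        return {code}  # plain base: a single allele
    return set(IUB_CODES[code])


def is_poly_allelic_call(code):
    """True if the consensus character stands for three or more bases."""
    return len(alleles_for(code)) >= 3


for code in "ARDN":
    print(code, sorted(alleles_for(code)), is_poly_allelic_call(code))
```

Note that N usually just means "unknown", so a filter like this would probably want to treat N separately from D, B, H and V.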


I still have to test this on real data, but at least in theory this should work and all types of SNPs can be predicted. This does not tell you anything about the validity of the called SNPs, though. I should have a look at the papers mentioned by Jarretinha.

## Edit: SNP-detection

Did some more reading/thinking to get this right.

In principle, polymorphisms are studied by taking samples from members of (possibly multiple) populations (see e.g. NCBI SNP primer, HapMap, Nature (2005)). If a second allele at a genomic position is present in a significant part of the population (however this is defined; e.g. there was a >=1% criterion), then the position becomes a SNP. If a third or fourth allele is discovered at the same locus, meets the detection criteria and is submitted, then this becomes what we find in the databases with the searches above.
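
A minimal Python sketch of that frequency criterion, assuming genotypes are given as short strings with one character per chromosome copy. The function name `common_alleles`, the 1% default and the sample numbers are all made up for illustration.

```python
from collections import Counter


def common_alleles(genotypes, min_freq=0.01):
    """Return {allele: frequency} for alleles at or above min_freq.

    genotypes: iterable of strings such as "AG", one character per
    chromosome copy of one individual.
    """
    counts = Counter(base for gt in genotypes for base in gt)
    total = sum(counts.values())
    return {b: n / total for b, n in counts.items() if n / total >= min_freq}


# Hypothetical sample of 100 diploid individuals at a single site.
sample = ["AA"] * 88 + ["AG"] * 8 + ["AT"] * 4
freqs = common_alleles(sample)

print(freqs)
print("tri-allelic at 1%:", len(freqs) >= 3)
print("alleles at 3%:", sorted(common_alleles(sample, min_freq=0.03)))
```

With these toy numbers the site counts as tri-allelic under a 1% threshold but only bi-allelic under a 3% threshold, which shows how much the answer depends on the chosen criterion and on the sample.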

Detecting point mutations from a single sample by high-throughput sequencing is rather new and something very different.

Can this be called a SNP? Not immediately, because the prevalence in a population is not assessed. As Jarretinha stated, the number of point mutations that can be found at a single position depends on the ploidy of the organism. For human somatic cells (diploid) there are at most two different alleles possible in the consensus sequence (found by Maq) if the marker is heterozygous (e.g. A/C). If the reference differs from both at that position, this implies that at least three alleles exist at the site (e.g. T/A/C).

If the sample has higher ploidy, or of course due to sequencing/alignment errors, more than two alleles can appear in the consensus, and that is why aligners support this.
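
The reference-plus-sample reasoning can be written down as a tiny Python sketch. The function name `alleles_at_site` is hypothetical; the genotype is assumed to be a string with one base per chromosome copy.

```python
def alleles_at_site(reference_base, sample_genotype):
    """Distinct alleles known at one site from a reference base and one sample.

    sample_genotype: string with one base per chromosome copy, e.g. "AC"
    for a heterozygous diploid call.
    """
    return sorted(set(sample_genotype.upper()) | {reference_base.upper()})


# Heterozygous diploid sample A/C against reference T: three alleles in total.
print(alleles_at_site("T", "AC"))

# Same sample against reference A: only two alleles are known.
print(alleles_at_site("A", "AC"))

# A hypothetical tetraploid sample can carry more than two alleles by itself.
print(alleles_at_site("A", "ACGG"))
```

The number of distinct alleles a single sample can contribute is bounded by the length of its genotype string, i.e. by its ploidy.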

I know that MAQ can call a given SNP against a reference. But detecting a poly-allelic site is way more complicated. Consider the HIV case. Now you have a collection of reads from different subjects (i.e. a population sample) and need to decide if the variation at a given locus is poly-allelic. How does MAQ do that?

It does not. It is not possible to detect more than two alleles in one individual! I misunderstood the problem of SNP detection.

From the population genetics point of view, SNP detection is quite complicated. It depends on sample size, effective population size and other hard-to-measure parameters. Using very conservative parameters to estimate the expected number of SNPs in humans gives us around 182000. So there is a lot of room in this direction. And no software at hand. This will be FUN!!!

Do you see? Population genetics and bioinformatics were made for each other :)

11.2 years ago

When talking about tri-allelic or even tetra-allelic SNPs, one should think carefully about which resources to use. The largest SNP repositories around, like dbSNP or Ensembl, have some population data, although it is quite limited. Although most of the currently known SNPs are considered to be bi-allelic, the only way of being sure that there isn't a third or fourth allele around is to genotype the site in more and more samples. Even though this is the only thing we can do right now, trying to find poly-allelic sites using the currently established repositories is not the best idea, since the genotyping biases these repositories have had since they were created limit their use for certain queries, poly-allelic sites in particular.

Although HapMap is a great population genetics resource, its bias towards detecting bi-allelic SNPs has led it to miss many sites that are now known to be tri-allelic, as mentioned in other answers here. The 1000 Genomes project aims to reveal such data, as it is aiming to detect variants with very low frequency (below 1%), so it will be IMHO the best place to look for such information. My group is currently working with 1000 Genomes pilot data, and although it is not final data we have detected over 10K tri-allelic sites in the human genome. So although it hasn't been published yet, I guess you can keep 1 tri-allelic site per 1000 SNPs in mind as a valid rough figure when you think about poly-allelic SNPs.

When doing de novo sequencing you cannot detect more than 2 possibilities for each SNP in a single sequence, so unless I have misunderstood your point you will never be able to detect a third allele from de novo sequencing unless you compare your sequence against a reference (sequence, sequences, variant list, ...) or against other sequences.

11.2 years ago

This is a response on why you want to find such tri- and multi-allelic SNPs. The m301 SNP (301 bp upstream of start site of transcription) of the CRP gene is tri-allelic. Not only does the SNP associate with CRP levels in plasma - CRP is a measure of inflammation - it also shows allele-specific interactions with fenofibrate, a drug taken to lower blood triglyceride levels, as pertaining to the reduction of CRP after a 3-week intervention of fenofibrate.

8.7 years ago
Erik Garrison ★ 2.3k

Although most variant detectors try to force everything into a biallelic model, polyallelic loci are surprisingly common in the genome, particularly as you consider haplotypes of increasing length. The accurate detection of these has been a major point of development in FreeBayes, and I would suggest its use in this context.
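
FreeBayes writes its calls as VCF, and in VCF a site with several alternate alleles lists them comma-separated in the ALT column. A minimal Python sketch for pulling such records out of a VCF could look like the following. The function name `multiallelic_records` is hypothetical and the VCF lines are invented toy data, not real FreeBayes output.

```python
def multiallelic_records(vcf_lines):
    """Yield (chrom, pos, ref, alts) for records with more than one ALT allele."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        # ALT is comma-separated; '.' means no alternate allele
        alts = [a for a in alt.split(",") if a != "."]
        if len(alts) > 1:
            yield chrom, pos, ref, alts


# Invented toy records, tab-separated as in a real VCF body.
toy_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t1000\t.\tA\tG\t50\t.\t.",
    "chr1\t2000\t.\tC\tG,T\t60\t.\t.",
    "chr2\t3000\t.\tT\tA,C,G\t70\t.\t.",
]

found = list(multiallelic_records(toy_vcf))
for record in found:
    print(record)
```

For a real file the same function can be fed an open file handle instead of the toy list; compressed VCFs would need `gzip.open(path, "rt")`.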