Question

How To Distinguish Alleles From Paralogs

7

12.5 years ago

Mark ▴ 70

Hey, guys!

Sorry for the noob question, but I'm trying to figure out how to distinguish alleles from paralogs. I'm working in a non-model system, so I can't do complementation tests or anything like that.

Do I just look at the context (i.e. surrounding sequence)?

How have people done this?

Thanks so much, guys!! Much appreciated!

CLARIFICATION: So, we're looking for opsin genes in fireflies. We're probably going to do RNAseq using tissue that should have opsins at their peak expression levels. From that sequence data, our hope is to be able to tell how many different opsin genes there are in our species of firefly (very little is known about opsin gene count in fireflies, but there are at least two opsins).

allele gene • 8.0k views

ADD COMMENT • link updated 7.4 years ago by Biostar 20 • written 12.5 years ago by Mark ▴ 70

2

Can you please provide more information? Is your approach experimental or bioinformatic? Could you explain what you have tried so far and what you are having difficulty with? I can't understand the question as it is written now.

ADD REPLY • link 12.5 years ago by Giovanni M Dall'Olio 28k

2

I agree with Giovanni. This could be a very good question if you can give more detail on what you've done and what did and did not succeed.

ADD REPLY • link 12.5 years ago by Larry_Parnell 16k

0

I'd be especially appreciative if you could point me to relevant literature. I know that the copy number variant work in humans relies on detecting different levels of some kind of probe that binds to the paralogs, but I don't expect variation in copy number; I just don't know how many paralogs there are. Hopefully, that helps clarify a little more.

ADD REPLY • link 12.5 years ago by Mark ▴ 70

0

So, we're looking for opsin genes in fireflies. We're probably going to do RNAseq using tissue that should have opsins at their peak expression levels. From that sequence data, our hope is to be able to tell how many different opsin genes there are in our species of firefly (very little is known about opsin gene count in fireflies, but there are at least two opsins). Does this help?

ADD REPLY • link 12.5 years ago by Mark ▴ 70

0

Hopefully, the update helps?

ADD REPLY • link 12.5 years ago by Mark ▴ 70

score 3 · Answer 1 · 2011-11-10

3

12.5 years ago

David Quigley 11k

Inherent in the definition of "paralogous sequences" is that the sequences which are paralogs are found at separate loci in the same genome. This is typically thought to occur through gene duplication. Alleles are heritable sequence variations which by definition occupy a single place in the genome.

If any given organism in your population has one of two variations of gene X at the same place in the genome, those variations in X are alleles. If you have an individual with gene X and gene Y at different places in the genome, and X and Y are very similar to each other, X and Y are paralogs.

ADD COMMENT • link 12.5 years ago by David Quigley 11k

0

This implies you have some physical mapping of the genes. I doubt this is the case since they are looking at a non model species. Your method assumes you have some prior genomic information or gene physical mapping.

ADD REPLY • link 12.5 years ago by Philippe ★ 1.9k

score 3 · Answer 2 · 2011-11-10

I wanted to comment on Casey Bergman's reply but it got too long and I was hitting the word limit for a comment...

I think it could be informative to look at the distribution of sequence divergence among the copies you want to classify as paralogs or alleles. You would expect one pool of sequence pairs with similarly low divergence (typically alleles, but also a few very recent duplicates) and another pool with higher divergence and more variable sequence identity (most likely paralogs).
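As a rough illustration of the idea, here is a minimal sketch that computes the uncorrected pairwise divergence (p-distance) between already-aligned sequences, so that the distribution can be inspected. The sequence names and toy sequences are hypothetical, and a real analysis would use codon-aware alignments and a proper substitution model rather than raw p-distance.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            if x != y:
                diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


def pairwise_divergence(seqs):
    """Return {(name1, name2): p-distance} for every pair of sequences."""
    return {
        (n1, n2): p_distance(s1, s2)
        for (n1, s1), (n2, s2) in combinations(seqs.items(), 2)
    }


# Hypothetical toy alignment, for illustration only.
toy = {
    "opsinA_1": "ATGGCCAAGTTTGGA",
    "opsinA_2": "ATGGCCAAGTTCGGA",
    "opsinB": "ATGACCAAATTTGCA",
}
divergence = pairwise_divergence(toy)
for pair, d in sorted(divergence.items(), key=lambda kv: kv[1]):
    print(pair, round(d, 3))
```

Sorting or plotting these values is one way to see whether the pairs fall into a low-divergence pool and a higher-divergence pool.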

In addition, comparing your genes with orthologous genes in closely related species might give you insight into the rate of evolution of protein-coding genes in your species of interest. With such a "molecular clock" you can estimate the age of potential paralogs from dS (the number of synonymous substitutions per synonymous site, also written Ks). After a closer look at your data, you can then try to define a threshold that separates pairs with a Ks high enough to be likely paralogs from pairs with a low Ks, which are potential alleles.

Of course you will still have some false positives in both categories since some paralogous genes can be highly conserved (even at synonymous sites, most likely if they are recent) and some alleles can have more nucleotide differences than average...
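To make the molecular clock and threshold idea concrete, here is a small sketch. It assumes a constant synonymous substitution rate, a known split time for the related species (the numbers below are made up), and Ks values that have already been estimated with some other tool. The function names are hypothetical.

```python
def synonymous_rate(ks_orthologs, split_time_years):
    """Substitutions per synonymous site per year, from an ortholog pair.

    Divergence accumulates along both lineages, hence the factor of 2.
    """
    return ks_orthologs / (2.0 * split_time_years)


def duplication_age(ks_pair, rate):
    """Approximate age (years) of the split between two copies."""
    return ks_pair / (2.0 * rate)


def classify(ks_pair, threshold):
    """Label a pair by comparing its Ks to a user-chosen threshold."""
    return "putative paralog" if ks_pair > threshold else "putative allele"


# Made-up numbers, for illustration only.
rate = synonymous_rate(ks_orthologs=0.2, split_time_years=10_000_000)
age = duplication_age(ks_pair=0.05, rate=rate)
label_high = classify(0.05, threshold=0.02)
label_low = classify(0.005, threshold=0.02)
print(rate, age, label_high, label_low)
```

The threshold itself has to come from your own data, and the labels stay "putative" for exactly the reason given above: recent paralogs and unusually divergent alleles will land on the wrong side.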

score 2 · Answer 3 · 2011-11-10

2

12.5 years ago

Casey Bergman 18k

Genomic contiguity is the definitive data, but if you are only sampling the transcriptome, this may not work beyond finding alternative splice forms. Another approach would be to assume that some/most paralogues will be more divergent than alleles, then resequence alleles at a few single copy loci, measure silent site diversity at these loci, and then use this as a baseline to find putative paralogues that have a statistically greater Ks than the distribution of allelic silent site diversity.
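One way to sketch the baseline comparison is an empirical one-sided test: count how many of the allelic silent-site diversity values are at least as large as the Ks of a candidate pair. This is only an assumption about how "statistically greater" could be implemented, and the function names and numbers below are hypothetical.

```python
def empirical_p_value(ks, baseline):
    """One-sided empirical p-value of ks against baseline allelic diversity.

    Uses the (k + 1) / (n + 1) correction so the p-value is never zero.
    """
    if not baseline:
        raise ValueError("baseline must not be empty")
    exceed = sum(1 for b in baseline if b >= ks)
    return (exceed + 1) / (len(baseline) + 1)


def flag_putative_paralogs(candidates, baseline, alpha=0.05):
    """Return the candidate pairs whose Ks exceeds the allelic baseline."""
    return {
        name: ks
        for name, ks in candidates.items()
        if empirical_p_value(ks, baseline) <= alpha
    }


# Made-up silent-site diversity at 19 single-copy loci.
baseline = [
    0.001, 0.002, 0.003, 0.004, 0.005, 0.006, 0.007, 0.008, 0.009, 0.010,
    0.011, 0.012, 0.013, 0.014, 0.015, 0.016, 0.017, 0.018, 0.019,
]
# Made-up Ks values for two candidate sequence pairs.
candidates = {"pair_1": 0.15, "pair_2": 0.010}
flagged = flag_putative_paralogs(candidates, baseline)
print(flagged)
```

Note that with n baseline loci the smallest achievable p-value is 1 / (n + 1), so "a few" loci will not be enough to reach a conventional 0.05 cutoff with this particular test; a parametric comparison would need fewer loci but makes stronger assumptions.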

ADD COMMENT • link 12.5 years ago by Casey Bergman 18k

0

Thanks, Casey! I'm satisfied with this. It does beg the question of how to "resequence alleles at a few single copy loci" without already knowing that each is a single locus (I'll just have to design locus-specific primers, I'm assuming?). Thanks so much!

ADD REPLY • link 12.5 years ago by Mark ▴ 70