Distinguishing between gene duplication and variation
16 months ago
Dunois ★ 2.5k

I'm interested in finding out whether a subset of sequences I have in my transcriptome assembly (bulk RNA-seq with tissue pooled from multiple individuals from the same species) are the result of genetic variation or gene duplication. The species does not have a genome available.

I realize that I would probably have to construct and examine an MSA, but if this is the case, what specific patterns should I be looking for that would allow me to distinguish between the two? I would also be grateful for any pointers to publications and/or tools.

sequence-analysis • 823 views

Thinking out loud rather than providing a specific answer.

This is probably going to be tough to determine with the data you have. Since genes aren't expressed at the same level, there is no easy quantitative way to infer duplication. Paralogs are under pressure to differentiate/change, so the level of variation you are seeing in the sequences may be informative. This sounds more like an "allelic imbalance" type of situation (where expression from two alleles may look like a duplication), but most of the packages that analyze this probably need a reference genome (which you do not have).

I think @Pierre had a way to export aligned reads from a region of a BAM file as aligned FASTA, so you may not strictly need to build MSAs.


Following @GenoMax in thinking out loud...

If there is within-read diversity (i.e., multiple polymorphisms within individual reads/read pairs) AND the individual samples are indexed AND the organism's ploidy is known, then you might be able to reconstruct partial haplotypes to address the question. If the number of haplotypes for an individual exceeds its ploidy, then the gene is duplicated.
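
The haplotype-counting idea above can be sketched roughly as follows. This is a minimal illustration, not an established tool: it assumes the reads come from one indexed individual, that each read (or read pair) has already been reduced to a string of its alleles at the linked polymorphic sites, and that a hypothetical `min_reads` cutoff is enough to discard haplotypes created by sequencing errors. The function names are made up for this example.

```python
from collections import Counter


def supported_haplotypes(read_haplotypes, min_reads=3):
    """Return haplotypes seen in at least `min_reads` reads.

    `read_haplotypes` is a list of strings, one per read or read pair,
    giving the alleles observed at the linked variant sites (e.g. "AC").
    `min_reads` is an assumed, arbitrary cutoff to filter likely errors.
    """
    counts = Counter(read_haplotypes)
    return {hap: n for hap, n in counts.items() if n >= min_reads}


def exceeds_ploidy(read_haplotypes, ploidy=2, min_reads=3):
    """True if one individual shows more supported haplotypes than its ploidy."""
    return len(supported_haplotypes(read_haplotypes, min_reads)) > ploidy


# Toy data for a single diploid individual (invented for illustration).
reads = ["AC"] * 10 + ["GT"] * 8 + ["AT"] * 6 + ["GC"] * 1
print(supported_haplotypes(reads))
print(exceeds_ploidy(reads, ploidy=2))
```

With pooled, unindexed samples the number of haplotypes can exceed the ploidy without any duplication, so this check only makes sense per individual.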

SNP frequency differences are unlikely to be informative for the reasons mentioned above.

And I assume by 'tools' you mean computational rather than molecular/bench methods (probably a safe assumption, given the forum).


Thank you all for your input. I will respond here to all three of you who provided feedback to keep things concise.

To elaborate, I've actually already run OrthoFinder on translated protein sequences (something @seidel pointed out in their comment), and the subset of sequences I am looking at are potential paralogs. Because the assembly comes from bulk RNA-seq of a pool of (around 40) individuals, we want to try to see whether this is actual paralogy or the result of genetic variation.

"AND the individual samples are indexed AND the organism's ploidy is known"

There were no individual samples; it was just one single library for which 40 or so individuals were sacrificed and their RNA pooled for sequencing. So I have just one sample to work with here.

"I think @Pierre had a way to export aligned reads from a region of BAM as aligned fasta so you may not need to strictly do MSAs."

I've already generated the MSA.

All this said, I was thinking a little about approaching this from a pairwise alignment perspective. Wouldn't it be correct to assume that if two sequences are relatively low in sequence identity (say, below 90%), it is highly unlikely that they are the result of genetic variation? Would this be an acceptable heuristic? Of course, this would mean that I will have some false positives in the "genetic variation" category (since we cannot exclude the possibility that a pair of highly identical sequences is the result of gene duplication), but this is an easy heuristic, and still better than nothing?
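
A minimal sketch of this heuristic, assuming the MSA has already been loaded as a mapping from sequence name to aligned (gapped) sequence. The 90% threshold is the one proposed above; the function names, the toy sequences, and the choice to ignore gapped columns when computing identity are assumptions made for this example only.

```python
from itertools import combinations

GAP = "-"


def pairwise_identity(a, b):
    """Percent identity over alignment columns where neither row has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == GAP or y == GAP:
            continue  # assumption: gapped columns are not scored
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def classify_pairs(msa, threshold=90.0):
    """Label each pair; below `threshold` is treated as likely paralogy."""
    calls = {}
    for (n1, s1), (n2, s2) in combinations(msa.items(), 2):
        ident = pairwise_identity(s1, s2)
        label = "likely_paralogs" if ident < threshold else "ambiguous"
        calls[(n1, n2)] = (ident, label)
    return calls


# Toy aligned sequences (invented for illustration).
msa = {
    "tx1": "ATGGCCAAGTTCGA",
    "tx2": "ATGGCCAAGTTCGT",
    "tx3": "ATGACGAA-TTAGA",
}
for pair, (ident, label) in classify_pairs(msa).items():
    print(pair, round(ident, 1), label)
```

Pairs at or above the threshold are labelled "ambiguous" rather than "genetic variation", since high identity alone cannot rule out a recent duplication.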


Also thinking out loud, as I'm not an expert on phylogenetics, but if you have a transcriptome, couldn't you translate it and run something like OrthoFinder, which infers orthogroups and flags likely gene duplication events? Of course, you would have to pick a few other species to compare it to.
