Question

Aligning sequences with multiple genetic codes!

0

7 months ago

George X. • 0

Hello everyone! I am working on a project on duplicated genes and I am struggling with how to align sequences that use different genetic codes. I work with FASTA files of protein-coding gene sequences; each file contains orthologs/paralogs from many species. Even within the same species, these genes may use different genetic codes. My goal is to build a codon alignment of the sequences and then construct neighbor-joining (N-J) trees. I do the whole process in MEGA11. I cannot split the species/genes into groups that share a genetic code, because I need all of them in the same tree. I can think of three options: (1) align the sequences as plain DNA instead of as codons; (2) align their protein sequences; or (3) edit the sequences to remove the stop codons and then use one common code for everything. With the last option I know I will lose some information, but at least I can work. Do you have any ideas on how to overcome this obstacle? Do you agree with any of these options? Thank you in advance!
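To make option (2) concrete, here is a minimal sketch of translating each sequence with its own genetic code before a protein alignment. It is not part of MEGA11; the names `translate` and `CODE` are hypothetical, and only NCBI translation tables 1 (standard) and 2 (vertebrate mitochondrial) are included as examples, so any other code would have to be added by hand.

```python
from itertools import product

# Codons in NCBI order: first base varies slowest, bases ordered T, C, A, G.
CODONS = ["".join(c) for c in product("TCAG", repeat=3)]

# Amino-acid strings for NCBI translation tables 1 and 2 (illustrative subset).
AA_BY_TABLE = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    2: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG",
}
CODE = {t: dict(zip(CODONS, aas)) for t, aas in AA_BY_TABLE.items()}


def translate(cds, table_id):
    """Translate a coding sequence in frame 1 using the given table.

    Trailing bases that do not form a full codon are ignored, and codons
    with ambiguous bases are translated as 'X'.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    table = CODE[table_id]
    return "".join(table.get(cds[i:i + 3], "X") for i in range(0, usable, 3))
```

For example, `translate("ATGTGA", 1)` ends in a stop, while `translate("ATGTGA", 2)` reads the same TGA codon as tryptophan, which is why the table has to be chosen per sequence rather than per file.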

genetic_codes codon MEGA11 alignment • 886 views

ADD COMMENT • link updated 7 months ago by 5heikki 11k • written 7 months ago by George X. • 0

1

I don't understand your problem. What do you mean when you write:

These genes even in the same species may use different genetic codes.

The only case where that would be true is if you had gDNA- and mtDNA-encoded homologs. Or, I mean, sure, there are cases where a codon may code for something other than the "default", but you wouldn't know that unless you sequenced the actual expressed protein by mass spec or something. If you have, e.g., putative stop codons in the middle of some ORF, it's very likely a pseudogene that isn't expressed and is under completely different selection pressure than actual genes (basically it just acquires random mutations). Including such genes in any analysis is just going to distort the results.
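A rough way to screen for such putative stops is sketched below. It assumes each sequence starts in frame 1 and uses the stop codons of the standard code by default; the function name `internal_stops` is hypothetical, and a hit only flags a sequence for a closer look, it does not prove that the gene is a pseudogene.

```python
# Stop codons of the standard genetic code; pass a different set for other codes.
STANDARD_STOPS = frozenset({"TAA", "TAG", "TGA"})


def internal_stops(cds, stops=STANDARD_STOPS):
    """Return 0-based codon indices of in-frame stops, ignoring the final codon."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    # The last codon is skipped because a terminal stop is expected there.
    return [i for i, codon in enumerate(codons[:-1]) if codon in stops]
```

Sequences for which this returns a non-empty list under the code you believe applies are the ones worth checking before they go into an alignment.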

ADD REPLY • link 7 months ago by 5heikki 11k

0

First of all, thank you very much for your answer! Let me give you more details about my project. I download genes that are considered paralogs from the KEGG database; there is a specific option in the database interface for that. Each of my trees includes all orthologs and paralogs from a selection of species, and every tree corresponds to one KO (database orthology code). Regarding a possible selection of genes, I see your point; however, I am unsure how I should filter my data. My main goal is to use as many of my genes as possible, so being very strict in my selection is not an option for me, but I understand that a significant number of the suggested genes might be pseudogenes. Here is an example of two genes that are considered paralogs, although one is located in the nucleus and the other in the mitochondrion: https://www.genome.jp/entry/taes:123096964 https://www.genome.jp/entry/taes:34688768 In your opinion, what are the odds that these two genes are true paralogs? This kind of example is not at all uncommon in my dataset!

ADD REPLY • link 7 months ago by George X. • 0

0

I think that the few protein-coding genes which are pretty much universally conserved in mtDNA are so for a good reason, e.g. the final product couldn't be "integrated" correctly into the mitochondrial membrane from the other side. My suggestion to you is to remove all mtDNA-encoded genes from your dataset. If the gDNA-encoded cox1 had some function in this plant, you would expect to see codon usage adaptation (over time its codon usage would become similar to that of other gDNA-encoded proteins). Do you see that?
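One crude way to look at that is to compare codon frequencies between the gene in question and a pooled set of other gDNA-encoded genes from the same species. The sketch below is only an illustration: `codon_frequencies` and `usage_distance` are hypothetical names, the distance is a simple half-L1 difference rather than an established codon adaptation index, and short genes will give noisy values.

```python
from collections import Counter


def codon_frequencies(cds):
    """Return the fraction of each codon among the in-frame codons of cds."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def usage_distance(freq_a, freq_b):
    """Half the sum of absolute frequency differences (0 = identical, 1 = disjoint)."""
    codons = set(freq_a) | set(freq_b)
    return 0.5 * sum(abs(freq_a.get(c, 0.0) - freq_b.get(c, 0.0)) for c in codons)
```

The idea would be to concatenate a reference set of nuclear coding sequences, compute its frequencies once, and then see whether the nuclear copy sits closer to that reference than the mtDNA-encoded copy does.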

ADD REPLY • link 7 months ago by 5heikki 11k

1

What is Codon Alignment?

When aligning DNA sequences, most algorithms only consider the best alignment for the nucleotide residues. The process of codon alignment considers the reading frame of the translated protein and adjusts the nucleotide alignment so that the protein alignment stays in frame.

Specifically, this tool may make the following adjustments to your input alignment:

* Add gaps to complete an incomplete codon.

* Shift nucleotides from one side of a gap to the other to produce complete codons.

* Combine nearby insertions or deletions (indels) to produce an intact reading frame. See Frameshift Compensation option below.

https://www.hiv.lanl.gov/content/sequence/CodonAlign/codonalign_explanation.html
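A related but different approach, which is not what the tool above does, is to align the proteins first and then thread each coding sequence back onto its aligned protein row, three nucleotides per residue and three gap characters per protein gap. A minimal sketch, assuming the CDS is in frame and matches the protein residue for residue (the function name `thread_codons` is hypothetical):

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Map an ungapped CDS onto a gapped protein alignment row, codon by codon."""
    residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * residues:
        raise ValueError("CDS is too short for the aligned protein")
    out, pos = [], 0
    for aa in aligned_protein:
        if aa == gap:
            out.append(gap * 3)  # one protein gap becomes a full codon gap
        else:
            out.append(cds[pos:pos + 3])  # copy the next codon unchanged
            pos += 3
    return "".join(out)
```

Because the codons are copied rather than re-translated, this step does not care which genetic code each sequence uses; the code only matters when producing the protein sequences that go into the alignment.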

ADD REPLY • link 7 months ago by pippo1980 ▴ 10