Question

How to construct a tree with MEGA when some of the sequences are very short

1

9.5 years ago

vharshavardhanan ▴ 30

Hi, I'm a beginner with the MEGA software. I have 89 protein sequences for which I need to construct a phylogenetic tree using the bootstrap method with 1000 replications and the data set parameter set to complete deletion. However, I am not able to construct a tree because 3 of the sequences are much shorter than the other 86. I also tried deleting the non-conserved regions from all the protein sequences, but I still cannot get a tree because the shorter proteins become even shorter. Kindly help me solve this problem.

Mega Phylogenetic Tree Protein • 7.4k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by vharshavardhanan ▴ 30

2

Get rid of the short proteins. If they are short, they cannot be aligned and do not contribute information anyhow.

ADD REPLY • link 9.5 years ago by Istvan Albert 100k

0

The 3 short protein sequences are upregulated under abiotic stresses. Is it OK if I omit these sequences, given that they have a role in abiotic stress?

Moreover, I have selected these 3 proteins for my experiments, which are ongoing with RT-PCR and real-time PCR. So is there any possibility of including these 3 sequences?

ADD REPLY • link 9.5 years ago by vharshavardhanan ▴ 30

0

When doing science you can easily end up with unsolvable situations. In that case you have to find something else to move forward.

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Istvan Albert 100k

Ram · Accepted Answer · 2014-11-12

Regardless of the approach or program you are using, the input for any phylogenetic estimation is an alignment, i.e., an inference of homology. Your sequences must therefore share ancestry before you can even begin to infer a phylogeny. If the sequences are shorter but homologous, a multiple sequence alignment (of nucleotides, of amino acids, or of both via a translation alignment for protein-coding sequences) ought to accommodate the length differences by introducing gaps that represent insertions or deletions. It sounds like you're not doing this. When you say

"The 3 short protein sequence are upregulated in abiotic stresses. Is it ok if I omit the sequence because they have role in abiotic stresses?"

it suggests that your dataset may consist of several different proteins rather than the same protein across samples, which is a completely inappropriate input for phylogenetic techniques.

In other words, your workflow would be:

1. Construct a dataset of the same locus across all samples.
2. Align the amino acids or nucleotides.
3. Select a model for ML analysis, or choose NJ distance corrections/uncorrected NJ/UPGMA/etc.
4. [If you decide to use a model: with an appropriate model, run any likelihood (maximum likelihood or Bayesian) approach.]
5. Bootstrap etc. for support.
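To illustrate what the bootstrapping step operates on, here is a minimal sketch of how a single bootstrap replicate is usually built: alignment columns are resampled with replacement until the replicate is as long as the original alignment. The function name `bootstrap_columns` and the toy alignment are hypothetical, made up for illustration. MEGA does this internally, and this is not its code.

```python
import random

def bootstrap_columns(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[n]) != length for n in names):
        raise ValueError("aligned sequences must all have the same length")
    # Draw as many column indices as the alignment has columns.
    picks = [rng.randrange(length) for _ in range(length)]
    # Apply the same picks to every sequence so columns stay intact.
    return {n: "".join(alignment[n][i] for i in picks) for n in names}

# Hypothetical toy alignment; real input would come from your aligner.
toy_alignment = {
    "taxon_A": "MKTAYIAKQRQI",
    "taxon_B": "MKTAYLAKQRQL",
    "taxon_C": "MRTAYIAKQKQI",
}

rng = random.Random(1)  # fixed seed only to make the sketch repeatable
replicates = [bootstrap_columns(toy_alignment, rng) for _ in range(1000)]
```

A tree is then estimated from each replicate, and the support for a branch is the fraction of replicate trees that contain it. Because every replicate is drawn from the aligned columns, the quality of the alignment limits what the bootstrap can tell you.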

If you do have sequences with a shared history, I would follow Istvan Albert's recommendation and remove the short sequences if they are truly unalignable.
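One way to judge how much the short sequences cost you is to count the alignment columns that survive complete deletion with and without them. The sketch below assumes that complete deletion drops every column in which any sequence has a gap or missing-data symbol (taken here to be `-` and `?`). The function name and the toy alignment are hypothetical.

```python
def complete_deletion_sites(alignment):
    """Return indices of columns where no sequence has a gap or missing symbol."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must all have the same length")
    gap_chars = {"-", "?"}  # assumption: these mark gaps / missing data
    return [i for i in range(length)
            if all(s[i] not in gap_chars for s in seqs)]

# Hypothetical toy alignment: three full-length sequences and one short one.
alignment = {
    "full_1":  "MKTAYIAKQRQISFVK",
    "full_2":  "MKTAYLAKQRQLSFVK",
    "full_3":  "MRTAYIAKQKQISFIK",
    "short_1": "-----IAKQ-------",
}

kept_all = complete_deletion_sites(alignment)
kept_without_short = complete_deletion_sites(
    {name: seq for name, seq in alignment.items()
     if not name.startswith("short")}
)
print(len(kept_all), len(kept_without_short))  # prints: 4 16
```

In this toy case a single short sequence reduces the usable sites from 16 to 4. If your real alignment shows a similar collapse, either remove the short sequences or use a gap treatment other than complete deletion.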