Question

How to make a phylogenetic tree from the ground up

0

8.0 years ago

n00bgenome ▴ 40

Hi,

I have some follow-up questions to this for the very n00b user. I'd like to make a phylogenetic tree of a specific type of sigma factor. If, for example, I search for rpoD in KEGG (the gene name for the primary housekeeping sigma factor in E. coli), thousands of genes pop up. Most of these genes have a "motif" that describes a group 2 and a group 4 region of the sigma factor. What I would like to do is search all of these genes based on their group 2 and group 4 amino acid sequences and bin them according to similarity, so that these thousands of genes from many organisms end up in 10 or so classes based on the similarity of BOTH the group 2 and group 4 sequences.

I have downloaded MEGA and SeaView, and from their descriptions both appear able to build this kind of tree, but I'm having a very hard time getting started. A little direction would be most appreciated. How do I "collect" all of the sigma factors that will go into the tree I want to construct? And how do I search for just the group 2 and group 4 amino acid sequences and score them by similarity?

Thank you to all in advance!

phylogeny tree • 1.7k views

1

You could start by searching at NCBI. See the protein/protein cluster hits in the right column. You can collect sequences from those two groups based on criteria you choose (what kind of organisms etc).

Briefly --> Select one of the clusters --> Click on the name of the cluster to open the page for that cluster --> Click on "Protein" link under Related Information at top right of the page --> On "Proteins" page that opens select all entries --> Click on "Summary" at top left of the page --> Change to "Fasta" --> Use "Send to" button and select "file" to send the sequences to a file that you can save. This file then can be used as input for your alignment/tree construction in MEGA.
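
Before loading the saved file into MEGA, it can help to check that the download is complete. The following is a minimal sketch, assuming the file is ordinary FASTA text; the function names and the example records are made up for illustration, and duplicate headers would overwrite each other in this simple version.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, in first-seen order."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequences may wrap over several lines
    return records


def summarize(records: Dict[str, str]) -> Dict[str, int]:
    """Report how many sequences were read and their length range."""
    lengths = [len(seq) for seq in records.values()]
    return {
        "count": len(lengths),
        "shortest": min(lengths, default=0),
        "longest": max(lengths, default=0),
    }


# Hypothetical two-record example standing in for the downloaded file.
example = ">seq1 example protein\nMKTLLV\nAAG\n>seq2 example protein\nMKT\n"
print(summarize(read_fasta(example.splitlines())))
```

For a real download, an open file handle can be passed to `read_fasta` in place of the example lines.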

Is this a homework/assignment question?

ADD REPLY • link 8.0 years ago by GenoMax 141k

0

Hi genomemax2,

Thanks for the reply! I'm actually beginning a research project, so while I don't need to grab every conceivable genome, I do need good coverage of the representative genomes that are available. There are some great papers where this has been done, but they aren't detailed enough on the "how" for the n00b user (understandably).

Your advice was very helpful, and it was great to actually see a tree. But NCBI doesn't seem to have enough coverage (only a few hundred sequences)? On KEGG, if I search for rpoD, select one of the hundreds of organisms (http://www.genome.jp/dbget-bin/www_bget?ko:K03086) and pick the Group 2 protein motif, it pops up 119,000 matches (http://www.genome.jp/dbget-bin/www_bget?pf:Sigma70_r2). Getting this into a FASTA file is one big challenge. Then I need to do the same thing with the Group 4 protein motif, and query the database to rank the similarity between organisms' Group 2 and Group 4 sites (e.g., the same Group 2 site but a different Group 4 site). Any further guidance?
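
One possible route from a KEGG orthology entry to a FASTA file is the KEGG REST service. The sketch below rests on several assumptions that should be checked against the KEGG REST documentation before use: that `link/genes/<KO>` returns tab-separated pairs of KO and gene IDs, that `get/<ids>/aaseq` returns FASTA text, and that a `get` request accepts at most 10 entries. The helper names are made up for illustration. Note that this collects the genes assigned to K03086, not the 119,000 motif matches.

```python
from typing import List
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"
MAX_IDS_PER_GET = 10  # assumed per-request limit for "get"; verify in the KEGG docs


def link_genes_url(ko_id: str) -> str:
    """URL that lists the genes assigned to one KO entry, e.g. K03086."""
    return f"{KEGG_REST}/link/genes/{ko_id}"


def parse_link_table(text: str) -> List[str]:
    """Pull the gene IDs (second column) out of the tab-separated link output."""
    genes = []
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) == 2:
            genes.append(parts[1].strip())
    return genes


def aaseq_urls(gene_ids: List[str], batch_size: int = MAX_IDS_PER_GET) -> List[str]:
    """Build one amino-acid FASTA request URL per batch of gene IDs."""
    urls = []
    for start in range(0, len(gene_ids), batch_size):
        batch = gene_ids[start:start + batch_size]
        urls.append(f"{KEGG_REST}/get/{'+'.join(batch)}/aaseq")
    return urls


def fetch_text(url: str) -> str:
    """Download one URL as text (performs a network request when called)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")
```

A run would call `fetch_text(link_genes_url("K03086"))`, pass the result through `parse_link_table`, then fetch each URL from `aaseq_urls` and append the text to one file. With thousands of genes this means hundreds of requests, so pausing between them is a reasonable courtesy to the server.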

ADD REPLY • link 8.0 years ago by n00bgenome ▴ 40

0

If you were to move one line up to "proteins", you would see that there are almost 78K sequences. So that should be plenty for you.
As you have discovered below, you probably don't need each and every one of these sequences, since many are likely identical. If you really want to make an alignment/tree from an enormous number of sequences, then you may need to move this analysis to a server with a good bit of RAM/processor power and use T-Coffee, MUSCLE, PHYLIP, etc.
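
Since many of the sequences are likely identical, collapsing exact duplicates before alignment is a cheap way to shrink the input. Here is a minimal sketch, assuming the records are already in memory as a header-to-sequence dictionary; the function name is made up, and it only merges exact matches (ignoring case and a trailing `*`), not near-identical sequences.

```python
from typing import Dict, List, Tuple


def collapse_identical(
    records: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Keep one representative per distinct sequence.

    Returns (representatives, members). `members` maps each kept header
    to every header that carried the same sequence, itself included.
    """
    first_seen: Dict[str, str] = {}     # normalized sequence -> kept header
    members: Dict[str, List[str]] = {}  # kept header -> all matching headers
    for header, seq in records.items():
        key = seq.upper().rstrip("*")   # ignore case and a trailing stop symbol
        if key not in first_seen:
            first_seen[key] = header
            members[header] = []
        members[first_seen[key]].append(header)
    representatives = {header: records[header] for header in members}
    return representatives, members


# Hypothetical records: three of the four carry the same protein.
records = {"orgA": "MKTLLV", "orgB": "mktllv", "orgC": "MKTLIV", "orgD": "MKTLLV*"}
representatives, members = collapse_identical(records)
print(len(records), "->", len(representatives))
```

Keeping the `members` map means the organisms behind each representative can still be listed on the finished tree.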

ADD REPLY • link 8.0 years ago by GenoMax 141k

0

So I guess the question boils down to the following:

How do you do a multiple sequence alignment on two regions (Group 2 and Group 4 of sigma factors) with broad coverage across genomes, and output the result as FASTA files?

ADD REPLY • link 8.0 years ago by n00bgenome ▴ 40

score 0 · Answer 1 · 2016-04-27

So I can now partially answer this question.

From http://pfam.xfam.org/, you can get the protein family you are interested in. For the group 2 sigma factor, it matched ~20,000 sigma factors. The domains tab even shows the different architectures it comes in (with only Group 4 sigma factors, like the ECF subfamily, or with Group 3 & 4, etc.). It also shows a phylogenetic tree of the group 2 sigma factor.

However, it will not split the data by architecture, so I cannot download just the sequences that have only the Group 2 and Group 4 architecture. I can download all ~20,000 Group 2 sequences into MEGA or SeaView, but neither of them can align them because each reports that it is out of memory (how much would I need?). Even if I could get the alignment to run, I guess my next step would be to parse out only the sequences that have a Group 4 sequence and no other Groups, but how would I do that?
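
One way to approach the last question is to annotate the downloaded sequences yourself and filter on the result. The sketch below assumes the sequences have been scanned against Pfam with HMMER's `hmmscan` using `--domtblout`, so that each non-comment line starts with the domain name and carries the sequence name in the fourth whitespace-separated column. `Sigma70_r2` is the Pfam name from the KEGG link above; `Sigma70_r4` and `Sigma70_r3` are assumed names for the other regions, and the function names are made up. No E-value cutoff is applied here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def domains_per_sequence(domtbl_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Collect the set of domain names reported for each sequence."""
    found = defaultdict(set)
    for line in domtbl_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the table's comment lines
        fields = line.split()
        domain_name, sequence_id = fields[0], fields[3]
        found[sequence_id].add(domain_name)
    return dict(found)


def with_exact_architecture(
    domains: Dict[str, Set[str]], wanted: Iterable[str]
) -> List[str]:
    """Return IDs of sequences whose domain set is exactly `wanted`."""
    wanted_set = set(wanted)
    return sorted(
        seq_id for seq_id, names in domains.items() if names == wanted_set
    )


# Hypothetical table rows, truncated to the columns this sketch reads.
rows = [
    "# target name  accession  tlen  query name",
    "Sigma70_r2  -  71  seqA",
    "Sigma70_r4  -  50  seqA",
    "Sigma70_r2  -  71  seqB",
    "Sigma70_r3  -  78  seqB",
    "Sigma70_r4  -  50  seqB",
]
print(with_exact_architecture(domains_per_sequence(rows), ["Sigma70_r2", "Sigma70_r4"]))
```

The returned IDs can then be used to pull the matching records out of the full FASTA file, which also cuts down the number of sequences the aligner has to hold in memory.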