Question

Workflow for big gene family analysis cross-species

4.0 years ago

lessismore ★ 1.3k

Dear all,

I'm dealing with a large set of protein sequences from the same transcription factor family, taken from distantly related organism families. I'd like to know which good practices you commonly follow for this kind of pipeline, as I've seen that papers are often very vague when they describe it in their methods.

My analysis so far:
- An HMM search to identify putative family members in my target species.
- Filtering each gene to keep only its longest variant.
- A search against the Conserved Domain Database (CDD; by the way, have you used it? With concise or full output?), which I use to keep only sequences with complete domains and a significant hit for specific domain types.
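The longest-variant filtering step could be sketched roughly as follows. This is only an illustration, not the exact script used: it assumes FASTA headers whose first token looks like `GENE.ISOFORM` (for example `geneA.2`), and the helper names `read_fasta`, `gene_id_from_header`, and `longest_variant_per_gene` are hypothetical. If your annotation encodes the gene differently, only `gene_id_from_header` would need to change.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def gene_id_from_header(header):
    """Derive a gene ID from a header.

    ASSUMPTION: the first whitespace-separated token is "GENE.ISOFORM";
    everything before the last dot is treated as the gene ID.
    """
    seq_id = header.split()[0]
    return seq_id.rsplit(".", 1)[0] if "." in seq_id else seq_id


def longest_variant_per_gene(records):
    """Keep one record per gene: the longest sequence (first seen wins ties)."""
    best = {}
    for header, seq in records:
        gene = gene_id_from_header(header)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return list(best.values())


example = """>geneA.1
MKT
>geneA.2
MKTAYI
>geneB.1
MSS
"""
kept = longest_variant_per_gene(read_fasta(example))
for header, seq in kept:
    print(header, len(seq))
```

In this toy input, `geneA.2` is kept over `geneA.1` because it is longer, and `geneB.1` is kept as the only variant of its gene.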

Now I have a few questions:

Alignment
- Which algorithm and software do you recommend for the alignment?

Post-alignment processing
- After the alignment, do you trim it to keep only the TF binding domain, to make tree construction easier?

Phylogeny
- What program do you recommend for tree construction with >500 sequences?
- Which algorithms do you recommend for tree construction?
- How many bootstrap replicates?
- Do you suggest collapsing nodes by bootstrap support, e.g. keeping only those with support >70?

Tree annotation and publication-ready figures
- What program do you use for annotating the tree?
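For illustration, collapsing by bootstrap support could look like the sketch below. It is a simplified example under stated assumptions: the tree is a hypothetical nested-dict structure (`name`, `support`, `children`) rather than the output of any particular phylogenetics package, branch lengths are ignored, and the function name `collapse_low_support` and the default threshold of 70 are only placeholders.

```python
def collapse_low_support(node, threshold=70):
    """Return a copy of the tree in which internal nodes with support below
    `threshold` are dissolved, attaching their children to the parent.

    ASSUMPTION: nodes are dicts with optional "name", "support" and
    "children" keys. The root itself is never collapsed.
    """
    children = node.get("children", [])
    if not children:
        return dict(node)  # leaf: nothing to collapse

    new_children = []
    for child in children:
        collapsed = collapse_low_support(child, threshold)
        is_internal = bool(collapsed.get("children"))
        support = collapsed.get("support")
        if is_internal and support is not None and support < threshold:
            # Weakly supported split: promote its children (creates a polytomy).
            new_children.extend(collapsed["children"])
        else:
            new_children.append(collapsed)

    result = dict(node)
    result["children"] = new_children
    return result


tree = {
    "children": [
        {"support": 95, "children": [{"name": "A"}, {"name": "B"}]},
        {"support": 40, "children": [{"name": "C"}, {"name": "D"}]},
    ]
}
collapsed_tree = collapse_low_support(tree, threshold=70)
print(len(collapsed_tree["children"]))
```

Here the clade (A, B) with support 95 is kept, while the clade (C, D) with support 40 is dissolved, so C and D attach directly to the root and the root ends up with three children.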

If you can answer even one or a few of these questions, that would already help a lot.
Thanks in advance.

phylogeny gene families • 638 views

updated 4.0 years ago by Jean-Karim Heriche 27k • written 4.0 years ago by lessismore ★ 1.3k

score 0 · Answer 1 · 2020-05-15

4.0 years ago

Jean-Karim Heriche 27k

To build trees, you could use Heng Li's TreeBeST, which is used by Ensembl Compara. You can also look at Ensembl Compara to see what their pipeline looks like.
