I am working with a directory of approximately 3,000 trimmed protein alignment files. Each alignment is a single-copy orthologue containing sequences from 7 species, obtained with OrthoFinder. The FASTA headers have the format species_geneID.
I want to find the best substitution models and then infer a maximum-likelihood SPECIES tree (so 7 branches) based on these 3,000 genes.
I am doing this for the first time so I need some help, please.
In general, what is the best approach: concatenate all alignments and work on that, OR infer 3,000 gene trees (taking the best substitution model into account for each) and then somehow combine those into a species tree (with IQ-TREE or RAxML)?
The final goal is to make a time-calibrated species tree in PAML (I already have some divergence time points) and then use it for subsequent evolutionary genomic studies (dN/dS, etc.).
So far, I have used IQ-TREE to (hopefully) calculate substitution models for every alignment with the -p flag (pointing it at the folder with the alignments), since this is what is written under the subsection "Inferring species trees":
iqtree-mpi -nt $NSLOTS -p trimmed --prefix concat -wca -B 1000
I have a couple of questions:
- Is this the command that I need? Is IQ-TREE calculating the best substitution model for each alignment here? I am a little confused by the term "partitioning scheme"...
- I got a file called concat_best_model_nex with partition information. Are those the best substitution models for every gene (alignment)?
- I got a concat.treefile, but it is not the species tree I expected, with 7 branches (one branch per species). Instead, I got a big tree containing every FASTA sequence from all the alignments (orthologous genes). I thought my FASTA headers were wrong, so I flipped them to geneID_species and got the same result.
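One possible cause, offered as an assumption rather than something confirmed by the output above: when a folder of alignments is concatenated, sequences are matched across genes by their names, so headers that carry a gene ID make every sequence look like a separate taxon. A minimal Python sketch of relabelling the headers to the species name only is below. The species names in the example, the function names, and the output folder `trimmed_renamed` are all hypothetical; the sketch also assumes the species name comes first in the header and is followed by an underscore.

```python
from pathlib import Path


def species_of(header, species_names):
    """Return the known species name that prefixes a FASTA header."""
    label = header.lstrip(">").strip()
    # Try longer names first so a name that is a prefix of another cannot win.
    for name in sorted(species_names, key=len, reverse=True):
        if label == name or label.startswith(name + "_"):
            return name
    raise ValueError(f"no known species prefix in header: {header!r}")


def relabel_fasta(text, species_names):
    """Rewrite every header in one FASTA alignment to just the species name."""
    out = []
    seen = set()
    for line in text.splitlines():
        if line.startswith(">"):
            name = species_of(line, species_names)
            # A single-copy orthologue should contain each species once.
            if name in seen:
                raise ValueError(f"species {name!r} appears more than once")
            seen.add(name)
            out.append(">" + name)
        else:
            out.append(line)
    return "\n".join(out) + "\n"


def relabel_directory(src, dst, species_names):
    """Write a relabelled copy of every file in src into dst."""
    dst = Path(dst)
    dst.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(src).iterdir()):
        if path.is_file():
            relabelled = relabel_fasta(path.read_text(), species_names)
            (dst / path.name).write_text(relabelled)
```

With the real list of 7 species names, a call such as `relabel_directory("trimmed", "trimmed_renamed", species)` would leave the originals untouched, and the same IQ-TREE command could then be pointed at the new folder to see whether the tree comes out with one tip per species.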
Note: I am testing the workflow on 8 alignments/genes only, just to be faster, but the result should be the same.