Help with species tree with orthofinder, iqtree and branch labeling
10 weeks ago
san96 ▴ 190

Hi everyone,

I’m trying to build a species tree for the first time and I would like to clarify a few doubts regarding sequence labels and the workflow I’m following.

Data: In my Single_Copy_Orthologue_Sequences folder I have files like:

ls
N0.HOG0000162.fa
N0.HOG0000271.fa

Example content of N0.HOG0000162.fa:

>AT3G02650.1|PACid_19663616
MLRSFLCRSQNASRNLAVTRISKKKTQTTHSLTSLSRFSYLESSGNASVRNIRFFSTSPPTEENPVSLPADEIPISSAAE...
>evm_27.model.AmTr_v1.0_scaffold00066.198
MWRYSLLRASSIRSQWLNRANPKTLASTSALSSCLEVYTNHRKNHGNPSFMSRESHSVAETSSYDGGNPSFSSNVSDGSS...

The first header corresponds to Arabidopsis; the second corresponds to Amborella.

My workflow was:

Step 1: Orthofinder
orthofinder -f ./prot_longest -t 30 -o orthofinder


Step 2: mafft 
mafft --auto --thread 30 "$file" > "$output_file"


Step 3: trimal
trimal -in "$file" -out "$output_file" -automated1 -fasta -htmlout "${output_file}.html"


Step 4: Concat
Concat: https://github.com/nylander/catfasta2phyml
$CATFASTA --concatenate ${ALIGN_DIR}/*_trim.fa > $SUPERMATRIX 2> $PARTITIONS

# Clean file names and prepare partitions for IQ-TREE/RAxML (protein)
sed -i -e "s#${ALIGN_DIR}/##" -e "s/_trim.fa//" -e "s/^/PROT, /" $PARTITIONS

Example output of the supermatrix (head supermatrix.phy):

10 893113
AT3G02650.1|PACid_19663616    MLRSFLCRSQNASRNL...
evm_27.model.AmTr_v1.0_scaffold00066.198  MWRYSLLRASSIR...

Partition file (partitions.txt):

PROT, N0.HOG0000162 = 1-560
PROT, N0.HOG0000271 = 561-1440
PROT, N0.HOG0000277 = 1441-2422

Questions:

I notice that I don’t have clear species labels in my headers, only sequence IDs.

In the resulting species tree, how will my species be labeled?

Will IQ-TREE use these IDs as taxon names?

Am I missing a step if I want readable species names (like “Amborella”) in the final tree?

I was thinking of running IQ-TREE like this:

iqtree -s supermatrix.phy -m MFP -bb 1000 -alrt 1000

Is this correct for a first species tree?

Should I consider using partitions (-spp partitions.txt) here?

Thank you very much for your help!

iqtree trimal CAFE5 orthofinder mafft • 5.9k views

Thank you so much.


I don’t have clear species labels in my headers, only sequence IDs.

I have not used iqtree, so this is only a thought ... should it be your responsibility to make sure that the proteins are labeled with names you want to show up in the final output?

Arabidopsis IDs are standard and can be mapped to gene names, but the second ID looks custom and would be difficult to map (unless there is a "key" file).

13 days ago
Kevin Blighe ★ 90k

In your current workflow, the sequence IDs in the FASTA headers, such as "AT3G02650.1|PACid_19663616" and "evm_27.model.AmTr_v1.0_scaffold00066.198", will serve as the taxon names in the resulting species tree. IQ-TREE treats these headers as the labels for the leaves in the phylogenetic tree. This means the tree branches will be labeled with these sequence IDs instead of readable species names like "Arabidopsis" or "Amborella".
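To preview the labels that would end up on the tips, you can read them straight from the supermatrix. This is a minimal sketch, assuming a relaxed PHYLIP file whose first line is "ntax nchar" and whose next ntax non-empty lines each start with a label (true for the head output shown in the question); the function name phylip_taxon_labels is made up for illustration.

```python
def phylip_taxon_labels(path):
    """Return the taxon labels found in a relaxed PHYLIP alignment."""
    with open(path) as handle:
        # Drop blank lines so interleaved block separators are ignored.
        lines = [line for line in handle if line.strip()]
    n_taxa = int(lines[0].split()[0])
    # The first n_taxa data lines start with the label, followed by
    # whitespace and sequence data.
    return [line.split()[0] for line in lines[1:1 + n_taxa]]
```

Calling phylip_taxon_labels("supermatrix.phy") on the file from the question should list the raw sequence IDs rather than species names.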

You are missing a step to achieve readable species names. Before concatenation, replace the sequence IDs in each aligned FASTA file with the corresponding species name. Since these are single-copy orthologs from OrthoFinder, each file contains one sequence per species. Use a script to map and rename the headers. For example, create a mapping file (e.g., species_map.txt) with one tab-separated line per sequence ID across all orthogroup files, like:

AT3G02650.1|PACid_19663616  Arabidopsis
evm_27.model.AmTr_v1.0_scaffold00066.198    Amborella
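Writing that file by hand is impractical for hundreds of orthogroups, so it can be generated from the proteomes given to OrthoFinder. The sketch below assumes each FASTA file in the input directory (./prot_longest in the question) is named after its species, for example a hypothetical Arabidopsis.fa, and that the sequence ID is the first whitespace-delimited word of each header. The function name build_species_map is illustrative only.

```python
from pathlib import Path

def build_species_map(proteome_dir, out_path="species_map.txt"):
    """Write 'sequence_id<TAB>species' for every header in each proteome."""
    n_written = 0
    with open(out_path, "w") as out:
        for fasta in sorted(Path(proteome_dir).iterdir()):
            if fasta.suffix not in {".fa", ".faa", ".fasta"}:
                continue
            species = fasta.stem  # assumption: file name is the species label
            with open(fasta) as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        if fields:  # skip empty headers
                            out.write(f"{fields[0]}\t{species}\n")
                            n_written += 1
    return n_written
```

If your proteome files carry other names, rename them first or adjust the species assignment line.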

Then, use a tool like sed or a Python script to rename headers in all *_trim.fa files. Here is an example Python script:

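import sys
from Bio import SeqIO

# Load the tab-separated mapping: sequence ID -> species name
mapping = {}
with open('species_map.txt') as f:
    for line in f:
        if not line.strip():
            continue  # skip blank lines
        seq_id, species = line.rstrip('\n').split('\t')
        mapping[seq_id] = species

# Rewrite each header that has a mapping; leave the others unchanged
for record in SeqIO.parse(sys.argv[1], 'fasta'):
    if record.id in mapping:
        record.id = mapping[record.id]
        record.description = ''
    # format() already ends with a newline, so do not let print() add another
    sys.stdout.write(record.format('fasta'))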

Run it as:

python rename_headers.py input_trim.fa > output_renamed.fa

Repeat for all files, then concatenate.
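If you would rather process every alignment in one go, something like the following could work. It is a sketch using only the standard library (no Biopython), assuming the tab-separated species_map.txt described above and the *_trim.fa naming from the question; the function names are hypothetical. It stops with an error when an ID is missing from the map or when a species appears twice in one file, since either would break the concatenation.

```python
from pathlib import Path

def load_map(map_path):
    """Read 'sequence_id<TAB>species' lines into a dict."""
    mapping = {}
    with open(map_path) as handle:
        for line in handle:
            if line.strip():
                seq_id, species = line.rstrip("\n").split("\t")
                mapping[seq_id] = species
    return mapping

def rename_alignment(in_path, out_path, mapping):
    """Rewrite FASTA headers to species names; return the species seen."""
    seen = set()
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Keep only the ID; text after the first space is dropped.
                seq_id = line[1:].split()[0]
                species = mapping[seq_id]  # KeyError flags an unmapped ID
                if species in seen:
                    raise ValueError(f"{in_path}: {species} appears twice")
                seen.add(species)
                line = f">{species}\n"
            dst.write(line)
    return seen

def rename_all(align_dir, out_dir, map_path, pattern="*_trim.fa"):
    """Rename every matching alignment into out_dir, keeping file names."""
    mapping = load_map(map_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(align_dir).glob(pattern)):
        rename_alignment(fasta, out_dir / fasta.name, mapping)
```

Use an output directory different from the input one, then point ALIGN_DIR at it when concatenating.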

Your proposed IQ-TREE command is suitable for a first species tree, as it uses ModelFinder Plus (-m MFP) for automatic model selection, ultrafast bootstraps (-bb 1000), and SH-aLRT branch tests (-alrt 1000). However, for better accuracy with a concatenated supermatrix, use partitions to allow different evolutionary models per orthogroup. Modify your command to:

iqtree -s supermatrix.phy -p partitions.txt -m MFP -bb 1000 -alrt 1000

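Note that -p is the IQ-TREE 2 option and corresponds to -spp in IQ-TREE 1: an edge-linked model in which all partitions share one set of branch lengths, each rescaled by its own rate. If you want fully partition-specific branch lengths, use -Q in IQ-TREE 2 (-sp in IQ-TREE 1), but this adds many parameters and increases computation time. Ensure your partitions.txt is in a format IQ-TREE accepts: it reads RAxML-style files, where the first field on each line names a data type or model (the "PROT," in yours), as well as NEXUS files. With -m MFP, ModelFinder should select the model for each partition.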
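If the RAxML-style file gives you trouble, IQ-TREE also reads NEXUS partition files, which contain only charset definitions and leave model choice to -m MFP. Below is a rough sketch of a converter, assuming lines of the form "PROT, name = start-end" as in the question; the function name raxml_to_nexus is made up, and replacing non-alphanumeric characters in partition names with underscores is a precaution on my part, not a documented requirement.

```python
import re

def raxml_to_nexus(raxml_path, nexus_path):
    """Convert 'TYPE, name = start-end' lines to NEXUS charset statements."""
    charsets = []
    with open(raxml_path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            left, ranges = line.split("=", 1)
            # Drop the leading data type / model field before the comma.
            name = left.split(",", 1)[-1].strip()
            # Precaution: keep only letters, digits and underscores.
            name = re.sub(r"\W", "_", name)
            charsets.append((name, ranges.strip()))
    with open(nexus_path, "w") as out:
        out.write("#nexus\nbegin sets;\n")
        for name, ranges in charsets:
            out.write(f"    charset {name} = {ranges};\n")
        out.write("end;\n")
    return charsets
```

The resulting file, for example partitions.nex, would then be passed to -p in place of partitions.txt.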

Kevin
