Hi everyone,
I’m trying to build a species tree for the first time and I would like to clarify a few doubts regarding sequence labels and the workflow I’m following.
Data:
In my Single_Copy_Orthologue_Sequences folder I have files like:
ls
N0.HOG0000162.fa
N0.HOG0000271.fa
Example content of N0.HOG0000162.fa:
>AT3G02650.1|PACid_19663616
MLRSFLCRSQNASRNLAVTRISKKKTQTTHSLTSLSRFSYLESSGNASVRNIRFFSTSPPTEENPVSLPADEIPISSAAE...
>evm_27.model.AmTr_v1.0_scaffold00066.198
MWRYSLLRASSIRSQWLNRANPKTLASTSALSSCLEVYTNHRKNHGNPSFMSRESHSVAETSSYDGGNPSFSSNVSDGSS...
The first header corresponds to Arabidopsis.
The second header corresponds to Amborella.
My workflow was:
Step 1: OrthoFinder
orthofinder -f ./prot_longest -t 30 -o orthofinder
Step 2: MAFFT
mafft --auto --thread 30 "$file" > "$output_file"
Step 3: trimAl
trimal -in "$file" -out "$output_file" -automated1 -fasta -htmlout "${output_file%.fa}.html"
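Since both commands use "$file" and "$output_file", they presumably run inside a loop over the orthogroup files. A minimal sketch of such a loop is below. The _aln.fa suffix and the helper names aln_name, trim_name and align_and_trim are assumptions made for illustration; the _trim.fa suffix is chosen to match the pattern used in the concatenation step. The HTML report is left out to keep the sketch short.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Derive output names from an input path (suffixes are illustrative assumptions).
aln_name()  { printf '%s_aln.fa'  "${1%.fa}"; }       # X.fa     -> X_aln.fa
trim_name() { printf '%s_trim.fa' "${1%_aln.fa}"; }   # X_aln.fa -> X_trim.fa

# Align and trim every orthogroup FASTA in a directory.
# Usage (requires mafft and trimal on PATH):
#   align_and_trim Single_Copy_Orthologue_Sequences 30
align_and_trim() {
  local dir=$1 threads=$2 f aln
  for f in "$dir"/*.fa; do
    [ -e "$f" ] || continue                            # glob matched nothing
    case $f in *_aln.fa|*_trim.fa) continue ;; esac    # skip earlier outputs
    aln=$(aln_name "$f")
    mafft --auto --thread "$threads" "$f" > "$aln"
    trimal -in "$aln" -out "$(trim_name "$aln")" -automated1 -fasta
  done
}
```

Keeping the aligned and trimmed files under separate suffixes means a rerun never feeds an already trimmed alignment back into MAFFT.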
Step 4: Concatenation
Concat: https://github.com/nylander/catfasta2phyml
$CATFASTA --concatenate ${ALIGN_DIR}/*_trim.fa > $SUPERMATRIX 2> $PARTITIONS
# Clean file names and prepare partitions for IQ-TREE/RAxML (protein)
sed -i -e "s#${ALIGN_DIR}/##" -e "s/_trim.fa//" -e "s/^/PROT, /" $PARTITIONS
Example output of the supermatrix:
head supermatrix.phy
10 893113
AT3G02650.1|PACid_19663616 MLRSFLCRSQNASRNL...
evm_27.model.AmTr_v1.0_scaffold00066.198 MWRYSLLRASSIR...
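IQ-TREE takes its tip labels from the sequence names in the alignment, so listing the first column of the PHYLIP file shows what the tree will display. A minimal sketch, assuming a relaxed PHYLIP file whose first line is "ntax nsites" and whose first ntax non-empty lines carry the labels (which should hold for both sequential and interleaved layouts). The function name list_taxa and the tiny demo file are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the taxon labels of a relaxed PHYLIP alignment, one per line.
list_taxa() {
  awk 'NR == 1 { ntax = $1; next }                  # header line: ntax nsites
       NF > 0 && seen < ntax { print $1; seen++ }   # first ntax non-empty lines
      ' "$1"
}

# Demo on a two-taxon toy file that reuses the labels shown above.
demo=$(mktemp)
printf '2 5\nAT3G02650.1|PACid_19663616 MLRSF\nevm_27.model.AmTr_v1.0_scaffold00066.198 MWRYS\n' > "$demo"
list_taxa "$demo"
```

On the real file this would be called as list_taxa supermatrix.phy; if the list shows gene IDs rather than one label per species, the headers need renaming before concatenation.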
Partition file (partitions.txt):
PROT, N0.HOG0000162 = 1-560
PROT, N0.HOG0000271 = 561-1440
PROT, N0.HOG0000277 = 1441-2422
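A quick sanity check on the partition file: the ranges should be contiguous, and the last end position should equal the site count on the first line of the supermatrix. The sketch below assumes every line ends with a START-END range as in the example; check_partitions is a hypothetical helper. The demo uses only the three lines shown here, so its expected length is 2422 rather than the full 893113.

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_partitions FILE NSITES
# Exit 0 if the ranges run 1..NSITES without gaps or overlaps, 1 otherwise.
check_partitions() {
  awk -v nsites="$2" '
    NF == 0 { next }                                  # ignore blank lines
    {
      if (split($NF, r, "-") != 2) {                  # last field is START-END
        print "line " NR ": cannot parse range"; bad = 1; next
      }
      if (r[1] + 0 != prev_end + 1) {
        print "line " NR ": gap or overlap before " $NF; bad = 1
      }
      prev_end = r[2] + 0
    }
    END {
      if (prev_end != nsites + 0) {
        print "last end " prev_end " does not match " nsites; bad = 1
      }
      exit (bad ? 1 : 0)
    }' "$1"
}

# Demo with the three partition lines shown above.
parts=$(mktemp)
printf 'PROT, N0.HOG0000162 = 1-560\nPROT, N0.HOG0000271 = 561-1440\nPROT, N0.HOG0000277 = 1441-2422\n' > "$parts"
check_partitions "$parts" 2422 && echo "partitions OK"
```

For the real data the second argument would be the site count from the supermatrix header, for example check_partitions partitions.txt 893113.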
Questions:
I notice that I don’t have clear species labels in my headers, only sequence IDs.
In the resulting species tree, how will my species be labeled?
Will IQ-TREE use these IDs as taxon names?
Am I missing a step if I want readable species names (like “Amborella”) in the final tree?
I was thinking of running IQ-TREE like this:
iqtree -s supermatrix.phy -m MFP -bb 1000 -alrt 1000
Is this correct for a first species tree?
Should I consider using partitions (-spp partitions.txt) here?
Thank you very much for your help!
I have not used IQ-TREE, so this is only a thought: isn't it up to you to make sure the proteins are labeled with the names you want to see in the final output?
Arabidopsis IDs are standard and can be mapped to gene names, but the second ID looks custom and would be difficult to map (unless there is a "key" file).
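The "key" file idea could be sketched roughly as follows, applied to the per-orthogroup FASTA files before concatenation so that every file carries the same label for each species. Everything here is an assumption for illustration: the file name species_key.tsv, the ID prefixes used as keys, and the helper rename_headers. Prefix matching is only a convenience for the demo; a real key would more safely map full sequence IDs to species.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical key file: ID prefix <TAB> species label.
printf 'AT\tArabidopsis\nevm_27.model.AmTr\tAmborella\n' > "$workdir/species_key.tsv"

# Toy input that reuses the two headers from the question.
printf '>AT3G02650.1|PACid_19663616\nMLRSF\n>evm_27.model.AmTr_v1.0_scaffold00066.198\nMWRYS\n' \
  > "$workdir/N0.HOG0000162.fa"

# rename_headers KEY FASTA
# Replace each header with the species whose prefix starts the sequence ID.
# If prefixes overlap, which one wins is undefined, so keep them distinct.
rename_headers() {
  awk -F'\t' '
    NR == FNR { key[$1] = $2; next }                  # first file: load the key
    /^>/ {
      id = substr($0, 2)
      for (p in key) if (index(id, p) == 1) { print ">" key[p]; next }
      print "no species for " id > "/dev/stderr"; exit 1
    }
    { print }                                         # sequence lines unchanged
  ' "$1" "$2"
}

rename_headers "$workdir/species_key.tsv" "$workdir/N0.HOG0000162.fa"
```

Failing loudly on an unmapped ID is deliberate: a header that slips through unrenamed would otherwise show up as an extra taxon in the supermatrix.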