Help with species tree using OrthoFinder, IQ-TREE and branch labeling
12 hours ago
san96 ▴ 190

Hi everyone,

I’m trying to build a species tree for the first time and I would like to clarify a few doubts regarding sequence labels and the workflow I’m following.

Data: In my Single_Copy_Orthologue_Sequences folder I have files like:

ls
N0.HOG0000162.fa
N0.HOG0000271.fa

Example content of N0.HOG0000162.fa:

>AT3G02650.1|PACid_19663616
MLRSFLCRSQNASRNLAVTRISKKKTQTTHSLTSLSRFSYLESSGNASVRNIRFFSTSPPTEENPVSLPADEIPISSAAE...
>evm_27.model.AmTr_v1.0_scaffold00066.198
MWRYSLLRASSIRSQWLNRANPKTLASTSALSSCLEVYTNHRKNHGNPSFMSRESHSVAETSSYDGGNPSFSSNVSDGSS...

The first header corresponds to Arabidopsis. The second header corresponds to Amborella.

My workflow was:

Step 1: Orthofinder
orthofinder -f ./prot_longest -t 30 -o orthofinder


Step 2: mafft 
mafft --auto --thread 30 "$file" > "$output_file"


Step 3: trimal
trimal -in "$file" -out "$output_file" -automated1 -fasta -htmlout "${output_file%.*}.html"


Step 4: Concat
Tool: catfasta2phyml (https://github.com/nylander/catfasta2phyml)
$CATFASTA --concatenate ${ALIGN_DIR}/*_trim.fa > $SUPERMATRIX 2> $PARTITIONS

# Clean file names and prepare partitions for IQ-TREE/RAxML (protein)
sed -i -e "s#${ALIGN_DIR}/##" -e "s/_trim.fa//" -e "s/^/PROT, /" $PARTITIONS

Example output of the supermatrix (head supermatrix.phy):

10 893113
AT3G02650.1|PACid_19663616    MLRSFLCRSQNASRNL...
evm_27.model.AmTr_v1.0_scaffold00066.198  MWRYSLLRASSIR...

Partition file (partitions.txt):

PROT, N0.HOG0000162 = 1-560
PROT, N0.HOG0000271 = 561-1440
PROT, N0.HOG0000277 = 1441-2422
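
Before passing a partition file like this to a tree builder, it may be worth checking that the ranges are contiguous (each partition starts one column after the previous one ends). The following is only a sketch: the function name check_partitions and the temporary demo file are made up for illustration, and the demo reuses the three partition lines shown above. The last column it reports should match the second number on the first line of supermatrix.phy once the full partition file is checked.

```shell
#!/usr/bin/env bash
# Sketch: verify that a RAxML-style partition file has contiguous ranges.
set -euo pipefail

check_partitions() {
    # $1 = partition file; prints the last covered column, exits 1 on a gap/overlap.
    awk '
        BEGIN { expected = 1 }
        {
            # Split "PROT, name = start-end" on "=" and then the range on "-".
            if (split($0, parts, "=") != 2) { bad = 1; exit 1 }
            split(parts[2], range, "-")
            start = range[1] + 0
            end   = range[2] + 0
            if (start != expected || end < start) {
                printf "Line %d: expected start %d, found %d-%d\n", NR, expected, start, end > "/dev/stderr"
                bad = 1
                exit 1
            }
            expected = end + 1
        }
        END { if (bad) exit 1; print expected - 1 }
    ' "$1"
}

# Demo with the three partition lines from the post (hypothetical temp file).
workdir=$(mktemp -d)
cat > "$workdir/partitions.txt" <<'EOF'
PROT, N0.HOG0000162 = 1-560
PROT, N0.HOG0000271 = 561-1440
PROT, N0.HOG0000277 = 1441-2422
EOF

last_col=$(check_partitions "$workdir/partitions.txt")
echo "Partitions are contiguous and end at column $last_col"
```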

Questions:

I notice that I don’t have clear species labels in my headers, only sequence IDs.

In the resulting species tree, how will my species be labeled?

Will IQ-TREE use these IDs as taxon names?

Am I missing a step if I want readable species names (like “Amborella”) in the final tree?

I was thinking of running IQ-TREE like this:

iqtree -s supermatrix.phy -m MFP -bb 1000 -alrt 1000

Is this correct for a first species tree?

Should I consider using partitions (-spp partitions.txt) here?

Thank you very much for your help!

iqtree trimal CAFE5 orthofinder mafft • 329 views

Thank you so much.


"I don’t have clear species labels in my headers, only sequence IDs."

I have not used IQ-TREE, so this is only a thought: isn't it your responsibility to make sure that the proteins are labeled with the names you want to show up in the final output?

Arabidopsis IDs are standard and can be mapped to gene names, but the second ID looks custom and would be difficult to map (unless there is a "key" file).
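
If such a key file exists or can be written by hand, the relabeling could be done before alignment and concatenation, so that every orthogroup file uses the same label for the same species. Here is a minimal sketch, not a tested pipeline step: the key file species_key.tsv, its two prefixes, and the function name relabel_fasta are all assumptions for illustration, and the demo sequences are shortened versions of the two from the question. A real key would need prefixes that are unambiguous across all ten species.

```shell
#!/usr/bin/env bash
# Sketch: replace FASTA headers with species names using a two-column key file.
set -euo pipefail

workdir=$(mktemp -d)

# Hypothetical key file: header prefix <TAB> species label.
printf 'AT\tArabidopsis\nevm_27.model.AmTr\tAmborella\n' > "$workdir/species_key.tsv"

# Tiny demo input using the two headers from the question (sequences shortened).
cat > "$workdir/N0.HOG0000162.fa" <<'EOF'
>AT3G02650.1|PACid_19663616
MLRSFLCRSQ
>evm_27.model.AmTr_v1.0_scaffold00066.198
MWRYSLLRAS
EOF

relabel_fasta() {
    # $1 = key file, $2 = FASTA file; writes the relabeled FASTA to stdout.
    awk -F'\t' '
        NR == FNR { prefix[++n] = $1; label[n] = $2; next }   # load the key file
        /^>/ {
            id = substr($0, 2)
            for (i = 1; i <= n; i++) {
                # A header matches when it starts with the prefix.
                if (index(id, prefix[i]) == 1) { print ">" label[i]; next }
            }
            print "No species found for header: " id > "/dev/stderr"
            exit 1
        }
        { print }   # sequence lines pass through unchanged
    ' "$1" "$2"
}

relabel_fasta "$workdir/species_key.tsv" "$workdir/N0.HOG0000162.fa" \
    > "$workdir/N0.HOG0000162.renamed.fa"
cat "$workdir/N0.HOG0000162.renamed.fa"
```

The script stops with an error when a header matches no prefix, which seems safer than silently leaving a raw sequence ID in one file, because concatenation tools match sequences across files by their labels.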
