How to simplify phylip file headings
4.0 years ago

Suppose I have an interleaved PHYLIP alignment like this:

KM894618.1_Abutilon_oxycarpum_voucher_1076420545_maturase_K_(matK)_gene_partial_cds_chloroplast                          --------------------------------TCTTTGCATTTATTACGGTTCTCTCTCT
KU508975.1_Acalypha_australis_maturase_K_(matK)_gene_partial_cds_chloroplast                                             AAATTCTTCGATATTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTC
KC747175.1_Achyranthes_bidentata_bio-material_USDA                                                                       AAACTCTCCGATACTGGTTGAAAGATGCTTCTTCTTTGCATTTATTACGATTCTTTCTTT
KF632783.1_Acorus_calamus_voucher_C998_maturase_K_(matK)_gene_partial_cds_chloroplast                                    AAGTTCTGCAAGGCTGGATACAAGATGTTCCGTCTTTACATTTATTGCGGTTCTTTCTCC
JQ587494.1_Aeschynomene_americana_voucher_BioBot11660_maturase_K_(matK)_gene_partial_cds_chloroplast                     ------------------------------------------------------------
KR735146.1_Ageratum_conyzoides_maturase_K_(matK)_gene_partial_cds_chloroplast                                            ------------------------------------------------------------
GU135030.1_Alternanthera_philoxeroides_voucher_J.R._Abbott_24898_(FLAS)_maturase_K_(matK)_gene_partial_cds_chloroplast   AAACTCTCCGATACTGGTTGAAAGATGCTTCTTCTTTGCATTTATTACGATTCTTTCTTT
JF953164.1_Amaranthus_tricolor_voucher_Z31_maturase_K_(matK)_gene_partial_cds_chloroplast                                ------------------------------------------------GATACTTTCTTT
HM989726.1_Artemisia_argyi_voucher_PS0590MT04_maturase_K_(matK)_gene_partial_cds_chloroplast                             AGGCTCTTCGCTATTGGATAAAAGATGCTTCCTCTTTGCATTTATTAAGATTCTTTCTCC
KF163819.1_Arthraxon_hispidus_voucher_HCCN-PJ008548-PB-280_maturase_K_(matK)_gene_partial_cds_chloroplast                -----------------------GATGTTCCGTCTTTGCMTTTATTGCGATTCWTTCTCC
MG225316.1_Aster_alpinus_voucher_BAB-2621_maturase_K_(matK)_gene_partial_cds_chloroplast                                 -----------------------------TCCTCTTTGCATTTATTAAGATTCTTTCTCC
MF063987.1_Bassia_scoparia_voucher_20160248_maturase_K_(matK)_gene_partial_cds_chloroplast                               -----------------------------------------------CGATTCTTTCTTT
JQ412229.1_Cynodon_dactylon_voucher_BS0132_maturase_K_(matK)_gene_partial_cds_chloroplast                                ---------------------------------------------------TCTTTCTCA
JN895697.1_Bidens_tripartita_isolate_NMW088_maturase_K_(matK)_gene_partial_cds_chloroplast                               ---------------------------CTTCCTCTTTGCATTTATTAAGATTCTTTCTCC


How can I remove these headings, but only keep the species names?


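Fancier answers will be forthcoming, but this should work. Replace yourfile with your real filename, save the code in a file (here script.py), and run it as:

python3 script.py > newfile

ifname = "yourfile"  # replace with your real filename

# Assumes every line holds a long name followed by whitespace and a sequence;
# blank lines will raise an IndexError.
with open(ifname, 'r') as f:
    for line in f:
        g = line.strip().split('_')      # underscore-separated name fields
        seq = g[-1].split()              # last field: rest of name + sequence
        name = g[1] + "_" + g[2]         # genus and species
        print(name, seq[1], sep='\t')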


Should produce

Abutilon_oxycarpum      --------------------------------TCTTTGCATTTATTACGGTTCTCTCTCT
Acalypha_australis      AAATTCTTCGATATTGGCTGAAAGATCCCTCTTCTTTGCATTTATTACGACTCTTTCTTC
Achyranthes_bidentata       AAACTCTCCGATACTGGTTGAAAGATGCTTCTTCTTTGCATTTATTACGATTCTTTCTTT
Acorus_calamus      AAGTTCTGCAAGGCTGGATACAAGATGTTCCGTCTTTACATTTATTGCGGTTCTTTCTCC
Aeschynomene_americana      ------------------------------------------------------------
Ageratum_conyzoides     ------------------------------------------------------------
Alternanthera_philoxeroides     AAACTCTCCGATACTGGTTGAAAGATGCTTCTTCTTTGCATTTATTACGATTCTTTCTTT
Amaranthus_tricolor     ------------------------------------------------GATACTTTCTTT
Artemisia_argyi     AGGCTCTTCGCTATTGGATAAAAGATGCTTCCTCTTTGCATTTATTAAGATTCTTTCTCC
Arthraxon_hispidus      -----------------------GATGTTCCGTCTTTGCMTTTATTGCGATTCWTTCTCC
Aster_alpinus       -----------------------------TCCTCTTTGCATTTATTAAGATTCTTTCTCC
Bassia_scoparia     -----------------------------------------------CGATTCTTTCTTT
Cynodon_dactylon        ---------------------------------------------------TCTTTCTCA
Bidens_tripartita       ---------------------------CTTCCTCTTTGCATTTATTAAGATTCTTTCTCC
Boehmeria_nivea     ----------------GGTAAAAGACGCCTCCTCTTTGTATTTATTAAGACTTTTTCTTT


Perhaps it cannot handle an interleaved alignment; this script gave me a "list index out of range" error.


You had extra carriage returns between all lines (I am not sure if your original file has them or they were introduced when you copy/pasted the data). I took those out in my copy (and removed them from the post above as well). Try the example above again.


Thanks, I tried the script in Windows PowerShell. Now I'm going to install PyCharm to dive deeper into Python.


Strict PHYLIP format requires names of at most 10 characters, and the inconsistent spacing that this produces will throw errors. You would need to pad the deleted space with whitespace to restore the alignment.

@OP, if you have the original sequences, your life will be much easier if you edit the headers in the sequence file and then re-align.
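
A minimal sketch of the padding idea mentioned above, assuming each named line holds a long name followed by whitespace and the sequence, and that the species is the second and third underscore-separated field (as in the example in the question). The function name shorten_and_pad is hypothetical. Note that the resulting names are still longer than 10 characters, so this gives relaxed rather than strict PHYLIP.

```python
def shorten_and_pad(lines):
    """Replace long names with Genus_species and pad them to a common width."""
    rows = []
    for line in lines:
        fields = line.split()
        # Assumption: lines without a long name (blank lines, the
        # "ntaxa nsites" header, unnamed interleaved blocks) pass through.
        if len(fields) < 2 or fields[0].count("_") < 2:
            rows.append((None, line.rstrip("\n")))
            continue
        parts = fields[0].split("_")
        name = parts[1] + "_" + parts[2]
        rows.append((name, " ".join(fields[1:])))
    # Pad every name to the longest name plus two spaces
    width = max((len(n) for n, _ in rows if n is not None), default=0) + 2
    return [text if name is None else name.ljust(width) + text
            for name, text in rows]


example = [
    "KU508975.1_Acalypha_australis_maturase_K_(matK)_gene   AAATTCTTCG\n",
    "KF632783.1_Acorus_calamus_voucher_C998   AAGTTCTGCA\n",
]
for row in shorten_and_pad(example):
    print(row)
```

Because every name is padded to the same width, the sequence columns line up again.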


Use sed with a file of patterns (option -f).


I'm not familiar with this command, could you please explain more?

4.0 years ago
Joe 20k

This doesn't totally answer your query as it won't keep the species name, but a safe way to shorten the names will be to use Biopython:

E.g. if your current file was called long.phy:

>>> from Bio import AlignIO
>>> AlignIO.convert('long.phy', 'phylip-relaxed', 'short.phy', 'phylip')


Which will give:

 15 60
KM894618.1 ---------- ---------- ---------- --TCTTTGCA TTTATTACGG
KU508975.1 AAATTCTTCG ATATTGGCTG AAAGATCCCT CTTCTTTGCA TTTATTACGA
KC747175.1 AAACTCTCCG ATACTGGTTG AAAGATGCTT CTTCTTTGCA TTTATTACGA
KF632783.1 AAGTTCTGCA AGGCTGGATA CAAGATGTTC CGTCTTTACA TTTATTGCGG
JQ587494.1 ---------- ---------- ---------- ---------- ----------
KR735146.1 ---------- ---------- ---------- ---------- ----------
GU135030.1 AAACTCTCCG ATACTGGTTG AAAGATGCTT CTTCTTTGCA TTTATTACGA
JF953164.1 ---------- ---------- ---------- ---------- --------GA
HM989726.1 AGGCTCTTCG CTATTGGATA AAAGATGCTT CCTCTTTGCA TTTATTAAGA
KF163819.1 ---------- ---------- ---GATGTTC CGTCTTTGCM TTTATTGCGA
MG225316.1 ---------- ---------- ---------T CCTCTTTGCA TTTATTAAGA
MF063987.1 ---------- ---------- ---------- ---------- -------CGA
JQ412229.1 ---------- ---------- ---------- ---------- ----------
JN895697.1 ---------- ---------- -------CTT CCTCTTTGCA TTTATTAAGA
MF350103.1 ---------- ------GGTA AAAGACGCCT CCTCTTTGTA TTTATTAAGA

TTCTCTCTCT
CTCTTTCTTC
TTCTTTCTTT
TTCTTTCTCC
----------
----------
TTCTTTCTTT
TACTTTCTTT
TTCTTTCTCC
TTCWTTCTCC
TTCTTTCTCC
TTCTTTCTTT
-TCTTTCTCA
TTCTTTCTCC
CTTTTTCTTT


Many PHYLIP tools still use the strict format, which only allows 10-character names. As I mentioned in the comments, it would be much easier to make your input sequences or downstream files compliant than to fix up the PHYLIP file itself.

If you created a map file of accession numbers -> names, you could use simple sed work later on in any file if you wanted to switch between them at will.

Such as this map file (accession_mapfile.txt):

Abutilon_oxycarpum           KM894618.1
Acalypha_australis           KU508975.1
Achyranthes_bidentata        KC747175.1
Acorus_calamus               KF632783.1
Aeschynomene_americana       JQ587494.1
Ageratum_conyzoides          KR735146.1
Alternanthera_philoxeroides  GU135030.1
Amaranthus_tricolor          JF953164.1
Artemisia_argyi              HM989726.1
Arthraxon_hispidus           KF163819.1
Aster_alpinus                MG225316.1
Bassia_scoparia              MF063987.1
Cynodon_dactylon             JQ412229.1
Bidens_tripartita            JN895697.1
Boehmeria_nivea              MF350103.1

and a tree labelled with the accession numbers (mytree.tree):


(((JQ587494.1:0.0,KR735146.1:0.0):5.990225341,((JQ412229.1:0.118111457,((KF632783.1:0.079049581,KF163819.1:0.068672610)0.944:0.180951505,(MG225316.1:0.000000005,(HM989726.1:0.000000005,JN895697.1:0.000000005)0.393:0.000000005)0.894:0.088130659)0.890:0.047666678)0.323:0.000000005,KM894618.1:0.084141325)0.002:0.025703744)0.000:0.020579960,(JF953164.1:0.075087560,((KC747175.1:0.0,GU135030.1:0.0):0.000000005,MF063987.1:0.000000005)0.423:0.000000005)0.219:0.027420688,(MF350103.1:0.197253279,KU508975.1:0.074619017)0.701:0.065875748);


This general form could be used to switch out the names:

while read name accession ; do sed -i "s/$accession/$name/g" mytree.tree ; done < accession_mapfile.txt


to give:

(((Aeschynomene_americana:0.0,Ageratum_conyzoides:0.0):5.990225341,((Cynodon_dactylon:0.118111457,((Acorus_calamus:0.079049581,Arthraxon_hispidus:0.068672610)0.944:0.180951505,(Aster_alpinus:0.000000005,(Artemisia_argyi:0.000000005,Bidens_tripartita:0.000000005)0.393:0.000000005)0.894:0.088130659)0.890:0.047666678)0.323:0.000000005,Abutilon_oxycarpum:0.084141325)0.002:0.025703744)0.000:0.020579960,(Amaranthus_tricolor:0.075087560,((Achyranthes_bidentata:0.0,Alternanthera_philoxeroides:0.0):0.000000005,Bassia_scoparia:0.000000005)0.423:0.000000005)0.219:0.027420688,(Boehmeria_nivea:0.197253279,Acalypha_australis:0.074619017)0.701:0.065875748);
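
For anyone without sed (for example on Windows), the same substitution can be sketched in Python. This is a rough equivalent rather than a drop-in replacement: the function name relabel is hypothetical, and it assumes the map has the two whitespace-separated columns shown above (name, then accession). Unlike the sed loop, it treats each accession as a literal string, so the dot in KM894618.1 cannot match any other character.

```python
def relabel(text, map_lines):
    """Swap each accession found in text for its species name."""
    for line in map_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed map lines
        name, accession = fields
        text = text.replace(accession, name)
    return text


mapping = [
    "Aeschynomene_americana       JQ587494.1",
    "Ageratum_conyzoides          KR735146.1",
]
tree = "(JQ587494.1:0.0,KR735146.1:0.0);"
print(relabel(tree, mapping))
# prints (Aeschynomene_americana:0.0,Ageratum_conyzoides:0.0);
```

To apply it to real files, read the tree with open("mytree.tree").read() and pass the open map file as map_lines. Swapping the two columns in the unpacking line reverses the direction of the renaming.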


If you really are hell-bent on shortening the names, the closest/most sensible thing I can come up with, without truncating the string, is to form abbreviated Latin binomials like so:

sed -i 's/[a-z]*_/\./g' mapfile.txt


But some of these are still too long, and I still advise against doing that in the PHYLIP file unless you come up with some reliable shortening system.
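
To see which abbreviated names would still break the 10-character limit, here is a small Python sketch of the same substitution. The helper name abbreviate is made up for illustration; it mirrors the sed expression above by replacing the lowercase run before the underscore with a dot.

```python
import re


def abbreviate(binomial):
    """Turn Genus_species into G.species."""
    return re.sub(r"[a-z]*_", ".", binomial)


names = ["Aster_alpinus", "Acorus_calamus", "Alternanthera_philoxeroides"]
for name in names:
    short = abbreviate(name)
    status = "ok" if len(short) <= 10 else "too long"
    print(short, len(short), status)
```

With these three names, A.alpinus and A.calamus fit in 10 characters but A.philoxeroides does not. The two that fit also show how quickly abbreviations within one genus could collide.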

4.0 years ago

Hello.

Here is a solution using sed.

$ sed -r 's/^[^_]*_([^_]*_[^_]*)[^ \t]*(.*)$/\1\2/' phylip_alignment > shorten


What we are trying to do here is replace each line with substrings of that line. To do this, one has to define a pattern that describes each line and mark the substrings we want to keep. This is done with ^[^_]*_([^_]*_[^_]*)[^ \t]*(.*)$. The whole line is then replaced with the substrings that match the parts of the pattern in (...). The replacement is done with \1\2.
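
The same pattern carries over almost unchanged to Python's re module, which may help if sed is not available. This is only a sketch under the same assumption as the sed command, namely that the species is always the second and third underscore-separated field of the name; shorten_line is a hypothetical helper name.

```python
import re

# group 1: Genus_species; group 2: the whitespace and the sequence
PATTERN = re.compile(r"^[^_]*_([^_]*_[^_]*)[^ \t]*(.*)$")


def shorten_line(line):
    """Keep only Genus_species plus everything after the long name."""
    return PATTERN.sub(r"\1\2", line)


line = "KF632783.1_Acorus_calamus_voucher_C998_maturase_K   AAGTTCTGCA"
print(shorten_line(line))
# prints Acorus_calamus   AAGTTCTGCA
```

Lines that do not match the pattern, such as the "15 60" header, are returned unchanged.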

fin swimmer


As with the other solutions, while this does format the strings properly, it breaks PHYLIP file convention and will make the resulting phylip difficult to use with other tools.