Question: Statistics of amino acid usage of multiple aligned sequences?
5 weeks ago, johnnytam100 wrote:

Is there a tool that computes amino acid usage statistics for a multiple sequence alignment?

I can do it in Excel, but I would like to know whether a tool exists that does the job conveniently.

Thanks!


If the input files are FASTA, then you could narrow your search to a tool that summarizes amino acid usage in FASTA files.

written 5 weeks ago by jean.elbers

Yes, the input is an aligned .fasta file.

written 5 weeks ago by johnnytam100

What is it exactly you want to know?

Are these hypothetical nucleotide or protein alignments?

written 5 weeks ago by jrj.healey

For each aligned position, I want to know the percentage of each amino acid, e.g. 80% A, 10% G, 10% V.

written 5 weeks ago by johnnytam100
5 weeks ago, jrj.healey (United Kingdom) wrote:

You can probably use something like Jalview to get what you want, but to my mind, it doesn't get much easier than:

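from Bio import AlignIO
from collections import Counter
import sys

# Read the alignment; pass 'fasta' instead of 'phylip' for an aligned FASTA file.
aln = AlignIO.read(sys.argv[1], 'phylip')

# Count the residues (gaps included) in each alignment column.
for i in range(aln.get_alignment_length()):
    print(Counter(aln[:, i]))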

For example, given this input alignment:

    16    149
PAU_02775  MSTTPEQIAV EYPIPTYRFV VSLGDEQIPF NSVSGLDISH DVIEYKDGTG
PLT_01696  MSTTPEQIAV EYPIPTYRFV VSIGDEQIPF NSVSGLDISH DVIEYKDGTG
PAK_02606  MSTTPEQIAV EYPIPTYRFV VSIGDEQVPF NSVSGLDISH DVIEYKDGTG
PLT_01736  MSTTPEQIAV EYPIPTYRFV VSIGDEKVPF NSVSGLDISH DVIEYKDGTG
PAK_01896  MTTTT----V DYPIPAYRFV VSVGDEQIPF NNVSGLDITY DVIEYKDGTG
PAU_02074  MATTT----V DYPIPAYRFV VSVGDEQIPF NSVSGLDITY DVIEYKDGTG
PLT_02424  MSVTTEQIAV DYPIPTYRFV VSVGDEQIPF NNVSGLDITY DVIEYKDGTG
PLT_01716  MTITPEQIAV DYPIPAYRFV VSVGDEKIPF NNVSGLDVHY DVIEYKDGTG
PLT_01758  MAITPEQIAV EYPIPTYRFV VSVGDEQIPF NNVSGLDVHY DVIEYKDGIG
PAK_03203  MSTSTSQIAV EYPIPVYRFI VSIGDDQIPF NSVSGLDINY DTIEYRDGVG
PAU_03392  MSTSTSQIAV EYPIPVYRFI VSVGDEKIPF NSVSGLDISY DTIEYRDGVG
PAK_02014  MSITQEQIAA EYPIPSYRFM VSIGDVQVPF NSVSGLDRKY EVIEYKDGIG
PAU_02206  MSITQEQIAA EYPIPSYRFM VSIGDVQVPF NSVSGLDRKY EVIEYKDGIG
PAK_01787  MSTTADQIAV QYPIPTYRFV VTIGDEQMCF QSVSGLDISY DTIEYRDGVG
PAU_01961  MSTTADQIAV QYPIPTYRFV VTIGDEQMCF QSVSGLDISY DTIEYRDGVG
PLT_02568  MSTTVDQIAV QYPIPTYRFV VTVGDEQMSF QSVSGLDISY DTIEYRDGIG

           NYYKMPGQRQ AINISLRKGV FSGDTKLFDW INSIQLNQVE KKDISISLTN
           NYYKMPGQRQ AINISLRKGV FSGDTKLFDW INSIQLNQVE KKDISISLTN
           NYYKMPGQRQ AINISLRKGV FSGDTKLFDW INSIQLNQVE KKDISISLTN
           NYYKMPGQRQ AINITLRKGV FSGDTKLFDW LNSIQLNQVE KKDISISLTN
           NYYKMPGQRQ LINITLRKGV FPGDTKLFDW LNSIQLNQVE KKDVSISLTN
           NYYKMPGQRQ LINITLRKGV FPGDTKLFDW LNSIQLNQVE KKDVSISLTN
           NHYKMPGQRQ LINITLRKGV FPGDTKLFDW LNSIQLNQVE KKDVSISLTN
           NYYKMPGQRQ SINITLRKGV FPGDTKLFDW INSIQLNQVE KKDIAISLTN
           NYYKMPGQRQ SINITLRKGV FPGDTKLFDW INSIQLNQVE KKDIAISLTN
           NWFKMPGQSQ LVNITLRKGV FPGKTELFDW INSIQLNQVE KKDITISLTN
           NWFKMPGQSQ STNITLRKGV FPGKTELFDW INSIQLNQVE KKDITISLTN
           NYYKMPGQIQ RVDITLRKGI FSGKNDLFNW INSIELNRVE KKDITISLTN
           NYYKMPGQIQ RVDITLRKGI FSGKNDLFNW INSIELNRVE KKDITISLTN
           NWLQMPGQRQ RPTITLKRGI FKGQSKLYDW INSISLNQIE KKDISISLTD
           NWLQMPGQRQ RPTITLKRGI FKGQSKLYDW INSISLNQIE KKDISISLTD
           NWLQMPGQRQ RPSITLKRGI FKGQSKLYDW INSISLNQIE KKDISISLTD

           EAGTEILMTW SVANAFPTSL TSPSFDATSN EVAVQEITLT ADRVTIQAA
           EAGTEILMTW SVANAFPTSL ISPSFDATSN EVAVQEITLT ADRVTIQAA
           EAGTEILMTW SVANAFPTSL TSPSFDATSN EVAVQEITLT ADRVTIQAA
           EAGTEILMTW SVANAFPTSL TAPAFDATSN EVAVQEISLT ADRVTIQAA
           ETGTEILMSW SVANAFPTSL TSPSFDATSN DIAVQEIKLT ADRVTIQAA
           EVGTEILMTW SVANAFPTSL TSPSFDATSN DIAVQEIKLT ADRVTIQAA
           EAGTEILMSW SVANAFPTSL TSPSFDATSN DIAVQEIKLT ADRVMIQAA
           ETGSQILMTW NVANAFPTSF TSPSFDAASN DIAIQEIALV ADRVTIQAP
           EAGTEILMTW NVANAFPTSF TSPSFDATSN EIAVQEIALT ADRVTIQAA
           DAGTELLMTW NVSNAFPTSL TSPSFDATSN DIAVQEITLT ADRVIMQAV
           DAGTELLMTW NVSNAFPTSL TSPSFDATSN DIAVQEITLM ADRVIMQAV
           DTGSEVLMSW VVSNAFPSSL TAPSFDASSN EIAVQEISLV ADRVTIQVP
           DTGSKVLMSW VVSNAFPSSL TAPSFDASSN EIAVQEISLV ADRVTIQVP
           ETGSNLLITW NIANAFPEKL TAPSFDATSN EVAVQEMSLK ADRVTVEFH
           ETGSNLLITW NIANAFPEKL TAPSFDATSN EVAVQEISLK ADRVTVEFH
           ETGSNLLITW NIANAFPEKL TAPSFDATSN EVAVQEISLK ADRVTVEFH

The result would be:

$ python script.py inputseqs.phy
Counter({'M': 16})
Counter({'S': 12, 'A': 2, 'T': 2})
 ... # Truncated output to stay in post character limit.
Counter({'V': 16})
Counter({'T': 13, 'I': 2, 'M': 1})
Counter({'I': 11, 'V': 3, 'M': 2})
Counter({'Q': 13, 'E': 3})
Counter({'A': 11, 'F': 3, 'V': 2})
Counter({'A': 8, 'P': 3, 'H': 3, 'V': 2})
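
Since the question asks for percentages rather than raw counts, each Counter can be divided by the column total. This is only a rough sketch: it assumes the aligned sequences are available as plain strings of equal length (with the script above, something like `[str(rec.seq) for rec in aln]` should give that), and the helper name `column_percentages` is made up for illustration.

```python
from collections import Counter

def column_percentages(seqs):
    """Return one {residue: percent} dict per alignment column (gaps included)."""
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must all have the same aligned length")
    result = []
    for column in zip(*seqs):  # zip(*seqs) walks the alignment column by column
        counts = Counter(column)
        total = len(column)
        # most_common() keeps the most frequent residue first in each dict
        result.append({res: 100.0 * n / total for res, n in counts.most_common()})
    return result

# Toy alignment for illustration only, not the data shown above.
toy = ["MSTA", "MSTA", "MATV", "MTTG"]
for pos, pct in enumerate(column_percentages(toy), start=1):
    print(pos, pct)
```

Gap characters are counted like any other symbol here, so a column that is half gaps reports 50% for '-'.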

If you want to ignore gaps, you'll have to do something slightly different.
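
One possible way to ignore gaps is to drop the gap characters from each column before counting. This is a sketch under assumptions, not a tested recipe for every file: it assumes gaps are written as `-` (as in the alignment above) and that the sequences are plain strings of equal length. The name `ungapped_counts` is hypothetical and not part of Biopython.

```python
from collections import Counter

def ungapped_counts(seqs, gap_chars="-"):
    """Count residues per alignment column, skipping gap characters."""
    return [
        Counter(c for c in column if c not in gap_chars)
        for column in zip(*seqs)
    ]

# Toy alignment for illustration only; the last column is all gaps.
toy = ["MST-", "MS--", "MAT-"]
for pos, counts in enumerate(ungapped_counts(toy), start=1):
    total = sum(counts.values())
    # Guard against columns that contain nothing but gaps.
    pct = {res: round(100.0 * n / total, 1) for res, n in counts.items()} if total else {}
    print(pos, pct)
```

Percentages computed this way are relative to the non-gap residues in each column, so they still sum to 100% even where some sequences have a gap.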
