Question

What software could be used to calculate substitution rate of a pair of nucleic acid sequences?

2

Entering edit mode

10.0 years ago

Cacau ▴ 520

Does anyone know how to calculate the substitution rate (e.g. how many substitution per site) for a pair or a bunch of nucleic acid sequences? Thanks a lot.

alignment sequence • 5.9k views

ADD COMMENT • link updated 2.6 years ago by Ram 43k • written 10.0 years ago by Cacau ▴ 520

Ram · Answer 1 · 2014-04-25

Hi cacaucenturion2,

I have written a usearch_aln_transition_prob.py that you can use to generate transition probabilities PROVIDED IF you have used usearch to generate alignment files (*.aln) by matching your sequences against a reference database.

# usearch -usearch_global nested_pcr_ATCACG_L001_R1_001_trim.fasta -db /home/uzi/TSB_AMPLICONS_REFERENCE/Mercier_16S_reference_251013C.fasta -id 0.95 -alnout nested_pcr_ATCACG_L001_R1_001_trim.aln -strand plus -userout nested_pcr_ATCACG_L001_R1_001_trim.ualn -userfields query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits -matched nested_pcr_ATCACG_L001_R1_001_trim_matched.fasta -notmatched nested_pcr_ATCACG_L001_R1_001_trim_notmatched.fasta -threads 5
# usearch_i86linux32 v6.0.307, 4.0Gb RAM (529Gb total), 64 cores
# 
# Query >MISEQ:14:000000000-A1FNW:1:1101:15022:1463 1:N:0:ATCACG
#  %Id   TLen  Target
#  99%   1144  Chlorobiumtepidum_TLS_hap1
# 
#  Query  150nt >MISEQ:14:000000000-A1FNW:1:1101:15022:1463 1:N:0:ATCACG
# Target 1144nt >Chlorobiumtepidum_TLS_hap1
# 
# Qry    1 + GTGCCAGCAGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGT 64
#            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
# Tgt  259 + GTGCCAGCAGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGT 322
# 
# Qry   65 + GCGCAGGCGGATCGATAAGTCGGGGGTTAAATCCATGTGCTTAACACATGTACGGCTTCCGATA 128
#            |||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||
# Tgt  323 + GCGCAGGCGGATCGATAAGTCGGGGGTTAAATCCATGTGCTTAACACATGCACGGCTTCCGATA 386
# 
# Qry  129 + CTGTTGATCTAGAGTCTCGAAG 150
#            ||||||||||||||||||||||
# Tgt  387 + CTGTTGATCTAGAGTCTCGAAG 408
# 
# 150 cols, 149 ids (99.3%), 0 gaps (0.0%)
# 
# Query >MISEQ:14:000000000-A1FNW:1:1101:16372:1472 1:N:0:ATCACG
#  %Id   TLen  Target
#  99%   1542  Dickeya_dadantii_3937_chromosome_NC_014500.1_complete_genome_6
# 
#  Query  150nt >MISEQ:14:000000000-A1FNW:1:1101:16372:1472 1:N:0:ATCACG
# Target 1542nt >Dickeya_dadantii_3937_chromosome_NC_014500.1_complete_genome_6
# 
# Qry    1 + GTGCCAGCCGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATGACTGGGCGTAAAGCGC 64
#            ||| |||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||
# Tgt  515 + GTGTCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATGACTGGGCGTAAAGCGC 578
# 
# Qry   65 + ACGCAGGCGGTCTGTTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTCAAAA 128
#            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
# Tgt  579 + ACGCAGGCGGTCTGTTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTCAAAA 642
# 
# Qry  129 + CTGACAGGCTAGAGTCTCGTAG 150
#            ||||||||||||||||||||||
# Tgt  643 + CTGACAGGCTAGAGTCTCGTAG 664
# 
# 150 cols, 148 ids (98.7%), 0 gaps (0.0%)
# </TEST.aln>
#
#        Steps involved:
#               (1): We first parse the file such that we populate four strings per record, i.e.,
#            query_name,reference_name, query_string, reference_string
#       (2): For each record, multiline "Qry" and "Tgt" get concatentated together to construct query_string and reference_string, respectively
#       (3): Once whole record is assembled, we use calculate_transition_counts() to calculate transition probabilities

You can filter these transitions to get substitution rates. Now to calculate per-site substitution rates, you need to do a bit of modification to my code:

especially, this function from above script:

def calculate_transition_counts(query,reference,kmer_size,bidirectional):
        for i in range(len(query)-(kmer_size-1)):
                 lk=query[i:i+kmer_size].upper()
                 rk=reference[i:i+kmer_size].upper()   
         if bidirectional==1:
                    if transition_table.get(lk+':'+rk,None)==None:
                if transition_table.get(rk+':'+lk,None)==None:
                    transition_table[rk+':'+lk]= 1
                else:
                    transition_table[rk+':'+lk]=transition_table[rk+':'+lk]+1
            else:
                transition_table[lk+':'+rk]=transition_table[lk+':'+rk]+1
         else:
            if transition_table.get(lk+':'+rk,None)==None:
                transition_table[lk+':'+rk]= 1
            else:
                transition_table[lk+':'+rk]=transition_table[lk+':'+rk]+1

If you have a file.bam, you can use the following one-liner to calculate substitution rate (I think it is correct) against reference.fa:

$ samtools view -b file.bam | samtools fillmd - e - reference.fa | awk -F"\t" '{a=$10;gsub("=","",a);b+=length(a);c +=length($10)} END {print (b/c)*100}'"

Best Wishes,
Umer

Ram · Answer 2 · 2014-04-25

Your question is a bit misleading because you mention calculating "substitution rates" several times but then say you are wanting to calculate the number of changes per site. It sounds like you are not actually wanting to calculate substitution rates but more likely you are interested in Ti/Tv (transition/transversion) ratios, which can be biologically informative about how a gene is evolving. Either way, you can use PAML or HYPHY for this task.

For just getting ML estimates of these ratios, you can use baseml (part of PAML) with an appropriate model of evolution. The input for this analysis would be a multiple-sequence alignment, and how you construct the alignment is important because it will strongly influence your results. If you are wanting to calculate Ks, you probably will want to use a codon model but more information about what data you have and what specifically you are trying to do would be needed to advise further.