What software could be used to calculate substitution rate of a pair of nucleic acid sequences?
2
2
Entering edit mode
8.6 years ago
Cacau ▴ 500

Does anyone know how to calculate the substitution rate (e.g. how many substitution per site) for a pair or a bunch of nucleic acid sequences? Thanks a lot.

alignment sequence • 5.1k views
2
Entering edit mode
8.6 years ago

Hi cacaucenturion2,

I have written a usearch_aln_transition_prob.py that you can use to generate transition probabilities PROVIDED IF you have used usearch to generate alignment files (*.aln) by matching your sequences against a reference database.

# usearch -usearch_global nested_pcr_ATCACG_L001_R1_001_trim.fasta -db /home/uzi/TSB_AMPLICONS_REFERENCE/Mercier_16S_reference_251013C.fasta -id 0.95 -alnout nested_pcr_ATCACG_L001_R1_001_trim.aln -strand plus -userout nested_pcr_ATCACG_L001_R1_001_trim.ualn -userfields query+target+id+alnlen+mism+opens+qlo+qhi+tlo+thi+evalue+bits -matched nested_pcr_ATCACG_L001_R1_001_trim_matched.fasta -notmatched nested_pcr_ATCACG_L001_R1_001_trim_notmatched.fasta -threads 5
# usearch_i86linux32 v6.0.307, 4.0Gb RAM (529Gb total), 64 cores
#
# Query >MISEQ:14:000000000-A1FNW:1:1101:15022:1463 1:N:0:ATCACG
#  %Id   TLen  Target
#  99%   1144  Chlorobiumtepidum_TLS_hap1
#
#  Query  150nt >MISEQ:14:000000000-A1FNW:1:1101:15022:1463 1:N:0:ATCACG
# Target 1144nt >Chlorobiumtepidum_TLS_hap1
#
# Qry    1 + GTGCCAGCAGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGT 64
#            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
# Tgt  259 + GTGCCAGCAGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGT 322
#
# Qry   65 + GCGCAGGCGGATCGATAAGTCGGGGGTTAAATCCATGTGCTTAACACATGTACGGCTTCCGATA 128
#            |||||||||||||||||||||||||||||||||||||||||||||||||| |||||||||||||
# Tgt  323 + GCGCAGGCGGATCGATAAGTCGGGGGTTAAATCCATGTGCTTAACACATGCACGGCTTCCGATA 386
#
# Qry  129 + CTGTTGATCTAGAGTCTCGAAG 150
#            ||||||||||||||||||||||
# Tgt  387 + CTGTTGATCTAGAGTCTCGAAG 408
#
# 150 cols, 149 ids (99.3%), 0 gaps (0.0%)
#
# Query >MISEQ:14:000000000-A1FNW:1:1101:16372:1472 1:N:0:ATCACG
#  %Id   TLen  Target
#
#  Query  150nt >MISEQ:14:000000000-A1FNW:1:1101:16372:1472 1:N:0:ATCACG
#
# Qry    1 + GTGCCAGCCGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATGACTGGGCGTAAAGCGC 64
#            ||| |||| |||||||||||||||||||||||||||||||||||||||||||||||||||||||
# Tgt  515 + GTGTCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATGACTGGGCGTAAAGCGC 578
#
# Qry   65 + ACGCAGGCGGTCTGTTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTCAAAA 128
#            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
# Tgt  579 + ACGCAGGCGGTCTGTTAAGTTGGATGTGAAATCCCCGGGCTTAACCTGGGAACTGCATTCAAAA 642
#
# Qry  129 + CTGACAGGCTAGAGTCTCGTAG 150
#            ||||||||||||||||||||||
# Tgt  643 + CTGACAGGCTAGAGTCTCGTAG 664
#
# 150 cols, 148 ids (98.7%), 0 gaps (0.0%)
# </TEST.aln>
#
#        Steps involved:
#               (1): We first parse the file such that we populate four strings per record, i.e.,
#            query_name,reference_name, query_string, reference_string
#       (2): For each record, multiline "Qry" and "Tgt" get concatentated together to construct query_string and reference_string, respectively
#       (3): Once whole record is assembled, we use calculate_transition_counts() to calculate transition probabilities


You can filter these transitions to get substitution rates. Now to calculate per-site substitution rates, you need to do a bit of modification to my code:

especially, this function from above script:

def calculate_transition_counts(query,reference,kmer_size,bidirectional):
for i in range(len(query)-(kmer_size-1)):
lk=query[i:i+kmer_size].upper()
rk=reference[i:i+kmer_size].upper()
if bidirectional==1:
if transition_table.get(lk+':'+rk,None)==None:
if transition_table.get(rk+':'+lk,None)==None:
transition_table[rk+':'+lk]= 1
else:
transition_table[rk+':'+lk]=transition_table[rk+':'+lk]+1
else:
transition_table[lk+':'+rk]=transition_table[lk+':'+rk]+1
else:
if transition_table.get(lk+':'+rk,None)==None:
transition_table[lk+':'+rk]= 1
else:
transition_table[lk+':'+rk]=transition_table[lk+':'+rk]+1


If you have a file.bam, you can use the following one-liner to calculate substitution rate (I think it is correct) against reference.fa:

$samtools view -b file.bam | samtools fillmd - e - reference.fa | awk -F"\t" '{a=$10;gsub("=","",a);b+=length(a);c +=length(\$10)} END {print (b/c)*100}'"


Best Wishes,
Umer

0
Entering edit mode

Wow, great work here, useful indeed.

0
Entering edit mode

I think this answers the question but to be clear, it looks like you are calculating substitution percentages not substitution rate (please correct me if I'm wrong). For substitution rate, you need a phylogeny, a model of evolution, and preferably calibration points (other knowledge such as effective population size to go into the coalescent model is helpful).

0
Entering edit mode

Not really, I think he is just trying to assess sequencing quality. Like what I am doing in Glasgow by comparing different platforms/library preparation techniques and then looking at alignments to see Indel and Substitution rates and then making an informed guess as to how well the NGS platform or library preparation technique performed.

Here are reference slides (check page 11) by my supervisor to make it clear: https://stamps.mbl.edu/images/9/9a/NoiseRemoval2013.pdf

1
Entering edit mode

Part of the problem is that it is not completely clear what is being asked, whether it is simply about comparing substitutions or if there is actually some biological question here about variation and rates (I would assume this is the case, as the title implies, but we don't have a lot of information).

My point is that the term "substitution rate" has biological meaning, and you are not calculating the substitution rate by comparing NGS platforms or by simply comparing to a reference. I think you are referring to artifacts or "noise" as differences in substitution rate, but that is very confusing from a biological perspective because it is not related to the process/mechanisms of rate variation. I don't disagree with your assessment at all, but I think we are referring to very different things and using the same terminology, and that was what I was trying to clarify.

0
Entering edit mode

Yeah you are right about "substitution rate" associated with biological meaning in common parlance. Sorry, I should have been more clear that those of us who work with Amplicons, especially 16S/18S rRNA, use this term in the context of sequencing error. When processing Amplicons data for environmental samples where estimating diversity is the main goal, it is vital to distinguish noise from true sequence diversity otherwise you get inflated estimates of number of types of Operational Taxonomic Units (OTUs) a proxy for species if sequences are clustered with 97% similarity for Amplicons. Three sources of errors are there: sequencing error; PCR single base substitutions, and PCR chimeras and most noise removal software (using bayesian methodology) have a priori models on error rates (substitutions, indels) built into the algorithms.

1
Entering edit mode

Thanks for the response, hopefully our discussion will be helpful because these are all important considerations.

2
Entering edit mode
8.6 years ago
SES 8.5k

Your question is a bit misleading because you mention calculating "substitution rates" several times but then say you are wanting to calculate the number of changes per site. It sounds like you are not actually wanting to calculate substitution rates but more likely you are interested in Ti/Tv (transition/transversion) ratios, which can be biologically informative about how a gene is evolving. Either way, you can use PAML or HYPHY for this task.

For just getting ML estimates of these ratios, you can use baseml (part of PAML) with an appropriate model of evolution. The input for this analysis would be a multiple-sequence alignment, and how you construct the alignment is important because it will strongly influence your results. If you are wanting to calculate Ks, you probably will want to use a codon model but more information about what data you have and what specifically you are trying to do would be needed to advise further.