Entering edit mode
6 months ago
A_heath ▴ 120
I have the 16S rRNA sequencing of a bacterial sample and I would like to align that sequencing to its reference (from its genome) and obtain a % of identity.
What you be the best tool to achieve that in your opinion?
I thought about Clustal, or if you know a more appropriate tool, please let me know!
Thank you for your help,
Isn't % identity a vague criterion? Is the sequencing data from the same strain? Is so it should map at 100% identity. If you have fastq data then you can use an NGS aligner to identify reads that align. You can then extract those in fasta format and use
clustalto get a MSA.
Thank you very much for your reply. The sequencing data is from the same strain but I would like to find an alignment tool that provides a % of identity (that is close to 100% indeed). I have both sequences in fasta formats. Using clustal, I do not have a % of identity value unfortunately
You are looking for % identity for each read or entire alignment?
I am only looking for a % of identity for the entire alignment of the query and reference fasta sequences. I also tried Needle (https://www.ebi.ac.uk/Tools/psa/emboss_needle/)
You can just blast them with Needleman-Wunsch Global Align. (If the deep link doesn't work for you, scroll down on the main landing page and click on Global Align).
PS: Why did Needle not suit your needs? It does exactly the same thing using the same algorithm...
Thank you very much for your help.
The thing is that when these algorithms compare the two nucleotide sequences I have, they do not consider similarity between an A nucleotide and a D (which is either A, T, or G). Thus I have a % of similarity that is 99% instead of 100%. What do you think about it? Should I consider that A --> D is not similar?
Technically, if you run needle on the command line, you can provide your own scoring matrix with the
-datafileparameter. So you can give an
A-Dpairing the same score as an
A-A, if you like. The default scoring matrix is shown here and does penalize an
A-Dpairing, but not as much as a clear mismatch.
Which interpretation is more accurate with regard to your biological problem is a different story. Intuitively, my main consideration would be, if D is in the reference or your query? If it is part of the query, I'd count it as match (like a Regex, where ADT would match AAT, AGT and ATT equally well), whereas I would maybe not consider the position for scoring if D is in the reference and A is in your query?
Thank you very much for these really helpful information.
I run needle from the command line so I might modify the scoring matrix. In my reference sequence I only have A, T, G, or C nucleotides. However, in my query sequence (as the sequencing has inaccuracies at some positions), I sometimes have D, K, R, etc.