Question

Align 16S sequence to a reference

1

Entering edit mode

19 months ago

A_heath ▴ 160

Hi all,

I have the 16S rRNA sequencing of a bacterial sample and I would like to align that sequencing to its reference (from its genome) and obtain a % of identity.

What you be the best tool to achieve that in your opinion?

I thought about Clustal, or if you know a more appropriate tool, please let me know!

Thank you for your help,

16S sequence alignment • 1.3k views

ADD COMMENT • link updated 19 months ago by Matthias Zepper 4.5k • written 19 months ago by A_heath ▴ 160

1

Entering edit mode

Isn't % identity a vague criterion? Is the sequencing data from the same strain? Is so it should map at 100% identity. If you have fastq data then you can use an NGS aligner to identify reads that align. You can then extract those in fasta format and use clustal to get a MSA.

ADD REPLY • link 19 months ago by GenoMax 141k

1

Entering edit mode

Thank you very much for your reply. The sequencing data is from the same strain but I would like to find an alignment tool that provides a % of identity (that is close to 100% indeed). I have both sequences in fasta formats. Using clustal, I do not have a % of identity value unfortunately

ADD REPLY • link 19 months ago by A_heath ▴ 160

1

Entering edit mode

You are looking for % identity for each read or entire alignment?

ADD REPLY • link 19 months ago by GenoMax 141k

1

Entering edit mode

I am only looking for a % of identity for the entire alignment of the query and reference fasta sequences. I also tried Needle (https://www.ebi.ac.uk/Tools/psa/emboss_needle/)

ADD REPLY • link 19 months ago by A_heath ▴ 160

2

Entering edit mode

You can just blast them with Needleman-Wunsch Global Align. (If the deep link doesn't work for you, scroll down on the main landing page and click on Global Align).

PS: Why did Needle not suit your needs? It does exactly the same thing using the same algorithm...

ADD REPLY • link 19 months ago by Matthias Zepper 4.5k

1

Entering edit mode

Thank you very much for your help.

The thing is that when these algorithms compare the two nucleotide sequences I have, they do not consider similarity between an A nucleotide and a D (which is either A, T, or G). Thus I have a % of similarity that is 99% instead of 100%. What do you think about it? Should I consider that A --> D is not similar?

ADD REPLY • link 19 months ago by A_heath ▴ 160

2

Entering edit mode

Technically, if you run needle on the command line, you can provide your own scoring matrix with the -datafile parameter. So you can give an A-D pairing the same score as an A-A, if you like. The default scoring matrix is shown here and does penalize an A-D pairing, but not as much as a clear mismatch.

Which interpretation is more accurate with regard to your biological problem is a different story. Intuitively, my main consideration would be, if D is in the reference or your query? If it is part of the query, I'd count it as match (like a Regex, where ADT would match AAT, AGT and ATT equally well), whereas I would maybe not consider the position for scoring if D is in the reference and A is in your query?

ADD REPLY • link 19 months ago by Matthias Zepper 4.5k

2

Entering edit mode

Thank you very much for these really helpful information.

I run needle from the command line so I might modify the scoring matrix. In my reference sequence I only have A, T, G, or C nucleotides. However, in my query sequence (as the sequencing has inaccuracies at some positions), I sometimes have D, K, R, etc.

ADD REPLY • link 19 months ago by A_heath ▴ 160