Species identification using 16S rRNA
4
1
8 months ago
A_heath ▴ 60

Hi all,

I have a FASTA file containing the assembled genome of a bacterial strain. I would like to identify the species of this strain (or the closest match) using its 16S rRNA sequence.

I tried downloading multiple 16S rRNA sequences from the NCBI database and running blastn against them, but I don't think this is the most efficient method.

Do you have any recommendations for a tool to identify the species of a particular strain using 16S rRNA?

Audrey

16s rRNA identification species fasta • 632 views
1

Try the Type Strain Genome Server: https://tygs.dsmz.de/

4
8 months ago

Try this:

Download the bacterial SSU model from Rfam: https://rfam.org/family/RF00177#tabview=tab9

Then use the .cm file with cmsearch from the Infernal suite to locate the SSU in the genome, extract the genome sequence at the best-scoring coordinates, and submit it to NCBI BLAST or BLAST it against SILVA. You can repeat this with the LSU as well if you like. If you want to BLAST the rest of the genome to gain further confidence, you can use the taxon obtained from the SSU search to restrict the taxonomic range of the search and thereby speed it up.
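The "extract the sequence at the best-scoring coordinates" step could be sketched roughly like this in Python. This is only a minimal sketch under some assumptions: cmsearch was run with `--tblout` to get the tabular hit file (the plain `-o` output is meant for reading, not parsing), and the table has the default column layout with the target name in column 1, "seq from" and "seq to" in columns 8 and 9, the strand in column 10 and the bit score in column 15. Check the header line of your own file before relying on this. The function names are made up for illustration.

```python
# Complement table for reverse-complementing minus-strand hits.
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def parse_tblout(lines):
    """Yield (target, seq_from, seq_to, strand, score) per cmsearch hit."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        f = line.split()
        # Assumed layout (0-based): 0 target name, 7 seq from,
        # 8 seq to, 9 strand, 14 bit score.
        yield f[0], int(f[7]), int(f[8]), f[9], float(f[14])


def read_fasta(lines):
    """Return {sequence_id: sequence} from the lines of a FASTA file."""
    chunks, name = {}, None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            chunks[name] = []
        elif name is not None and line:
            chunks[name].append(line)
    return {k: "".join(v) for k, v in chunks.items()}


def extract_best_hit(tblout_lines, fasta_lines):
    """Return (target, start, end, strand, sequence) for the top-scoring hit."""
    hits = list(parse_tblout(tblout_lines))
    if not hits:
        return None
    target, start, end, strand, _score = max(hits, key=lambda h: h[4])
    genome = read_fasta(fasta_lines)
    # On the minus strand "seq from" is larger than "seq to".
    lo, hi = min(start, end), max(start, end)
    seq = genome[target][lo - 1:hi]  # coordinates are 1-based, inclusive
    if strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return target, lo, hi, strand, seq
```

You would call it with two open file handles, e.g. `extract_best_hit(open("hits.tbl"), open("genome.fasta"))` (both file names are placeholders), and paste the returned sequence into BLAST or the SILVA aligner.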

Hope this works out for you.

0

I followed your instructions using the Rfam bacterial LSU (RF02541) and SSU (RF00177) models, and ran the following command:

cmsearch -o output LSU.cm genome.fasta


Do you think I need to calibrate first?

I now have an output file with the genomic coordinates of the regions of interest. I tried BLASTing them against the NCBI database and it works perfectly.

However, I am having trouble using SILVA. Do you know where I can BLAST my sequence on this website?

Thank you again for your valuable help!

1

Hi Audrey,

Good to see this works. Note also the caveat about strain identification raised by 5heikki: the rDNA gene might not be informative for differentiating strains, but it should work fine at least at the genus level. You might have to look at whole-genome alignments to identify the key variants that distinguish the closest strain. As far as I know, calibration only affects the E-values and corrects for sequence composition. Since we are just interested in the best hit(s), it is not really necessary here, but it should not pose a problem to run the calibration first (again).

You can search SILVA here: https://www.arb-silva.de/aligner/ (it's not BLAST, but it works in a similar way)

0

Thank you for your help, it is very useful. Have a nice day!

4
8 months ago
5heikki 10.0k

This is a good 16S reference file: https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/SILVA_138.1_SSURef_tax_silva.fasta.gz

And the map that goes with it: https://www.arb-silva.de/fileadmin/silva_databases/current/Exports/taxonomy/taxmap_slv_ssu_ref_138.1.txt.gz
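To connect a hit ID from the FASTA file to its taxonomy, something like the following could work. It is a rough sketch that assumes the taxmap is tab-separated with a header row naming the columns `primaryAccession`, `start`, `stop`, `path` and `organism_name`, and that the FASTA IDs look like `accession.start.stop`. Please check both files before trusting it. The function names are just illustrative.

```python
import csv


def load_taxmap(lines):
    """Map 'accession.start.stop' to (taxonomy path, organism name)."""
    reader = csv.DictReader(lines, delimiter="\t")
    taxmap = {}
    for row in reader:
        # Assumed: FASTA IDs are built from these three columns.
        key = "{}.{}.{}".format(row["primaryAccession"], row["start"], row["stop"])
        taxmap[key] = (row["path"], row["organism_name"])
    return taxmap


def lookup(taxmap, fasta_id):
    """Look up a FASTA ID (with or without '>' and description)."""
    key = fasta_id.lstrip(">").split()[0]
    return taxmap.get(key)  # None if the ID is not in the map
```

For the real file you would decompress it first, e.g. `load_taxmap(gzip.open(path, "rt"))`, and then pass the subject IDs from your BLAST hits to `lookup`.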

Although you may want to read the SILVA site first to determine whether you want these files or, for example, a non-redundant, higher-quality version:

https://www.arb-silva.de/

Blasting against the FASTA file is fine as long as you set proper thresholds for hit reporting. Also, it will obviously be much faster if you BLAST only the 16S sequence rather than the entire genome assembly (for ideas on how to extract it, see Michael's reply).
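As an illustration of what thresholds for hit reporting might look like, here is a small sketch that filters BLAST tabular output. It assumes the default 12-column `-outfmt 6` layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The cutoff values are arbitrary placeholders, not recommendations, so pick ones that make sense for your data.

```python
def filter_hits(lines, max_evalue=1e-50, min_length=1200):
    """Keep hits passing both cutoffs, best bit score first.

    The default cutoffs are illustrative placeholders only.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        f = line.rstrip("\n").split("\t")
        hit = {
            "qseqid": f[0],
            "sseqid": f[1],
            "pident": float(f[2]),
            "length": int(f[3]),
            "evalue": float(f[10]),
            "bitscore": float(f[11]),
        }
        if hit["evalue"] <= max_evalue and hit["length"] >= min_length:
            kept.append(hit)
    kept.sort(key=lambda h: h["bitscore"], reverse=True)
    return kept
```

The same cutoffs can often be applied directly on the BLAST side instead, but filtering afterwards makes it easy to try different values without rerunning the search.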

That being said, I don't think 16S is the way to go for species identification, especially if you want strain/serotype-level resolution.

0

I downloaded the 16S reference file that you suggested and built a database from it to BLAST against my genome file. Is that what I should do to identify the species?

1

If you're fine with species-level resolution and see a full-length hit with ~99.9% similarity, then that's your species (with perhaps some exceptions, e.g. E. coli/Shigella). For higher resolution (strain/serotype/subspecies, etc.), 16S isn't enough.
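That rule of thumb could be written down as a tiny check on the top hit. This is only a sketch: "full length" is interpreted here as the alignment covering at least 99% of the 16S query, which is my own assumption, and the function cannot know about exceptions such as E. coli/Shigella.

```python
def species_level_call(pident, qstart, qend, query_length,
                       min_pident=99.9, min_coverage=0.99):
    """Apply the ~99.9% identity, full-length heuristic to one hit.

    min_coverage is an assumed stand-in for "full length".
    """
    # Fraction of the 16S query that is covered by the alignment.
    coverage = (abs(qend - qstart) + 1) / query_length
    if pident >= min_pident and coverage >= min_coverage:
        return "species-level candidate"
    return "inconclusive at species level"
```

For example, a hit with 99.93% identity over positions 1 to 1500 of a 1500 bp query passes, while the same identity over only the first 700 bp does not.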

0

Yes, that's right. Thank you so much for your help. Have a nice day!

1
8 months ago
harishk0201 ▴ 110

In addition to what 5heikki and Michael have said, I think I might have another solution.

You can also use barrnap to extract the rRNA sequences, then BLAST them against the NCBI database or run cmsearch with the Rfam models.
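If you go the barrnap route, picking the 16S entries out of its GFF3 output might look roughly like this. The sketch assumes the usual 9-column GFF3 lines and that 16S features carry `Name=16S_rRNA` in the attributes column, so check your own output first. The function name is just for illustration.

```python
def find_16s_features(gff_lines):
    """Return (seqid, start, end, strand) for each 16S rRNA feature."""
    features = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and GFF directives/comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue  # not a complete feature line
        # Attributes column looks like "key=value;key=value".
        attrs = dict(
            item.split("=", 1) for item in cols[8].split(";") if "=" in item
        )
        if attrs.get("Name") == "16S_rRNA":
            features.append((cols[0], int(cols[3]), int(cols[4]), cols[6]))
    return features
```

The returned coordinates are 1-based and inclusive, so they can be used to cut the sequence out of the genome FASTA before submitting it to BLAST.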

1

Another option is the RNAmmer web server/command-line tool, for which barrnap is meant to be a replacement. One might argue that one or the other is more accurate, but in this case they will all get the job done.

0

Absolutely, I was just offering an alternative since RNAmmer never really worked for me!

Have a great day!

1

Also, RNAmmer comes with a terrible bogus license that makes it almost impossible to use.

0
8 months ago
the_dummy ▴ 30

My first thought was the NCBI 16S RefSeq database. Is it OK to BLAST a whole genome sequence against this type of database?

If it is ok, you could try it.

1

Well, you will definitely save some CPU time if you identify the rDNA sequence first. If your genome is too large, you might not even get the job started. Running cmsearch likely takes the same amount of time as uploading the genome, or less.

0

That DB only has about 20k sequences or so.