Question: Extracting 16S rRNA region from bacterial genomes
6 months ago by
United States
ARich90 wrote:

Dear Biostar Users,

I would like to generate a phylogenetic tree from ~1000 bacterial genomes. For this purpose I would like to extract the highly conserved 16S rRNA region of these genomes.

The information I have looks like the line below: a genome name and an NC accession ID.

Acaryochloris marina MBIC11017, NC_009925

Is there any way to automate the extraction of the conserved 16S region for these ~1000 genomes?

Looking forward to a solution.


ADD COMMENTlink written 6 months ago by ARich90

I am not answering your question directly but want to mention an alternative option. You may want to download the 16S rRNA BLAST indexes made available by NCBI here (Warning: large download). Use blastdbcmd from BLAST+ to dump the sequences in FASTA format and then pick out the ones you need. This is a curated dataset and will likely have the best sequences available for the organisms you can find.
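A minimal sketch of the "dump, then pick" idea, assuming BLAST+ is installed and the 16S_ribosomal_RNA database files are already unpacked locally. The function names and the file names genome_names.txt and picked_16s.fasta are made up for illustration. Matching is a plain substring test against the FASTA header, so a name only matches if the header spells it exactly the same way.

```shell
# Dump every record of the local BLAST database as FASTA.
# Assumption: the 16S_ribosomal_RNA files are in the current directory
# or in a directory listed in $BLASTDB.
dump_16s_fasta() {
    blastdbcmd -db 16S_ribosomal_RNA -entry all -outfmt '%f'
}

# Read FASTA on stdin and keep only the records whose header line
# contains one of the names listed (one per line) in the file "$1".
pick_by_name() {
    awk -v names="$1" '
        BEGIN {
            # Load the wanted names, skipping blank lines.
            while ((getline line < names) > 0)
                if (line != "") want[++n] = line
        }
        /^>/ {
            # New record: decide whether to keep it.
            keep = 0
            for (i = 1; i <= n; i++)
                if (index($0, want[i])) { keep = 1; break }
        }
        keep
    '
}

# Hypothetical usage:
#   dump_16s_fasta | pick_by_name genome_names.txt > picked_16s.fasta
```

Since the names file is loaded once and each header is checked against it, the database is only dumped a single time no matter how many genomes are on the list.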

ADD REPLYlink modified 6 months ago • written 6 months ago by GenoMax95k

Thank you for the suggestion. I tried it as follows:

1. I first downloaded the complete 16S rRNA database from NCBI.
2. Then, using my genome list, I tried to extract the 16S sequences for the listed genomes with the following command:

    blastdbcmd -db 16S_ribosomal_RNA \
        -entry all \
        -outfmt "%g;;%t" | \
    grep -F "${MY_GENOME-LIST}" | \
    awk -F";;" '/16S /{print $1}' | \
    blastdbcmd -db 16S_ribosomal_RNA \
        -entry_batch - \
        -out seq.fasta

The problem is that my file lists 400 genome names, but in the end I can extract sequences for only 200. I checked why this happens: some of the genomes have no entry in this NCBI database, so they are missed at the grep -F step. So the question is: is it normal for the NCBI 16S database to lack 16S entries for some genomes?
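One way to see exactly which names fall through at the grep -F step is to test each name against the title table on its own. This is only a sketch: the function name and the files genome_names.txt, titles.txt and missing.txt are hypothetical, and it uses the accession field (%a) instead of %g purely as an illustrative choice. A name can also be reported as missing simply because the database title spells it differently from the list (for example an extra word before the strain designation), so the output is a starting point for checking, not proof that the organism is absent.

```shell
# Print every name from the list file "$1" (one name per line) that does
# not occur as a fixed string anywhere in the 'id;;title' table file "$2".
names_without_hit() {
    while IFS= read -r name; do
        # Skip blank lines in the list.
        [ -n "$name" ] || continue
        # -q: quiet, -F: fixed string, --: the name may start with a dash.
        grep -qF -- "$name" "$2" || printf '%s\n' "$name"
    done < "$1"
}

# Hypothetical usage, assuming BLAST+ and the local 16S_ribosomal_RNA files:
#   blastdbcmd -db 16S_ribosomal_RNA -entry all -outfmt '%a;;%t' > titles.txt
#   names_without_hit genome_names.txt titles.txt > missing.txt
```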

Thank you in advance.

ADD REPLYlink modified 5 months ago • written 5 months ago by ARich90

The 16S rRNA BLAST database indexes are representative and curated, i.e. they do not contain every genome available in the NCBI databases.

ADD REPLYlink modified 5 months ago • written 5 months ago by GenoMax95k

I guess the easiest way would be to download the annotated genomes with eutils and extract the 16S regions. Maybe SILVA could be useful, but I'm not sure it links to the genome RefSeq IDs.
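A rough sketch of how this could look with plain E-utilities URLs, without pulling the full genome sequence first: fetch the feature table (rettype=ft) for an accession, find the rRNA features annotated as 16S, then request only those coordinates as FASTA via seq_start, seq_stop and strand. The function names are made up, and the sketch assumes that the annotation labels the product exactly as "16S ribosomal RNA" and that each 16S gene is a single interval. Neither assumption is guaranteed for every record.

```shell
# Build the efetch URL for the feature table of one nucleotide accession.
ft_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s&rettype=ft&retmode=text\n' "$1"
}

# Read a tab-separated feature table on stdin and print "start<TAB>stop"
# for every rRNA feature whose product is "16S ribosomal RNA".
# start > stop means the feature lies on the minus strand.
find_16s_coords() {
    awk -F'\t' '
        # Feature line: start, stop, feature type.
        $1 != "" && NF >= 3 && $3 != "" {
            start = $1; stop = $2; type = $3
            gsub(/[<>]/, "", start); gsub(/[<>]/, "", stop)  # partial markers
            next
        }
        # Header lines and continuation intervals are ignored.
        $1 != "" { next }
        # Qualifier line: three empty columns, then key and value.
        type == "rRNA" && $4 == "product" && $5 == "16S ribosomal RNA" {
            print start "\t" stop
        }
    '
}

# Build the efetch URL for one region as FASTA (strand=2 is the minus strand).
region_url() {
    _acc=$1; _start=$2; _stop=$3; _strand=1
    if [ "$_start" -gt "$_stop" ]; then
        _swap=$_start; _start=$_stop; _stop=$_swap; _strand=2
    fi
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s&rettype=fasta&retmode=text&seq_start=%s&seq_stop=%s&strand=%s\n' \
        "$_acc" "$_start" "$_stop" "$_strand"
}

# Hypothetical usage for one accession (curl assumed to be available):
#   curl -s "$(ft_url NC_009925)" | find_16s_coords |
#   while read -r s e; do
#       curl -s "$(region_url NC_009925 "$s" "$e")"
#       sleep 1   # pause between requests to stay within NCBI rate limits
#   done > NC_009925_16S.fasta
```

For ~1000 accessions the same loop could be wrapped in an outer loop over the accession list. Genomes with several rRNA operons will yield several 16S copies, so one copy per genome would still need to be chosen before building the tree.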

ADD REPLYlink written 6 months ago by Asaf8.5k

Thank you for the reply. Wouldn't downloading the whole genome and then extracting the 16S rRNA be over-engineering? Do you know a way to extract only the 16S rRNA? I am not sure whether SILVA contains information for all the genomes I am working with, but I am sure that NCBI (GenBank/RefSeq) has these genomes.

Can't we extract the 16S directly with eutils? If yes, then how?

Many thanks!

ADD REPLYlink written 6 months ago by ARich90

I don't know, since these are assemblies. I'm not sure whether you can download just the .ffs files.

ADD REPLYlink written 6 months ago by Asaf8.5k