Extracting 16S rRNA region from bacterial genomes
23 months ago
ARich ▴ 100

Dear Biostar Users,

I would like to generate a phylogenetic tree from ~1000 bacterial genomes. For this purpose I would like to extract the highly conserved 16S rRNA region of these genomes.

The information I have looks like the line below: a genome name and an NC accession.

Acaryochloris marina MBIC11017, NC_009925

Is there any way to automate the extraction of the conserved 16S region for these ~1000 genomes?

Looking forward to a solution.

Thanks

genome 16S rRNA Phylogenetic tree • 1.1k views

I am not answering your question directly, but I want to mention an alternative. You may want to download the 16S rRNA BLAST database that NCBI makes available here (Warning: large download). Use blastdbcmd from BLAST+ to dump the sequences in FASTA format, then pick out the ones you need. This is a curated dataset and will likely have the best sequences available for the organisms you can find.
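A minimal sketch of that suggestion, assuming BLAST+ is installed and the database has been unpacked under the name 16S_ribosomal_RNA (the name used later in this thread). The helper filter_fasta_by_name and the output file name are hypothetical, not part of BLAST+.

```shell
# Keep only FASTA records whose header line contains the given name
# (plain substring match). Reads FASTA on stdin, writes FASTA on stdout.
# filter_fasta_by_name is a hypothetical helper, not a BLAST+ tool.
filter_fasta_by_name() {
    awk -v name="$1" '
        /^>/ { keep = (index($0, name) > 0) }   # decide once per record
        keep { print }                          # print header and sequence lines
    '
}

# Usage (needs BLAST+ and the downloaded database, so left commented out):
# blastdbcmd -db 16S_ribosomal_RNA -entry all -outfmt "%f" |
#     filter_fasta_by_name "Acaryochloris marina" > acaryochloris_16S.fasta
```

Matching on the header text is only as good as the names in the database, so it is worth checking a few headers by eye before running it over a long list.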


Thank you for the suggestion. I tried it as follows:

1. I first downloaded the 16S rRNA database from NCBI.
2. Then, using my genome list, I tried to extract the 16S sequences for those genomes with the following command:

    blastdbcmd -db 16S_ribosomal_RNA -entry all -outfmt "%g;;%t" | \
        grep -F "${MY_GENOME-LIST}" | \
        awk -F";;" '/16S /{print $1}' | \
        blastdbcmd -db 16S_ribosomal_RNA -entry_batch - -out seq.fasta

The problem is that my file has names for 400 genomes, but in the end I can extract sequences for only 200. I checked why this happens: some of the genomes have no entry in this NCBI database, so they are missed at the grep -F step. So the question is: is it normal for the NCBI 16S database to lack 16S entries for some genomes?


The 16S rRNA BLAST database is representative and curated, i.e. it does not contain every genome available in the NCBI databases.
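Because of that, it can help to list which names have no match before deciding how to handle them. A minimal sketch, assuming one organism name per line in a list file and the database titles dumped to a second file; report_missing_names and all file names are hypothetical.

```shell
# Print every name from the list file (one organism name per line) that does
# not occur as a fixed substring in any line of the titles file.
# report_missing_names is a hypothetical helper.
report_missing_names() {
    names_file=$1
    titles_file=$2
    while IFS= read -r name; do
        [ -n "$name" ] || continue                 # skip blank lines
        grep -qF -- "$name" "$titles_file" || printf '%s\n' "$name"
    done < "$names_file"
}

# Usage (needs BLAST+ and the database, so left commented out):
# blastdbcmd -db 16S_ribosomal_RNA -entry all -outfmt "%t" > titles.txt
# report_missing_names genome_names.txt titles.txt > missing.txt
```

A name can also be reported as missing only because the title is worded differently (for example with an extra "strain" before the strain ID), so matching on genus and species alone may be more forgiving than matching on the full genome name.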


I guess the easiest way would be to download the annotated genomes with eutils and extract the 16S regions. Maybe SILVA could be useful, but I'm not sure it links to the genome RefSeq IDs.


Thank you for the reply. Wouldn't downloading the whole genome and then extracting the 16S rRNA be over-engineering? Do you know a way to extract only the 16S rRNA? I am not sure whether SILVA contains information for all the genomes I am working with, but I am sure that NCBI (GenBank/RefSeq) has these genomes.

Can't we extract the 16S directly with eutils? If yes, how?

Many thanks!


I don't know, since these are assemblies. I'm not sure whether you can download just the .ffs files.
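One possible way to avoid downloading whole genomes, sketched under assumptions rather than tested on this genome list: ask efetch for the feature table of each accession (rettype=ft), look for rRNA features whose product starts with "16S ribosomal RNA", and then fetch only that interval with seq_start, seq_stop and strand. The helper names find_16s_coords and efetch_region_url are hypothetical, and the product wording is an assumption about how the records are annotated.

```shell
EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Read a tab-separated 5-column feature table on stdin and print
# "start stop strand" for each rRNA feature annotated as 16S.
# strand is 1 for plus and 2 for minus, as expected by efetch.
find_16s_coords() {
    awk -F '\t' '
        /^>/ { next }                                   # ">Feature ..." header
        $1 != "" && $3 != "" {                          # new feature line
            is_rrna = ($3 == "rRNA"); start = $1; stop = $2; next
        }
        is_rrna && $4 == "product" && $5 ~ /^16S ribosomal RNA/ {
            gsub(/[<>]/, "", start); gsub(/[<>]/, "", stop)   # drop partial marks
            if (start + 0 <= stop + 0) print start, stop, 1
            else                       print stop, start, 2  # minus strand
        }
    '
}

# Build the efetch URL for one interval: accession, start, stop, strand.
efetch_region_url() {
    printf '%s?db=nuccore&id=%s&rettype=fasta&retmode=text&seq_start=%s&seq_stop=%s&strand=%s\n' \
        "$EUTILS" "$1" "$2" "$3" "$4"
}

# Usage (needs network access, so left commented out):
# acc=NC_009925
# curl -s "$EUTILS?db=nuccore&id=$acc&rettype=ft&retmode=text" |
#     find_16s_coords |
#     while read -r start stop strand; do
#         curl -s "$(efetch_region_url "$acc" "$start" "$stop" "$strand")"
#         sleep 1    # stay well under the NCBI request rate limit
#     done > "${acc}_16S.fasta"
```

Most bacterial genomes carry several 16S copies, so this would return more than one sequence per accession; for a tree you would still need to pick one copy per genome. Records without rRNA annotation would return nothing and need a different route.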