How to identify 16S sequences from binning data (contigs)?
4
5
7.0 years ago
liuyifan2014 ▴ 80

Hi everyone, I have some binning data and I want to retrieve 16S sequences from it. Each bin is supposed to be a pure single genome, but each one is composed of several contigs. This may cause problems if I submit them to a 16S rRNA identifier like RNAmmer, which requires a single genome sequence file. Do you know of any program that serves this purpose? Many thanks :)~

genome Assembly sequence blast • 11k views
13
7.0 years ago

BBDuk works well for this purpose, if you have a large set of curated ribosomal sequences (such as Silva). You can run it on the reads or the assemblies.

bbduk.sh in=data.fa outm=ribo.fa outu=nonribo.fa k=31 ref=silva.fasta


Though, I'm not quite clear why you want to remove the 16S.
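
For readers unfamiliar with k-mer matching, the bbduk.sh command in this answer can be pictured roughly as follows. This is only a conceptual sketch in Python, not BBDuk's actual implementation: the function names are made up for illustration, and it assumes a sequence counts as "matched" as soon as it shares one k-mer (on either strand) with the reference.

```python
# Conceptual sketch of k-mer matching; NOT how BBDuk is implemented internally.
K = 31  # same k as in the bbduk.sh command

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (non-ACGT characters are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k=K):
    """Yield every k-mer of seq in canonical form (the smaller of the k-mer and its reverse complement)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, reverse_complement(kmer))


def build_reference(ref_seqs, k=K):
    """Collect the canonical k-mers of all reference sequences (e.g. ribosomal sequences)."""
    ref = set()
    for s in ref_seqs:
        ref.update(kmers(s, k))
    return ref


def split_by_reference(records, ref_kmers, k=K):
    """Split {name: sequence} into (matched, unmatched), analogous to outm= and outu=."""
    matched, unmatched = {}, {}
    for name, seq in records.items():
        if any(km in ref_kmers for km in kmers(seq, k)):
            matched[name] = seq
        else:
            unmatched[name] = seq
    return matched, unmatched
```

Note that a whole contig ends up in the "matched" set if any part of it looks ribosomal, so the matched contigs still contain non-16S sequence.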

0

Thank you. I don't want to remove them, I just want to extract them for phylogenetic classification.

6

I put a link to a set of ribosomal kmers on Google drive:

I made it mainly from Silva. It's small (9MB) and you can use it with BBDuk like this:

bbduk.sh in=data.fa outm=ribo.fa outu=nonribo.fa k=31 ref=ribokmers.fa.gz


It has roughly 99.94% sensitivity against the full Silva database.

1

This works quite nicely! I would like to build a ribokmer-style reference for other genes as well. I tried it with kmercountexact from bbmap, but I cannot replicate the file provided on Google Drive. Can you shed some light on how to build this reference set of ribosomal kmers? Thanks!

1

The process was a little involved. I started with the Silva ribosomal database, and followed this procedure:

1. Deduplicated the sequences with dedupe.sh, since there are lots of redundant copies.
2. Ran them through kcompress.sh, which produces a fasta file containing all the kmers of interest. I made several different versions; one containing all 31-mers, one containing only 31-mers that occurred at least 2 times, one for 3 times, etc. up to 50 times (there is a flag for kcompress which specifies this). That allows variable sensitivity/specificity; the 1-copy version is the biggest and most sensitive, while the 50-copy version is tiny and tends to only contain the most highly conserved ribosomal kmers (or the ones for the organisms that are most popular to sequence so they are in the database a lot).
3. Then I generated synthetic data from Silva and tested the sensitivity of the different versions. The point of this was to salvage kmers that were important but missing. So, for example, I ran BBDuk with the 10-copy kmer set and kept the reads that did not match, and added in some of the most common kmers from those nonmatching reads; this keeps the file size small but increases sensitivity. This step is not really necessary, though - you can just do 1 and 2 in general, but this is what I did.

The file on my Google Drive is, I think, the version in which I kept only kmers present at least 3 times in the deduplicated Silva database.
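
Steps 1 and 2 above can be sketched in a few lines of Python. This is only an illustration of the idea, not how dedupe.sh or kcompress.sh work internally: it removes exact duplicates only, counts forward-strand k-mers, and the function names and the `min_count` parameter are made up for this example.

```python
from collections import Counter


def deduplicate(seqs):
    """Drop exact duplicate sequences (case-insensitive), keeping the first occurrence."""
    seen = set()
    unique = []
    for s in seqs:
        key = s.upper()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def count_kmers(seqs, k=31):
    """Count every occurrence of every k-mer across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def frequent_kmers(seqs, k=31, min_count=3):
    """Return the k-mers seen at least min_count times in the deduplicated input."""
    counts = count_kmers(deduplicate(seqs), k)
    return {kmer for kmer, n in counts.items() if n >= min_count}
```

Raising `min_count` shrinks the set toward the most conserved (or most heavily sequenced) k-mers, which is the size versus sensitivity trade-off described in step 2.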

0

Cool, have you got documentation or a publication for this tool?

0

There's no paper yet. Documentation is in the shell script (it is printed if you run it with no arguments). There's also a thread here explaining common uses.

Edit: There is also now a BBDuk usage guide in the /docs/ directory.

0

Sorry to resurrect an old thread. Would you happen to have information on which organisms each of those kmers maps to? @Brian Bushnell

1
7.0 years ago
5heikki 10k

ssu-align is pretty great for identifying 16S and 18S

0

Thank you. But ssu-align is specifically designed for PCR-based SSU sequences. My data is draft genome data, most of which I think is non-SSU sequence. What's more, the SSU sequences in my file are fragmented: they may be split across several contigs, or may be just a small piece of a big contig. Do you have any ideas about these problems? I will try it anyway :)

1

It's an HMM-based general search tool. I have used it for screening 16S and 18S from metagenomic assemblies. At least in my assemblies, the 16S fragments are generally at the ends of contigs. I haven't looked into it further, but I think it's because sequence similarity is too high (identical k-mers) in the very conserved regions of the molecule.

0

It works, thank you!

1
7.0 years ago
h.mon 34k

hmm_RNA uses hidden Markov models to find SSU and LSU in metagenomic assemblies. It is sensitive, finding even small fragments in draft assemblies, but its classification is inaccurate: it labels several bacterial sequences as "Archaeal" or "Eukaryotic".

0

It works, thank you!

1
16 months ago
O.rka ▴ 620

I might be a little late to the game, but since 2018 BARRNAP (https://github.com/tseemann/barrnap) has been my go-to. It was made by Torsten Seemann, the guy who made Prokka. He writes really great software.
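
Since the original question was about pulling the 16S sequences out of the contigs, here is a minimal Python sketch for doing that from GFF3-style annotations. It assumes the rRNA predictions were saved as GFF3 and that 16S hits contain the text "16S" in the attributes column (my understanding is that Barrnap writes something like `Name=16S_rRNA`, but check your own output). The function names are made up for illustration.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def read_fasta(lines):
    """Parse FASTA lines into {contig_id: sequence}; the id is the header up to the first whitespace."""
    seqs, name, chunks = {}, None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def extract_16s(gff_lines, contigs):
    """Return {label: sequence} for every GFF3 feature whose attributes mention 16S."""
    hits = {}
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or "16S" not in cols[8]:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        seq = contigs[seqid][start - 1:end]  # GFF3 coordinates are 1-based, inclusive
        if strand == "-":
            seq = seq.translate(_COMPLEMENT)[::-1]  # report minus-strand hits as reverse complement
        hits[f"{seqid}:{start}-{end}({strand})"] = seq
    return hits
```

With real files, something like `extract_16s(open("rrna.gff"), read_fasta(open("bin1.fa")))` would do it, where both file names are placeholders. Partial hits at contig ends come out as partial sequences, so check their lengths before using them for phylogeny.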