Question: How to identify 16s sequences from binning data(contigs)?
3
gravatar for liuyifan2014
4.0 years ago by
liuyifan201460
United States
liuyifan201460 wrote:

Hi everyone, I got some binning data and I want to retrieve 16s sequences from them. The binning data are supposed to be pure single genome, but each of them is composed of several contigs. So it may cause problem if I submit them to 16sr RNA identifier like RNAmmer which requires single genome sequence file. Do you know any program serves this purpose? Many thanks:)~

blast sequence assembly genome • 4.8k views
ADD COMMENTlink modified 4.0 years ago by h.mon27k • written 4.0 years ago by liuyifan201460
11
gravatar for Brian Bushnell
4.0 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

BBDuk works well for this purpose, if you have a large set of curated ribosomal sequences (such as Silva).  You can run it on the reads or the assemblies.

bbduk.sh in=data.fa outm=ribo.fa outu=nonribo.fa k=31 ref=silva.fasta

Though, I'm not quite clear why you want to remove the 16S...

ADD COMMENTlink written 4.0 years ago by Brian Bushnell16k

Thank you. I don't want to remove them, I just want to extract them for phylogenetic classification.

ADD REPLYlink written 4.0 years ago by liuyifan201460
5

I put a link to a set of ribosomal kmers on Google drive:

https://drive.google.com/file/d/0B3llHR93L14wS2NqRXpXakhFaEk/view?usp=sharing

I made it mainly from Silva.  It's small (9MB) and you can use it with BBDuk like this:

bbduk.sh in=data.fa outm=ribo.fa outu=nonribo.fa k=31 ref=ribokmers.fa.gz

It has roughly 99.94% sensitivity against the full Silva database.

ADD REPLYlink written 4.0 years ago by Brian Bushnell16k
1

This works quite nice! I would like to build the ribokmer reference also for other genes. I tried it with kmercountexact from bbmap but I can not replicate the file provided on google drive. Can you shed some light on how to this reference set of ribosomal kmers? Thanks!

ADD REPLYlink written 3.6 years ago by m.dehollander50
1

The process was a little involved.  I started with the Silva ribosomal database, and followed this procedure:

1) Deduplicated the sequences with dedupe.sh, since there are lots of redundant copies.

2) Ran them through kcompress.sh, which produces a fasta file containing all the kmers of interest.  I made several different versions; one containing all 31-mers, one containing only 31-mers that occurred at least 2 times, one for 3 times, etc. up to 50 times (there is a flag for kcompress which specifies this).  That allows variable sensitivity/specificity; the 1-copy version is the biggest and most sensitive, while the 50-copy version is tiny and tends to only contain the most highly conserved ribosomal kmers (or the ones for the organisms that are most popular to sequence so they are in the database a lot).

3) Then I generated synthetic data from Silva and tested the sensitivity of the different versions.  The point of this was to salvage kmers that were important but missing.  So, for example, I ran BBDuk with the 10-copy kmer set and kept the reads that did not match, and added in some of the most common kmers from those nonmatching reads; this keeps the file size small but increases sensitivity.  This step is not really necessary, though - you can just do 1 and 2 in general, but this is what I did.

The file on my google drive is, I think, the version in which I kept only kmers present at least 3 times in the deduplicated Sliva database.

ADD REPLYlink written 3.6 years ago by Brian Bushnell16k

Cool,have you got documentation or publication for this tool?

ADD REPLYlink written 3.9 years ago by liuyifan201460

There's no paper yet.  Documentation is in the shellscript (it is printed if you run it with no arguments).  There's also a thread here explaining common uses.

Edit: There is also now a BBDuk usage guide in the /docs/ directory.

ADD REPLYlink modified 3.6 years ago • written 3.9 years ago by Brian Bushnell16k
1
gravatar for 5heikki
4.0 years ago by
5heikki8.5k
Finland
5heikki8.5k wrote:

ssu-align is pretty great for identifying 16S and 18S

ADD COMMENTlink written 4.0 years ago by 5heikki8.5k

Thank you. But ssu-align is specific designed for PCR-based SSU sequences. My data is draft genome data , most of which I think are non-SSU sequences. What's more, the SSU sequences in my file are discrete which my be divided in to many contigs or may be just little piece of a big contige. Do you have any idea about these problems? I will try it anyway :)

ADD REPLYlink written 4.0 years ago by liuyifan201460
1

It's hmm based general search tool. I have used it for screening 16S and 18S from metagenomic assemblies. At least in my assemblies, the 16S fragments are generally in the ends of contigs. I haven't looked into it further, but I think it's due to too high sequence similarity (same k-mers) in very conserved regions of the molecule.. 

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by 5heikki8.5k

It works, thank you!

ADD REPLYlink modified 4.0 years ago • written 4.0 years ago by liuyifan201460
1
gravatar for h.mon
4.0 years ago by
h.mon27k
Brazil
h.mon27k wrote:

hmm_RNA uses hidden Markov models to find SSU and LSU on metagenomic assemblies. It is sensitive, finding even small fragments on draft assemblies, but its identification is inaccurate, calling "Archeal" or "Eukariotic" several bacterial sequences.

ADD COMMENTlink written 4.0 years ago by h.mon27k

It works, thank you!

ADD REPLYlink written 4.0 years ago by liuyifan201460
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 923 users visited in the last hour