Question

Beginner pipeline to compare SSU in sample to tree of SSUs

0

5.7 years ago

suzuBell ▴ 60

I have raw paired-end reads (not yet aligned) that may be bacterial or archaeal but could also be eukaryotic. I would like to identify the small-subunit (SSU) rRNAs in the sample and then compare them to an SSU tree that comprehensively spans prokaryotes, eukaryotes, etc. I have basic command line skills.

I am thinking of using RNAmmer or Barrnap because their vignettes are on the shorter side, which makes me feel I can get preliminary results from this analysis in a more timely manner. I am having difficulty figuring out a simple pipeline to accomplish this task for two reasons:

1) I am unsure whether some of the software (like RNAmmer and Barrnap) can take raw paired-end .fastq files as input and, if not, how to prepare an appropriate input in a straightforward fashion.

2) I am unsure how to take the output from RNAmmer and Barrnap (which I believe will tell me the SSU in my sample) and compare it to a comprehensive SSU tree to get a better idea of where my sample fits phylogenetically with other organisms.

Any advice would be so very helpful.

ssu rnammer barrnap phylogeny • 2.7k views

updated 5.7 years ago by Mensur Dlakic ★ 29k • written 5.7 years ago by suzuBell ▴ 60

1

RNAmmer and Barrnap are used to predict ribosomal genes in full genomes - is this WGS data? Do you plan to assemble these reads?

5.7 years ago by Carambakaracho ★ 3.3k

0

Yes, it is WGS data. I did assemble the reads using SPAdes. Would the resulting contigs be suitable input for RNAmmer and Barrnap (if raw reads are not)?

5.7 years ago by suzuBell ▴ 60

2

Sure. To be fair, I was far less lucky than Mensur when attempting to recover 16S sequences from pretty much any Illumina short-read bacterial assembly. My assemblies frequently broke exactly at the 16S locus, while 5S and 23S made it into one contig or another.

I used SortMeRNA and attempted some denoising and overlap assembly with VSEARCH and FLASH but didn't push very far. Eventually I classified the sorted reads using the RDP classifier. This certainly was no "publication ready" result, but I ran out of time and it gave me a ballpark estimate of what was in the soup.

5.7 years ago by Carambakaracho ★ 3.3k

1

An ounce of luck is often better than a pound of skill - convert to SI metrics as appropriate :-)

In one of the recent metagenomes I've looked at there are at least 70-80 (sub)species, and we were able to assemble ~30 of them as essentially complete genomes (>90% completeness and <5% contamination). Still, this would be an exception rather than a rule for complex communities, and it had little to do with skill beyond careful DNA extraction.

5.7 years ago by Mensur Dlakic ★ 29k

score 5 · Accepted Answer · 2019-10-13

"I have basic command line skills."

This statement means different things to different people, so I will err on the side of presenting you all the options I know about.

1) Barrnap and RNAmmer work on assembled genomes/contigs, so they will not work with raw sequencing reads. There is no straightforward way to prepare .fastq files for Barrnap and RNAmmer - the reads need to be assembled first. I suggest SPAdes as it is relatively easy to install and use. Make sure to use the --meta option if this is a metagenomic sample. Even though I will offer you options below for doing the task without de novo assembly, I still suggest you assemble the reads as the two programs you intended to use will have better sensitivity. Besides, you will likely need to assemble this at some point anyway.
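As a rough illustration of the assemble-then-predict route described above, here is a minimal Python sketch that only builds the command lines and filters Barrnap's GFF3 output for SSU features. The file names (`reads_R1.fastq.gz`, `spades_out`, and so on) are placeholders, and the feature names `16S_rRNA` and `18S_rRNA` are what I would expect in Barrnap's `Name=` attribute; check them against your own output before relying on this.

```python
import shlex


def spades_command(r1, r2, outdir, metagenome=True):
    """Build a SPAdes command line for one paired-end library."""
    cmd = ["spades.py"]
    if metagenome:
        cmd.append("--meta")  # metagenomic mode, as suggested above
    cmd += ["-1", r1, "-2", r2, "-o", outdir]
    return cmd


def barrnap_command(contigs, kingdom="bac", threads=4):
    """Build a Barrnap command line; Barrnap writes GFF3 to stdout."""
    return ["barrnap", "--kingdom", kingdom, "--threads", str(threads), contigs]


def ssu_hits(gff_text, ssu_names=("16S_rRNA", "18S_rRNA")):
    """Return (contig, start, end, strand, name) for SSU features in GFF3 text.

    Assumption: the SSU features carry Name=16S_rRNA or Name=18S_rRNA.
    """
    hits = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF3 header/comment lines
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a complete GFF3 feature line
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("Name", "")
        if name in ssu_names:
            hits.append((cols[0], int(cols[3]), int(cols[4]), cols[6], name))
    return hits


if __name__ == "__main__":
    # Placeholder file names; nothing is executed, the commands are only printed.
    print(shlex.join(spades_command("reads_R1.fastq.gz", "reads_R2.fastq.gz", "spades_out")))
    for kingdom in ("bac", "arc", "euk"):
        cmd = barrnap_command("spades_out/contigs.fasta", kingdom)
        print(shlex.join(cmd), ">", f"rrna_{kingdom}.gff")
```

Since the sample may contain bacteria, archaea or eukaryotes, the sketch prints one Barrnap run per kingdom model. The coordinates returned by `ssu_hits` can then be used to cut the SSU sequences out of the contigs.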

There are programs that can identify and assemble SSU reads directly from metagenomic data. I have used SortMeRNA and it works, but it is fairly old. Other similar programs that I haven't tried are phyloFlash and Metaxa2. I think this approach is OK if SPAdes runs into memory problems or if you want to quickly get a general idea of the composition of your sample, but in my experience it is much better to search for 16S rRNA in assembled data.

2) There is no straightforward way to do this either. There have been machine learning attempts at training computers to add sequences to existing trees, but I don't know how reliable that approach is in general. A conventional way is to select a good set of representative SSU rRNA sequences (16S for bacteria and archaea, 18S for eukaryotes) from all three domains of life, add your sequences to that group, align and trim the alignment, and build a tree. I recommend SSU-ALIGN for aligning SSU rRNA sequences. It can also identify SSU rRNA sequences from all three domains of life.
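To make the "trim the alignment" step a bit more concrete, here is a minimal Python sketch of one common idea: dropping columns that are mostly gaps. This is a generic gap-fraction filter and not the masking that SSU-ALIGN or any dedicated trimming tool performs. The 0.5 threshold and the function names are my own assumptions for illustration.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {sequence_id: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # ID is the first word of the header
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {k: "".join(v) for k, v in records.items()}


def trim_gappy_columns(alignment, max_gap_fraction=0.5, gap_chars="-."):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    seqs = list(alignment.values())
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    # Indices of the columns to keep.
    keep = [
        i for i in range(length)
        if sum(s[i] in gap_chars for s in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep) for name, seq in alignment.items()}
```

For real data you would read the aligned FASTA from a file and write the trimmed version back out before handing it to a tree-building program. Stricter or looser thresholds change how much of the variable regions survives.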

A shortcut before building a tree is to BLAST your SSU rRNA sequences against a nucleotide database. Unless your sequences are highly divergent, it should give you a solid indication of what is in your sample.
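A minimal Python sketch of that BLAST shortcut, assuming the BLAST+ `blastn` program and its standard 12-column tabular output (`-outfmt 6`). The query file name, the choice of the `nt` database with `-remote`, and the helper names are illustrative assumptions. The sketch only builds the command and picks the highest-bitscore hit per query from output you have already produced.

```python
# Column names of BLAST+ tabular output (-outfmt 6) in their default order.
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def blastn_command(query_fasta, db="nt", remote=True, max_targets=5):
    """Build a blastn command line that writes tabular output to stdout."""
    cmd = ["blastn", "-query", query_fasta, "-db", db,
           "-outfmt", "6", "-max_target_seqs", str(max_targets)]
    if remote:
        cmd.append("-remote")  # search on NCBI servers instead of a local database
    return cmd


def best_hits(tabular_text):
    """Return {query_id: hit_dict} keeping the highest-bitscore hit per query."""
    best = {}
    for line in tabular_text.splitlines():
        cols = line.rstrip("\n").split("\t")
        if len(cols) != len(FIELDS):
            continue  # skip blank or malformed lines
        hit = dict(zip(FIELDS, cols))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        query = hit["qseqid"]
        if query not in best or hit["bitscore"] > best[query]["bitscore"]:
            best[query] = hit
    return best
```

The subject IDs of the best hits can then be looked up to see which organisms your SSU sequences most resemble. Percent identity (`pident`) gives a first impression of how close the match is.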