Beginner pipeline to compare SSU in sample to tree of SSUs
1
0
2.3 years ago
suzuBell ▴ 60

I have raw paired-end reads (not yet aligned) that may be bacterial or archaeal but could also be eukaryotic. I would like to identify the small-subunit (SSU) rRNAs in the sample and then compare them to an SSU tree that comprehensively spans prokaryotes, eukaryotes, etc. I have basic command line skills.

I am thinking of using RNAmmer or Barrnap because their vignettes are on the shorter side, which makes me feel I can get preliminary results in a more timely manner. I am having difficulty figuring out a simple pipeline to accomplish this task for two reasons:

1) I am unsure whether software such as RNAmmer and Barrnap can take raw paired-end .fastq files as input and, if not, how to prepare an appropriate input in a straightforward fashion.

2) I am unsure how to take the output from RNAmmer and Barrnap (which I believe will tell me the SSU sequences in my sample) and compare it to a comprehensive SSU tree to get a better idea of where my sample fits phylogenetically among other organisms.

ssu rnammer barrnap phylogeny • 1.1k views
1

RNAmmer and Barrnap are used to predict ribosomal genes in full genomes - is this WGS data? Do you plan to assemble these reads?

0

Yes, it is WGS data. I did assemble the reads using SPAdes. Would the resulting contigs be suitable as input to RNAmmer and Barrnap (if raw reads are not)?

2

Sure. To be fair, I got way less lucky than Mensur when attempting to use 16S sequences from pretty much any Illumina short-read bacterial assembly. My assemblies frequently broke exactly at the 16S locus, while 5S and 23S made it into one contig or another.

I used SortMeRNA and attempted some denoising and overlap assembly with VSEARCH and FLASH but didn't push very far. Eventually I classified the sorted reads using the RDP classifier. This certainly was no "publication ready" result, but I ran out of time and it gave me a ballpark estimate of what was in the soup.

1

An ounce of luck is often better than a pound of skill - convert to SI metrics as appropriate :-)

In one of the recent metagenomes I've looked at there are at least 70-80 (sub)species, and we were able to assemble ~30 of them as essentially complete genomes (>90% completeness and <5% contamination). Still, this would be an exception rather than a rule for complex communities, and it had little to do with skill beyond careful DNA extraction.

4
2.3 years ago
Mensur Dlakic ★ 15k

I have basic command line skills.

This statement means different things to different people, so I will err on the side of presenting all the options I know about.

1) Barrnap and RNAmmer work on assembled genomes/contigs, so they will not work with raw sequencing reads. There is no straightforward way to prepare .fastq files for Barrnap and RNAmmer - the reads need to be assembled first. I suggest SPAdes as it is relatively easy to install and use. Make sure to use the --meta option if this is a metagenomic sample. Even though I will offer you options below for doing the task without de novo assembly, I still suggest you assemble the reads as the two programs you intended to use will have better sensitivity. Besides, you will likely need to assemble this at some point anyway.
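As a rough sketch of the assemble-then-predict route described above, the Python below builds the SPAdes and Barrnap command lines and pulls the SSU (16S/18S) hits out of Barrnap's GFF3 output. The file names, the contig name `NODE_1`, and the example GFF lines are made up for illustration, and the function names are my own. Because the sample may be bacterial, archaeal, or eukaryotic, the sketch prints one Barrnap command per `--kingdom` value; treat it as a starting point rather than a tested pipeline.

```python
import shlex


def spades_command(r1, r2, outdir, meta=False):
    """Build a SPAdes command line for one paired-end library."""
    cmd = ["spades.py", "-1", r1, "-2", r2, "-o", outdir]
    if meta:
        # --meta switches SPAdes to its metagenomic mode
        cmd.insert(1, "--meta")
    return cmd


def barrnap_command(contigs, kingdom="bac"):
    """Build a Barrnap command line; Barrnap writes GFF3 to standard output."""
    return ["barrnap", "--kingdom", kingdom, contigs]


def ssu_hits(gff_text):
    """Return (contig, start, end, strand, name) for each 16S/18S hit."""
    hits = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue
        # Column 9 holds key=value pairs separated by semicolons
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("Name", "")
        if name in ("16S_rRNA", "18S_rRNA"):
            hits.append((cols[0], int(cols[3]), int(cols[4]), cols[6], name))
    return hits


# Hypothetical Barrnap output, invented for this example
example_gff = "\n".join([
    "##gff-version 3",
    "NODE_1\tbarrnap:0.9\trRNA\t100\t1650\t0\t+\t.\t"
    "Name=16S_rRNA;product=16S ribosomal RNA",
    "NODE_7\tbarrnap:0.9\trRNA\t20\t130\t1e-10\t-\t.\t"
    "Name=5S_rRNA;product=5S ribosomal RNA",
])

if __name__ == "__main__":
    # Hypothetical input and output names
    print(shlex.join(spades_command("reads_R1.fastq", "reads_R2.fastq",
                                    "spades_out", meta=True)))
    for kingdom in ("bac", "arc", "euk"):
        print(shlex.join(barrnap_command("spades_out/contigs.fasta", kingdom)))
    for hit in ssu_hits(example_gff):
        print(hit)
```

The coordinates returned by `ssu_hits` can then be used to cut the SSU sequences out of the contigs for the tree-building or BLAST steps.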

There are programs that can identify and assemble SSU reads directly from metagenomic data. I have used SortMeRNA and it works, but it is fairly old. Other similar programs that I haven't tried are phyloFlash and Metaxa2. I think this approach is OK if SPAdes runs into memory problems or if you quickly want to get a general idea about the composition of your sample, but in my experience it is much better to search for 16S rRNA in assembled data.

2) There is no straightforward way to do this either. There have been machine learning attempts at training computers to add sequences to existing trees, but I don't know how reliable that approach is in general. A conventional way is to select a good set of representative 16S rRNA sequences from all 3 kingdoms of life, add your sequences to that group, align and trim the alignment, and build a tree. I recommend SSU-ALIGN for aligning 16S rRNA sequences. It can also identify SSU rRNA sequences from all 3 kingdoms of life.
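To make the "trim the alignment" step a little more concrete, here is a minimal Python sketch that drops alignment columns where more than a chosen fraction of sequences have a gap. This is a generic gap-fraction filter that I am using for illustration, not the masking that SSU-ALIGN itself applies; the function name, the 0.5 threshold, and the toy sequences are all assumptions.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.5):
    """Drop columns whose gap fraction exceeds max_gap_fraction.

    alignment maps sequence name -> aligned sequence (all the same length).
    Both '-' and '.' are counted as gaps.
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    if not seqs:
        return {}
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    keep = [
        i for i in range(length)
        if sum(seq[i] in "-." for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in zip(names, seqs)}


# Toy alignment, invented for this example
toy = {
    "ref_A": "ACG-TA",
    "ref_B": "AC--TA",
    "query": "ACGGTA",
}

if __name__ == "__main__":
    for name, seq in trim_gappy_columns(toy).items():
        print(name, seq)
```

In the toy alignment only the fourth column is removed, since two of its three sequences have a gap there. The trimmed alignment would then go to a tree-building program of your choice.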

A shortcut before building a tree is to BLAST your SSU rRNA sequences against a nucleotide database. Unless your sequences are highly unusual, it should give you a solid indication of what is in your sample.
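One way to read the results of such a search: if blastn is run with tabular output (`-outfmt 6`), each line has 12 tab-separated columns in a fixed default order, and a few lines of Python can keep the best-scoring hit per query. The sketch below assumes that default column order; the query names, subject IDs, and numbers in the example are invented, and the command in the comment is only one plausible invocation.

```python
# One plausible way to produce the input (file names are hypothetical):
#   blastn -query ssu.fasta -db nt -remote -outfmt 6 -max_target_seqs 5 -out hits.tsv

# Default column order of BLAST+ tabular output (-outfmt 6)
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text):
    """Return {query_id: row_dict} keeping the highest-bitscore hit per query."""
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < len(BLAST6_FIELDS):
            continue
        row = dict(zip(BLAST6_FIELDS, cols))
        row["pident"] = float(row["pident"])
        row["bitscore"] = float(row["bitscore"])
        query = row["qseqid"]
        if query not in best or row["bitscore"] > best[query]["bitscore"]:
            best[query] = row
    return best


# Invented example rows; subject IDs are placeholders, not real accessions
example_hits = "\n".join([
    "ssu_1\tsubject_A\t97.5\t1500\t30\t2\t1\t1500\t10\t1509\t0.0\t2500",
    "ssu_1\tsubject_B\t99.1\t1500\t12\t1\t1\t1500\t5\t1504\t0.0\t2700",
    "ssu_2\tsubject_C\t88.0\t1400\t150\t10\t1\t1400\t1\t1400\t0.0\t1600",
])

if __name__ == "__main__":
    for query, row in best_hits(example_hits).items():
        print(query, row["sseqid"], row["pident"], row["bitscore"])
```

A best hit with low percent identity is itself informative: it suggests the sequence has no close relative in the database and is worth placing on a tree.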

0

Thank you for your helpful support here. This may be a long shot, but do you know of any online tutorial that shows how to "select a good set of representative 16S rRNA sequences from all 3 kingdoms of life [my note: or really any set], add your sequences to that group, align and trim the alignment, and build a tree"? It seems the aligning is done using SSU-ALIGN, but what about selecting representative 16S rRNA sequences and/or actually building the tree? Sorry if this is a naive question.

1

I don't know if there is a tutorial, but I would do this by inspecting a tree of life and picking representatives from all major (sub)branches. The file below may be helpful when it comes to Archaea:

https://static-content.springer.com/esm/art%3A10.1038%2Fs41564-018-0163-1/MediaObjects/41564_2018_163_MOESM1_ESM.pdf

If you go to Figure 3F, there is a 16S rRNA tree with ~100 Archaeal species that covers all major taxonomic groups. A hundred species may still be too many for you, so you can pick a couple of representatives from each branch. For example, I think you'd be fine by going with Metallosphaera sedula and Sulfolobus solfataricus out of 9 species from the top that are on the same branch. You can shrink the numbers considerably by applying similar logic to the rest. Figure 3D in the same file has 10 Eukarya and 10 Bacteria, but I don't know that those would necessarily be representative of whole kingdoms as they were simply outgroups in our study.
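If you type the species and their branches from such a tree into a simple table, the "couple of representatives per branch" logic can be sketched in Python as below. The branch labels and species names here are placeholders, and taking the first names alphabetically is an arbitrary stand-in for a real choice (for example, preferring well-studied organisms with complete genomes).

```python
from collections import defaultdict


def pick_representatives(species_to_branch, per_branch=2):
    """Return {branch: [species, ...]} with at most per_branch species each.

    Species are taken in alphabetical order, which is arbitrary; swap in
    whatever preference makes sense for your data.
    """
    by_branch = defaultdict(list)
    for species, branch in species_to_branch.items():
        by_branch[branch].append(species)
    return {branch: sorted(members)[:per_branch]
            for branch, members in by_branch.items()}


# Placeholder names, not taken from the figure
example = {
    "species_a": "branch_1",
    "species_b": "branch_1",
    "species_c": "branch_1",
    "species_d": "branch_2",
    "species_e": "branch_3",
    "species_f": "branch_3",
}

if __name__ == "__main__":
    for branch, members in pick_representatives(example).items():
        print(branch, members)
```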