I have raw paired-end reads (not yet aligned) that may be bacterial or archaeal but could also be eukaryotic. I would like to identify the small-subunit (SSU) rRNAs in the sample and then compare them to an SSU tree that comprehensively spans prokaryotes, eukaryotes, etc. I have basic command-line skills.
I am thinking of using RNAmmer or Barrnap simply because their vignettes are on the shorter side, which makes me feel I can get preliminary results in a more timely manner. I am having difficulty figuring out a simple pipeline for this task, for two reasons:
1) I am unsure whether software like RNAmmer and Barrnap can take raw paired-end .fastq files as input and, if not, how to prepare an appropriate input in a straightforward fashion.
2) I am unsure how to take the output from RNAmmer or Barrnap (which I believe will tell me the SSU sequences in my sample) and compare it to a comprehensive SSU tree to get a better idea of where my sample fits phylogenetically among other organisms.
Any advice would be so very helpful.
RNAmmer and Barrnap are used to predict ribosomal genes in full genomes - is this WGS data? Do you plan to assemble these reads?
Yes, it is WGS data. I assembled the reads using SPAdes. Would the resulting contigs be suitable input for RNAmmer and Barrnap (if raw reads are not)?
Sure. To be fair, I got way less lucky than Mensur when trying to recover 16S sequences from pretty much any Illumina short-read bacterial assembly. My assemblies frequently broke exactly at the 16S locus, while 5S and 23S made it into one contig or another.
I used SortMeRNA and attempted some denoising and overlap assembly with VSEARCH and FLASH but didn't push very far. Eventually I classified the sorted reads using the RDP classifier. This certainly was no "publication ready" result, but I ran out of time and it gave me a ballpark estimate of what was in the soup.
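For the contig route discussed above (SPAdes contigs into Barrnap), here is a minimal Python sketch of the step after prediction: pulling the SSU hits out of the contigs using the GFF3 annotation. It assumes the GFF3 was produced by something like `barrnap --kingdom bac contigs.fasta > rrna.gff`, and that SSU features carry `Name=16S_rRNA` or `Name=18S_rRNA` with "(partial)" in the `product` attribute for truncated hits, which is how Barrnap output looked when I last used it; check your own file. The function names are made up for illustration. Flagging partial hits matters because, as noted, short-read assemblies often break right at the 16S locus.

```python
import io

# Assumed Barrnap-style feature names for the small subunit rRNA
# (16S for bacteria/archaea, 18S for eukaryotes). Adjust to your GFF3.
SSU_NAMES = {"16S_rRNA", "18S_rRNA"}

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(handle):
    """Return {sequence_id: sequence}; the ID is the first word of the header."""
    seqs, name = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {k: "".join(v) for k, v in seqs.items()}


def ssu_hits(gff_handle):
    """Yield (contig, start, end, strand, name, partial) for SSU features."""
    for line in gff_handle:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        name = attrs.get("Name", "")
        if name in SSU_NAMES:
            partial = "partial" in attrs.get("product", "")
            yield cols[0], int(cols[3]), int(cols[4]), cols[6], name, partial


def extract_ssu(fasta_handle, gff_handle):
    """Cut each SSU hit out of its contig, reverse-complementing minus-strand hits."""
    contigs = read_fasta(fasta_handle)
    records = []
    for contig, start, end, strand, name, partial in ssu_hits(gff_handle):
        seq = contigs[contig][start - 1:end]  # GFF3 coordinates are 1-based, inclusive
        if strand == "-":
            seq = revcomp(seq)
        records.append({
            "id": f"{contig}:{start}-{end}({strand})",
            "name": name,
            "partial": partial,
            "seq": seq,
        })
    return records


if __name__ == "__main__":
    # Toy data only; in practice open your contigs FASTA and the GFF3 file.
    fasta = io.StringIO(">contig1\nAACCGTTTGACA\n")
    gff = io.StringIO(
        "##gff-version 3\n"
        "contig1\tbarrnap:0.9\trRNA\t3\t8\t0\t-\t.\t"
        "Name=16S_rRNA;product=16S ribosomal RNA (partial)\n"
    )
    for rec in extract_ssu(fasta, gff):
        flag = " partial" if rec["partial"] else ""
        print(f">{rec['id']} {rec['name']}{flag}\n{rec['seq']}")
```

The extracted sequences can then go to whatever SSU classifier or reference alignment you prefer. Hits flagged as partial are best treated as ballpark evidence only, and if most of them are partial, the read-based route described above is probably the more reliable one.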
An ounce of luck is often better than a pound of skill - convert to SI metrics as appropriate :-)
In one of the recent metagenomes I've looked at there are at least 70-80 (sub)species, and we were able to assemble ~30 of them as essentially complete genomes (>90% completeness and <5% contamination). Still, this would be an exception rather than a rule for complex communities, and it had little to do with skill beyond careful DNA extraction.