I've been doing metagenomics analyses for two collaborators over the past several months. The samples come from a variety of agricultural sources, but all are 250 nt paired-end reads from MiSeq instruments. I have been doing the analyses with a combination of QIIME 1, QIIME 2 and phyloseq. The QIIME 1 workflow uses FLASH as an initial step.
Most of the samples were amplified with the 515F/826R primers for the V3/V4 regions of bacterial 16S rRNA, but I have ad hoc evidence that in some of these "bacterial" samples, the predominant signal is from a ~550 nt amplicon from a fungal 18S rRNA, specifically from some Pichia species. These reads don't extend with FLASH, and I can't align them to the Lactobacillus plantarum genome (which is otherwise a major component of the other samples) with bowtie, but they do align to some Pichia genomes with bowtie.
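As a quick sanity check on why these reads would not extend: if both mates are a full 250 nt and the amplicon is about 550 nt, the two reads cannot reach each other. The sketch below just does that arithmetic. The function names and the `min_overlap` threshold are illustrative assumptions, not settings taken from my FLASH runs.

```python
def expected_overlap(read_length, amplicon_length):
    """Bases shared by the two mates if both are full length.

    A negative value means there is a gap between the mates instead of an overlap.
    """
    return 2 * read_length - amplicon_length


def can_merge(read_length, amplicon_length, min_overlap=10):
    """True if the expected overlap reaches min_overlap.

    min_overlap is a hypothetical threshold chosen for illustration only.
    """
    return expected_overlap(read_length, amplicon_length) >= min_overlap


if __name__ == "__main__":
    # 250 nt paired-end reads against the ~550 nt amplicon described above
    print(expected_overlap(250, 550))
    print(can_merge(250, 550))
```

Any amplicon longer than twice the read length leaves a gap, so an overlap-based merger has nothing to work with, regardless of read quality.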
What's a reasonable, automatable way to show what organism these non-overlapping reads come from? I don't like cherry-picking Pichia and bowtie as a counterscreen. I can show that ~70% of the reads come from Pichia and ~20% come from Lactobacilli with RTAX and the all-domain SSU SILVA database, but the team doesn't want to buy the 64-bit version of usearch, so I have to do some ugly chunking. Obviously I can BLAST all of the FASTQ reads against nr or RefSeq, but in the absence of pre-clustering, that takes a lot of CPU hours, and it loses the pairing information (right? Am I missing something stupid here?)
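To make the pre-clustering and pairing concern concrete, here is a minimal sketch of one way to cut down the search load without separating mates: dereplicate identical (forward, reverse) sequence pairs as a unit and keep only the most abundant pairs as FASTA. It uses only the Python standard library. The function names and the `pairN;size=K/1` header style are made up for illustration, and it assumes the R1 and R2 files list mates in the same order with four lines per record.

```python
from collections import Counter
from itertools import islice


def read_fastq(handle):
    """Yield (read_id, sequence) from a FASTQ handle with 4 lines per record."""
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            break
        header = record[0].strip()
        sequence = record[1].strip()
        yield header[1:].split()[0], sequence


def dereplicate_pairs(r1_handle, r2_handle):
    """Count identical (forward, reverse) sequence pairs, keeping mates together.

    Assumes both handles list mates in the same order; read IDs are not checked.
    """
    counts = Counter()
    for (_, seq1), (_, seq2) in zip(read_fastq(r1_handle), read_fastq(r2_handle)):
        counts[(seq1, seq2)] += 1
    return counts


def top_pairs_as_fasta(counts, n=100):
    """Format the n most abundant pairs as FASTA text.

    Each pair gets two records whose hypothetical IDs end in /1 and /2, so hits
    for the two mates can be matched up again after the search.
    """
    lines = []
    for rank, ((seq1, seq2), size) in enumerate(counts.most_common(n), start=1):
        lines.append(f">pair{rank};size={size}/1")
        lines.append(seq1)
        lines.append(f">pair{rank};size={size}/2")
        lines.append(seq2)
    return "\n".join(lines) + "\n" if lines else ""
```

Typical use would be to open the two FASTQ files, pass the handles to `dereplicate_pairs`, and write the output of `top_pairs_as_fasta` to a file for searching. Exact-match dereplication only collapses error-free duplicates, so how much it shrinks a real run is something to measure rather than assume.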