Question

Metagenomics with non-overlapping pairs

6.3 years ago

mamillerpa ▴ 40

I've been doing metagenomics analyses for two collaborators over the past several months. The samples come from a variety of agricultural sources, but all are 250 nt paired-end reads from MiSeq instruments. I have been doing the analyses with a combination of QIIME 1, QIIME 2 and phyloseq. The QIIME 1 workflow uses FLASH as an initial step.

Most of the samples were amplified with the 515F/826R primers for the V3/V4 regions of bacterial 16S rRNA, but I have ad hoc evidence that in some of these "bacterial" samples, the predominant signal is from a ~550 nt amplicon from a fungal 18S rRNA, specifically some Pichia species. These reads don't extend with FLASH, and I can't align them with bowtie to the Lactobacillus plantarum genome (which is otherwise a major component of the other samples), but they do align to some Pichia genomes with bowtie.
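The failure to merge is what the read geometry predicts: two 250 nt mates cover at most 500 nt, so a ~550 nt amplicon leaves a gap in the middle rather than an overlap. A minimal sketch of that arithmetic follows. The function names are made up for illustration, and the 10 nt minimum overlap is an assumed threshold, so check the actual setting of whatever merger you run.

```python
def expected_overlap(read_len, amplicon_len):
    """Bases shared by the two mates of a pair.

    A negative value means the mates do not meet and that many
    bases in the middle of the amplicon are never sequenced.
    """
    return 2 * read_len - amplicon_len


def can_merge(read_len, amplicon_len, min_overlap=10):
    """True if the mates overlap by at least min_overlap bases.

    min_overlap=10 is an assumption, not a value taken from the post.
    """
    return expected_overlap(read_len, amplicon_len) >= min_overlap


# 250 nt paired-end reads on the ~550 nt amplicon described above
print(expected_overlap(250, 550))
print(can_merge(250, 550))
```

Under these assumptions the mates fall about 50 nt short of meeting, so any overlap-based merger will reject such pairs regardless of read quality.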

What's a reasonable, automatable way to show what organism these non-overlapping reads come from? I don't like cherry-picking Pichia and bowtie as a counterscreen. I can show that ~70% of the reads come from Pichia and ~20% come from Lactobacilli with rtax and the all-domain SSU SILVA database, but the team doesn't want to buy the 64-bit version of usearch, so I have to do some ugly chunking. Obviously I can BLAST all of the FASTQ reads against NR or RefSeq, but in the absence of pre-clustering, that takes a lot of CPU hours, and it loses the pairing information (right? am I missing something stupid here?)
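On the pairing question: even if each mate is searched independently, the pairing can be recovered afterwards by joining the per-read results on the shared read ID. Here is a rough sketch, assuming you already have one best-hit taxon label per read from whatever search tool you use. The input dictionaries, the labels, and the `tally_pairs` name are all hypothetical.

```python
from collections import Counter


def tally_pairs(r1_calls, r2_calls):
    """Combine per-mate taxon calls into per-pair calls.

    r1_calls / r2_calls: dict mapping read ID -> taxon label
    (a missing key or None means that mate had no hit).
    """
    tally = Counter()
    for read_id in r1_calls.keys() | r2_calls.keys():
        a = r1_calls.get(read_id)
        b = r2_calls.get(read_id)
        if a is None and b is None:
            continue  # neither mate was assigned
        if a is None or b is None:
            tally["one_mate_only"] += 1  # only one mate had a hit
        elif a == b:
            tally[a] += 1  # both mates agree on the taxon
        else:
            tally["discordant"] += 1  # mates disagree
    return tally


# Toy data, not real results
r1 = {"p1": "Pichia", "p2": "Pichia", "p3": "Lactobacillus",
      "p4": "Pichia", "p5": "Pichia"}
r2 = {"p1": "Pichia", "p2": "Pichia", "p3": "Lactobacillus",
      "p4": "Lactobacillus"}

tally = tally_pairs(r1, r2)
total = sum(tally.values())
for label, count in tally.most_common():
    print(f"{label}\t{count}\t{count / total:.0%}")
```

Only pairs whose mates agree are counted toward a taxon. That is a conservative design choice, and you may prefer to count single-mate hits as well.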

RNA-Seq metagenomics non-overlapping pairs • 2.0k views

I'm not familiar with FLASH, but if these are amplicon data you could merge the pairs with usearch or vsearch, capture the unmerged forward reads, and concatenate them with the merged reads (trim primers?). Then use vsearch to dereplicate (remove duplicates), which should get you down to a much smaller number of sequences to search against a database.
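To illustrate why dereplication shrinks the search so much, here is a toy pure-Python sketch of the idea: pool the merged reads with the unmerged forward reads, collapse identical sequences, and keep a count for each. This is not how vsearch implements it. The sequences, the `dereplicate` name, and the header format are all made up for the example.

```python
from collections import Counter


def dereplicate(sequences):
    """Collapse identical sequences (case-insensitive) and count them.

    Returns (sequence, count) tuples, most abundant first.
    """
    counts = Counter(seq.upper() for seq in sequences)
    return counts.most_common()


# Toy inputs standing in for the two read sets described above
merged = ["ACGTACGT", "ACGTACGT", "TTGGCCAA"]
unmerged_fwd = ["GGGGCCCC", "ggggcccc", "ACGTACGT"]

uniques = dereplicate(merged + unmerged_fwd)

# Write FASTA-style records with the abundance kept in the header
for i, (seq, size) in enumerate(uniques, start=1):
    print(f">uniq{i};size={size}")
    print(seq)
```

Six input reads collapse to three unique sequences here. On real amplicon data the reduction is usually far larger, which is what makes searching the uniques against a big database affordable.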

You may want to try AMPtk for processing, as it is a QIIME-like replacement but much faster and more flexible. It was originally written for fungal ITS data (variable-length amplicons) but works with all amplicon data. (Disclaimer: I'm the author.) By default AMPtk merges paired-end Illumina data in a similar way to what I mentioned above. Docs: http://amptk.readthedocs.io/en/latest/index.html and preprint: https://www.biorxiv.org/content/early/2017/11/03/213470