How Should I Deal with Paired-End Shotgun Metagenomic Reads for DIAMOND Analysis?
13 months ago
ian.petersen ▴ 10


I'm working with Illumina shotgun metagenomic reads (2x150 bp). I've tried merging the forward and reverse reads with BBMerge and PEAR, but both tools merge only about 30% of the reads at most.

  1. Would I be right in assuming that this is due to the shotgun shearing producing some larger inserts where the forward and reverse reads never actually overlap?

  2. If this is the case, would there be any benefit to merging the reads before DIAMOND analysis, or would just processing Read 1 and Read 2 separately be preferred?

In a protocol for DIAMOND and MEGAN analysis here, they suggest merging paired-end reads using fastq-join (which I assume would give similar results to BBMerge and PEAR) and then concatenating the merged and unmerged reads to ensure all of the data is retained.

  3. What would be the benefit of merging the reads at all if they are just getting combined with the unmerged reads anyway before analysis (other than having a single input file for DIAMOND)?



diamond metagenomics megan paired-end • 760 views

Hi ian.petersen

Why don't you assemble the reads into contigs, without worrying too much about merging, and then run the DIAMOND/MEGAN pipeline? Longer sequences would greatly improve the taxonomic and functional classification.

13 months ago
h.mon 34k

this is due to the shotgun shearing producing some larger inserts

Yes, shotgun libraries will result in a range of insert sizes, so reads from the shorter inserts will merge, and reads from the longer inserts won't merge.
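
To make the arithmetic concrete, here is a rough Python sketch of why only part of a 2x150 bp library can merge. The function names, the `min_overlap=12` threshold, and the example insert sizes are all made up for illustration; they are not defaults or output from BBMerge, PEAR, or fastq-join, and real mergers also take base quality and mismatches into account.

```python
def expected_overlap(insert_size, read_len=150):
    """Bases shared by the two mates of a fragment of length insert_size.

    Mates overlap by 2 * read_len - insert_size bases, but never by more
    than the read length or the fragment itself, and never by less than 0.
    """
    return max(0, min(2 * read_len - insert_size, insert_size, read_len))


def can_merge(insert_size, read_len=150, min_overlap=12):
    """True if the mates share at least min_overlap bases.

    min_overlap=12 is an illustrative assumption, not a tool default.
    """
    return expected_overlap(insert_size, read_len) >= min_overlap


def mergeable_fraction(insert_sizes, read_len=150, min_overlap=12):
    """Fraction of fragments whose mates overlap enough to merge."""
    merged = sum(can_merge(s, read_len, min_overlap) for s in insert_sizes)
    return merged / len(insert_sizes)


# Hypothetical insert sizes (bp), chosen only to illustrate the idea.
example_inserts = [180, 250, 320, 400, 500]
fraction = mergeable_fraction(example_inserts)
```

With 150 bp reads, any fragment longer than roughly 300 bp minus the minimum overlap leaves a gap between the mates, so a library with a broad insert-size distribution will merge only its short-insert fraction.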

What would be the benefit of merging the reads

I guess longer reads will lead to more precise similarity search results, since they can produce better, longer alignments and reduce the effect of short, spurious hits. However, it is strange that the protocol just concatenates the merged and unmerged reads, as this gives more weight to the unmerged pairs (each potentially counted twice, once per mate) than to the merged reads. In practice, it shouldn't matter much, but I could see this leading to biases, as differences in genome characteristics (e.g., GC content) could lead to systematic differences in insert sizes for different organisms.
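
One way to see the double-counting issue, and a possible workaround, is to count hits per fragment instead of per read. The sketch below is only an illustration: it assumes unmerged mates carry `/1` and `/2` suffixes on their read IDs and that merged reads have no suffix, and the read IDs and taxon names are invented. It is not how DIAMOND or MEGAN count reads internally.

```python
from collections import Counter


def fragment_id(read_id):
    """Strip a trailing /1 or /2 mate suffix (assumed naming convention)."""
    if read_id.endswith(("/1", "/2")):
        return read_id[:-2]
    return read_id


def count_per_read(hits):
    """Count every read separately, so an unmerged pair can count twice."""
    return Counter(taxon for _, taxon in hits)


def count_per_fragment(hits):
    """Count each fragment once, keeping the first assignment seen for it."""
    first_assignment = {}
    for read_id, taxon in hits:
        first_assignment.setdefault(fragment_id(read_id), taxon)
    return Counter(first_assignment.values())


# Invented (read ID, taxon) pairs for illustration only.
hits = [
    ("frag1", "Bacteroides"),    # merged pair: a single sequence
    ("frag2/1", "Escherichia"),  # unmerged pair: both mates got a hit
    ("frag2/2", "Escherichia"),
]

per_read = count_per_read(hits)
per_fragment = count_per_fragment(hits)
```

Here the per-read count gives Escherichia twice the weight of Bacteroides even though each came from one fragment, while the per-fragment count treats them equally. Keeping the first assignment is a simplification; mates that disagree would need an explicit rule.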

