I am currently working on a comparative whole-genome analysis of Anaplasma marginale using paired-end sequencing data. Because the bacterium is intracellular, I performed host decontamination using Bowtie2 against a ruminant genome index. After filtering, the number of reads that align to the A. marginale reference genome is quite low: I got 30,605 paired reads and 37,268 singletons (one mate aligned, the other did not).
My paired-end data therefore contains a large number of singleton reads alongside a relatively small number of properly paired reads. I plan to proceed with downstream steps such as genome assembly, annotation (e.g., with Prokka), and pangenome analysis using tools like Roary. Would including these singleton reads be beneficial, or would it introduce biases or errors? Are there known caveats or best practices when incorporating singleton reads into a comparative analysis of bacterial genomes?
Any insights or references would be highly appreciated. I want to make the most of the data I have while maintaining methodological soundness.
While that desire is understandable, mixing single and paired-end reads is not something all programs support. Make sure the assembler you intend to use can accept a mixed input such as this.
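As one example, SPAdes accepts paired and unpaired reads in the same run through its -1/-2 and -s options. The sketch below assumes SPAdes is the assembler you pick; the file names and output directory are placeholders I made up, not something from your data.

```shell
# Placeholder file names (assumptions): replace with your own files.
R1=aligned_R1.fastq.gz            # forward reads of the properly paired set
R2=aligned_R2.fastq.gz            # reverse reads of the properly paired set
SINGLES=aligned_singletons.fastq.gz   # reads whose mate did not align
OUTDIR=spades_mixed

# Run only if SPAdes is installed and the input files exist.
if command -v spades.py >/dev/null 2>&1 \
   && [ -f "$R1" ] && [ -f "$R2" ] && [ -f "$SINGLES" ]; then
    # -1/-2 give the paired library, -s adds the unpaired reads, -o sets the output directory.
    spades.py -1 "$R1" -2 "$R2" -s "$SINGLES" -o "$OUTDIR"
else
    echo "Skipping: spades.py or an input file is missing" >&2
fi
```

It may be worth assembling once with and once without the singletons and comparing the contigs, since with so few reads the difference could go either way.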
My suggestion is also to try to fish out the reads that match the published Anaplasma marginale genome https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000020305.1/ and then use them for the assembly. You can use bbduk.sh from the BBMap suite in filter mode to do this.
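A rough sketch of what that filtering step could look like is below. The file names are placeholders (the reference FASTA is whatever you download from the NCBI page above), and k=31 is just a common starting value, not a tuned setting.

```shell
# Placeholder file names (assumptions): replace with your own files.
R1=sample_R1.fastq.gz                 # raw or host-decontaminated forward reads
R2=sample_R2.fastq.gz                 # matching reverse reads
REF=anaplasma_marginale_ref.fna       # reference genome FASTA downloaded from NCBI
K=31                                  # k-mer length used for matching

# Run only if bbduk.sh is installed and the input files exist.
if command -v bbduk.sh >/dev/null 2>&1 \
   && [ -f "$R1" ] && [ -f "$R2" ] && [ -f "$REF" ]; then
    # outm/outm2 receive reads that share k-mers with the reference;
    # out/out2 receive everything else; stats summarises the hits.
    bbduk.sh in="$R1" in2="$R2" ref="$REF" k="$K" \
        outm=anaplasma_R1.fastq.gz outm2=anaplasma_R2.fastq.gz \
        out=unmatched_R1.fastq.gz out2=unmatched_R2.fastq.gz \
        stats=bbduk_stats.txt
else
    echo "Skipping: bbduk.sh or an input file is missing" >&2
fi
```

As far as I know, bbduk.sh keeps pairs together by default, so a pair goes to outm/outm2 when either mate matches the reference. If that holds for your version, this route should leave you with far fewer orphaned singletons than filtering on alignment; check the bbduk.sh help text to confirm.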