Is it appropriate to include singleton reads with paired-end datasets for low-coverage Anaplasma marginale comparative genome analysis?
14 days ago

I am currently working on a comparative whole-genome analysis of Anaplasma marginale using paired-end sequencing data. Because the bacterium is intracellular, I performed host decontamination using Bowtie2 against a ruminant genome index. After filtering, the number of reads that align to the A. marginale reference genome is quite low: 30,605 paired reads and 37,268 singletons (one mate aligned, the other did not).

My paired-end data therefore contains a large number of singleton reads alongside a relatively small number of properly paired reads. I plan to proceed with downstream steps such as genome assembly, annotation (e.g., with Prokka), and pangenome analysis using tools like Roary. Would including these singleton reads be beneficial, or would it introduce biases or errors? Are there known caveats or best practices when incorporating singleton reads for bacterial comparative genome analysis?

Any insights or references would be highly appreciated. I want to make the most of the data I have while maintaining methodological soundness.

assembly analysis coverage genome singletons comparative • 696 views

I want to make the most of the data

While that desire is understandable, mixing single and paired-end reads is not something all programs support. Make sure the assembler you intend to use can accept a mixed input such as this.

My suggestion is also to try to fish out the reads that map to the published Anaplasma marginale genome https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000020305.1/ and then use them for the assembly. You can use bbduk.sh from the BBMap suite in filter mode to do this.

13 days ago
Mensur Dlakic ★ 29k

Most people will tell you to ditch the single reads - GenoMax kind of already did. That's because in most cases there are plenty of paired reads available. A quick calculation with your paired reads, assuming standard Illumina sequencing, gives just over 9 million bases. For a genome of roughly 1.2 Mb that's not even 10x coverage, so it is unlikely that you will create a decent assembly from it. The same goes even if you add the single reads, which I recommend you do. There are programs (SPAdes, MEGAHIT) that will work with both paired and single reads, so that doesn't strike me as a problem. The bigger problem is that you would have < 15x coverage even with the single reads.
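
The arithmetic behind those coverage figures can be sketched as below. This is only a back-of-the-envelope estimate: the 150 bp read length and the ~1.2 Mb genome size are assumptions (neither is stated in the question), and the function name is made up for illustration.

```python
# Rough coverage estimate: total sequenced bases divided by genome size.
# ASSUMPTIONS (not given in the thread): 150 bp reads, ~1.2 Mb genome.
READ_LENGTH = 150          # assumed Illumina read length in bp
GENOME_SIZE = 1_200_000    # assumed approximate A. marginale genome size in bp


def estimate_coverage(n_pairs, n_singletons=0,
                      read_length=READ_LENGTH, genome_size=GENOME_SIZE):
    """Return approximate fold coverage for a mix of read pairs and singletons."""
    # Each pair contributes two reads, each singleton contributes one.
    total_bases = (2 * n_pairs + n_singletons) * read_length
    return total_bases / genome_size


paired_only = estimate_coverage(30_605)
with_singletons = estimate_coverage(30_605, 37_268)

print(f"paired reads only:        {paired_only:.1f}x")
print(f"paired reads + singletons: {with_singletons:.1f}x")
```

Under these assumptions the paired reads alone land below 10x and the combined set stays below 15x; shorter reads or quality trimming would lower both numbers further.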

What I am about to suggest is a highly unpopular option, and just about nobody else will ever recommend it on this forum. My suggestion is to assemble everything without removing any reads. From the total assembly it should be easy to separate the microbe contigs from the host contigs using tetranucleotide frequencies. It may not give you any better results and the assembly will certainly take longer, but from where I stand you have nothing to lose. You would definitely be complicating things unnecessarily for the assembler if the two organisms in question were even remotely close, but that's not the case here.
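
To illustrate what separating contigs by tetranucleotide frequencies involves, here is a minimal sketch that turns each contig into a 256-value frequency profile and measures how far apart two profiles are. The function names and the use of plain Euclidean distance are illustrative assumptions, not a prescribed method; a real analysis would typically also merge reverse-complement 4-mers and cluster the profiles of all contigs.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 possible 4-mers over the unambiguous DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_profile(sequence):
    """Return normalized tetranucleotide frequencies (length 256) for a contig."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Only count 4-mers made of A/C/G/T; windows containing N are ignored.
    total = sum(counts[k] for k in TETRAMERS)
    if total == 0:
        return [0.0] * len(TETRAMERS)
    return [counts[k] / total for k in TETRAMERS]


def profile_distance(profile_a, profile_b):
    """Euclidean distance between two tetranucleotide profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


# Toy usage with made-up sequences (real contigs should be kilobases long).
contig_1 = "ACGTACGTACGTTTGACCA"
contig_2 = "ACGTACGTACGTTTGACCG"
contig_3 = "TTTTTTTTAAAAAAAATTTT"
d_similar = profile_distance(tetra_profile(contig_1), tetra_profile(contig_2))
d_different = profile_distance(tetra_profile(contig_1), tetra_profile(contig_3))
print(d_similar < d_different)
```

Contigs from the same genome tend to have similar profiles, so bacterial and host contigs should fall into distinct groups when the organisms are as distant as they are here. Profiles from very short contigs are noisy, so a minimum contig length cutoff is advisable.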
