Hello everyone!
I have a problem with my work on contigs. I have searched the literature but have not found anything useful yet.
First, total DNA was isolated from a soil sample, then mechanically fragmented (sonication) and sequenced (Illumina). The raw data was then filtered using fastp and assembled de novo using Megahit.
My question is: what is the relationship between the number of contigs and the number of DNA sequences present in the sample before sequencing? I assume the two numbers are not the same (e.g. because of DNA damaged during fragmentation, assembly problems, sequence similarity, etc.). Is it even possible to predict such information? I also feel that I am losing important information about the relationships between bacterial taxa.
I have also run taxonomic classification (Kraken2) on both the raw reads and the contigs, and the results are very different!
Thank you in advance for your help.
I am not sure what you mean by "relationship", but if you are asking whether one can predict the number of contigs an assembly will produce, the answer is no. Assembly depends on the quality and complexity of the libraries. For a metagenome sample, there is no way to know the ground truth.
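Since the contig count cannot be predicted in advance, about the best you can do is describe the assembly after the fact. Below is a minimal sketch in Python (not part of MEGAHIT or any other tool) that counts contigs and reports total length, longest contig and N50 from FASTA-formatted lines. The function names are made up for illustration, and the example records are invented, not real data.

```python
from typing import Dict, Iterable, List


def contig_lengths(fasta_lines: Iterable[str]) -> List[int]:
    """Return the length of every record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read; None before the first header
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header starts a new record; store the previous one first.
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that contigs of length >= L hold at least half of all bases."""
    if not lengths:
        return 0
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


def summarize(lengths: List[int]) -> Dict[str, int]:
    """Collect a few descriptive statistics for an assembly."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Invented toy records, only to show the call pattern.
example = [">c1", "ACGT", "ACGT", ">c2", "AC", ">c3", "ACGTAC"]
stats = summarize(contig_lengths(example))
print(stats)
```

For a real assembly you would pass an open file handle instead of the list, e.g. the contig FASTA written by MEGAHIT (adjust the path to your own output directory). These numbers describe the assembly only; they say nothing about how many distinct DNA molecules or genomes were in the soil sample.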