I am trying to error-correct shotgun reads so that they assemble better. Because of memory limits, I can't process the whole read set at once. I was wondering about breaking it into smaller subsets, but I don't know whether that would create other problems.
For some known species in my reads, I have already obtained good contigs at 100% identity. Can I use BBDuk, for example, against those references to pull out the reads that match a specific species, error-correct them separately, and then concatenate everything back together for de novo assembly? And then do the same thing with Kraken2, taking everything it identifies as, for example, "bacteria" or some other group?
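To make the splitting step concrete, here is a minimal sketch of how the BBDuk call could be assembled. The file names and the helper name bbduk_split_command are made up for illustration, and k=31 is only an example value. The point it shows is that writing both the matching (outm) and non-matching (outu) pairs sends every pair to exactly one output, so concatenating the outputs later gives back the full read set.

```python
def bbduk_split_command(r1, r2, ref, prefix, k=31):
    """Build a bbduk.sh command that splits paired reads by k-mer match to `ref`.

    Hypothetical helper: output names are derived from `prefix` for illustration.
    """
    return [
        "bbduk.sh",
        f"in={r1}",
        f"in2={r2}",
        f"ref={ref}",
        # Pairs sharing a k-mer with the reference.
        f"outm={prefix}.matched_R1.fq.gz",
        f"outm2={prefix}.matched_R2.fq.gz",
        # Every other pair, so nothing is lost when the sets are recombined.
        f"outu={prefix}.unmatched_R1.fq.gz",
        f"outu2={prefix}.unmatched_R2.fq.gz",
        f"k={k}",
    ]


if __name__ == "__main__":
    # Example file names are placeholders, not files from this thread.
    print(" ".join(bbduk_split_command("R1.fq.gz", "R2.fq.gz", "speciesA.fa", "speciesA")))
```

The resulting list can be passed to subprocess.run(..., check=True) once the paths point at real files.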
My thinking is that BBMerge, which I believe works on each read pair independently, shouldn't have any effect on other reads. Tadpole, though, uses k-mers from all of the reads supplied for error correction. Would it need to be run on everything I eventually want to assemble de novo, or would it be fine to run it on subsets such as bacteria, fungi, etc.?
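A toy illustration of why the choice of subset could matter for any k-mer-count-based corrector. This is not Tadpole's actual algorithm, just a sketch under simple assumptions: no reverse complements, a fixed count threshold, and made-up reads. If reads from the same genome end up in different subsets, correct k-mers lose support and start to look like errors. If each subset keeps all reads from a given genome together, the counts are unchanged.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer in a list of reads (forward strand only, for simplicity)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def weak_kmers(reads, k, min_count):
    """Return k-mers seen fewer than min_count times, i.e. ones a corrector might distrust."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n < min_count}


# Four identical, error-free reads from one hypothetical genome.
reads = ["ACGTACGGA"] * 4

full = weak_kmers(reads, k=5, min_count=3)      # all reads together
half = weak_kmers(reads[:2], k=5, min_count=3)  # same genome split across two subsets

print(len(full), len(half))
```

With all four reads every k-mer is seen four times and none is flagged. With only two of them, every k-mer falls below the threshold even though no read contains an error.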
Is there a consensus on whether to run BBMerge or Tadpole first? I had previously seen the author recommend running BBMerge error correction before Tadpole, but I've also seen others talk about running Tadpole first.
Tadpole extend reads: I have read that this improves assembly with metaSPAdes. If I use the extend-reads option, is it recommended to run BBMerge afterwards to error-correct the potential overlap?
Thanks!!
Not exactly an answer to your question, but it still may be useful.
If you are going to assemble with SPAdes in the end, note that it runs its own read error correction, so you probably shouldn't correct the reads beforehand.
bfc is a small C program for error correction that does an excellent job, is multithreaded, and is unlikely to cause memory problems:
https://github.com/lh3/bfc
It normally works well for me and is very fast, and I can't remember ever having memory issues with it.
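For reference, a minimal sketch of how a bfc run could be wrapped. The -s (approximate genome size) and -t (threads) options are from the bfc README. The helper names, the default values, and the file names are assumptions for illustration, and for a metagenome the size passed to -s can only be a rough guess. bfc writes corrected reads to standard output, so the wrapper redirects that to a file.

```python
import subprocess


def bfc_command(reads, approx_size="100m", threads=8):
    """Build a bfc command line. The default size and thread count are placeholders."""
    return ["bfc", "-s", approx_size, "-t", str(threads), reads]


def run_bfc(reads, out_path, **kwargs):
    """Run bfc and write the corrected FASTQ from stdout to out_path (requires bfc on PATH)."""
    with open(out_path, "wb") as out:
        subprocess.run(bfc_command(reads, **kwargs), stdout=out, check=True)


if __name__ == "__main__":
    # Only prints the command; call run_bfc(...) once the input path is real.
    print(" ".join(bfc_command("reads.fq.gz", approx_size="500m", threads=16)))
```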