I am working with genome assembly data that I need to align to a reference genome. My main interest is in the unmapped reads: I want to figure out what those sequences come from (potentially environmental DNA).
I've only ever worked with exome sequencing, and only for very specific genes, so I've never done a whole-genome assembly analysis. I am familiar with BWA-MEM, FastQC, Trimmomatic, and SAMtools. What are the best practices for cleaning up Illumina sequencing data, and are there established protocols for working with unmapped reads in a metagenomic analysis? Can you recommend any good papers?
I've run FastQC on my reads and found some duplication, but I cannot find a way to remove the duplicates using Trimmomatic. I also don't know whether I need to remove the duplicate reads at all, as I've seen some posts here suggesting that I may end up losing valuable information. That is why I'm interested in best practices that are backed by evidence-based testing.
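To make the unmapped-read step concrete, here is a minimal sketch of what I mean by "unmapped", assuming the alignments are available as plain-text SAM lines. It keeps records whose FLAG field has bit 0x4 (segment unmapped) set, which is the same selection `samtools view -f 4` makes. The function names and the example records are hypothetical and only for illustration.

```python
# Minimal sketch: select unmapped records from SAM-formatted text lines.
# Assumption: input is uncompressed SAM text (not BAM/CRAM).

UNMAPPED = 0x4  # SAM FLAG bit: this segment is unmapped


def is_unmapped(sam_line):
    """Return True if the SAM alignment line has the 0x4 FLAG bit set."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])  # FLAG is the second mandatory SAM column
    return bool(flag & UNMAPPED)


def unmapped_records(lines):
    """Yield alignment lines that are unmapped, skipping headers and blanks."""
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # header lines start with '@'
        if is_unmapped(line):
            yield line


if __name__ == "__main__":
    # Hypothetical records: read1 is unmapped (FLAG 77), read2 is mapped (FLAG 99).
    example = [
        "@HD\tVN:1.6\tSO:unsorted",
        "read1\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
        "read2\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    ]
    for record in unmapped_records(example):
        print(record.split("\t")[0])
```

For paired-end data, whether to also keep the mapped mates of unmapped reads (FLAG bit 0x8) is a separate decision that this sketch does not make.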