Papers/posts on best practices for dealing with Illumina reads?
22 months ago
DNAngel ▴ 240

Hi All,

I am working with genome assembly data that I need to align to a reference genome, but will specifically be working with the unmapped reads to figure out what those sequences might be mapping to (i.e. environmental DNA potential here).

I've only ever worked with exome sequencing, and only for very specific genes, so I've never done a whole-genome assembly analysis. I am familiar with BWA-MEM, FastQC, Trimmomatic, and SAMtools. I was wondering what the best practices are for cleaning up Illumina sequencing data, protocols for working with unmapped reads for metagenomic analysis, etc. Any good papers you can recommend?

I've finished running FastQC on my reads and found some duplication, but I cannot find a way to remove the duplicates using Trimmomatic. I also don't know whether I need to remove them at all, as I've seen some posts here suggesting that I may end up losing valuable information. So I got interested in best practices backed by evidence-based testing.

Cheers!

I am working with genome assembly data that I need to align to a reference genome, but will specifically be working with the unmapped reads to figure out what those sequences might be mapping to (i.e. environmental DNA potential here).

What exactly are you trying to do? Is this metagenomic data? Based on the rest of your post, the data still appears to be in FASTQ read format, so where is the "assembly" coming from? I would not worry about any duplicates at this point. Just clean the reads to remove any adapter sequences and that should be it. For that you can use bbduk.sh (GUIDE), fastp or Trimmomatic.
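To make the adapter-cleaning step concrete, here is a minimal sketch that builds a paired-end fastp command from Python. The file names and the `fastp_command` helper are made up for illustration; the flags (`-i/-I`, `-o/-O`, `--detect_adapter_for_pe`, `-w`) are standard fastp options, but check `fastp --help` for your installed version before relying on them.

```python
import shlex


def fastp_command(r1, r2, out1, out2, threads=4):
    """Build a fastp command line for paired-end adapter trimming.

    Hypothetical helper: it only assembles the argument list and does not run fastp.
    """
    return [
        "fastp",
        "-i", r1, "-I", r2,          # input read 1 / read 2
        "-o", out1, "-O", out2,      # cleaned output read 1 / read 2
        "--detect_adapter_for_pe",   # auto-detect adapters for paired-end data
        "-w", str(threads),          # worker threads
    ]


# Example file names are placeholders, not real data.
cmd = fastp_command(
    "sample_R1.fastq.gz", "sample_R2.fastq.gz",
    "clean_R1.fastq.gz", "clean_R2.fastq.gz",
)
print(shlex.join(cmd))
```

If fastp is installed, the same list can be passed to `subprocess.run(cmd, check=True)`. Note that nothing here removes duplicates, in line with the advice above.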

Right, I am only at the cleaning stage right now because the servers were being slow, but I have been looking at protocols and getting ready to come up with a streamlined method for working with metagenomic data. My goal is to assemble the unmapped reads and hopefully detect some cool species that they might belong to. But I am a new hire, so I really want to ensure that I am doing everything correctly from the very beginning. My methods for exome sequence data, working with just coding sequences, were quite different. For example, I never had to work with (or even keep) the unmapped reads lol. So I don't even know how to go about that now ;D
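On keeping the unmapped reads: after alignment, each SAM/BAM record carries a FLAG field, and bits 0x4 (read unmapped) and 0x8 (mate unmapped) tell you which pairs failed to map. The sketch below shows that logic in plain Python on SAM text. The function names are hypothetical, and selecting only pairs where both mates are unmapped is one possible choice, not the only one.

```python
# SAM FLAG bits, as defined in the SAM specification.
READ_UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def pair_fully_unmapped(flag):
    """True if both the read and its mate are unmapped (primary record only)."""
    both = READ_UNMAPPED | MATE_UNMAPPED
    return (flag & both) == both and not flag & (SECONDARY | SUPPLEMENTARY)


def unmapped_records(sam_lines):
    """Yield (name, sequence, quality) for records from fully unmapped pairs."""
    for line in sam_lines:
        if line.startswith("@"):      # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if pair_fully_unmapped(int(fields[1])):
            yield fields[0], fields[9], fields[10]


# Tiny made-up SAM snippet: one mapped read and one unmapped read.
example = [
    "@HD\tVN:1.6\n",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tFFFF\n",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tTTGA\tFFFF\n",
]
print(list(unmapped_records(example)))
```

In practice you would let SAMtools do this filtering rather than parsing SAM yourself; requiring FLAG bits 4 and 8 together corresponds to the `-f 12` filter in `samtools view` and `samtools fastq`.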

Are you interested in metagenomic eukaryotes or prokaryotes? For prokaryotes, you can use SPAdes (LINK) with the `--meta` option to assemble.
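For reference, a metaSPAdes run on paired-end reads looks roughly like the command built below. This is only a sketch: the input names, output directory, thread count, and the `metaspades_command` helper are placeholders, and it assumes the unmapped reads have already been written out as paired FASTQ files.

```python
import shlex


def metaspades_command(r1, r2, outdir, threads=8):
    """Build a SPAdes command line in metagenomic mode (hypothetical helper)."""
    return [
        "spades.py",
        "--meta",            # metagenomic assembly mode (metaSPAdes)
        "-1", r1,            # forward reads
        "-2", r2,            # reverse reads
        "-o", outdir,        # output directory for contigs and scaffolds
        "-t", str(threads),  # number of threads
    ]


# Placeholder file names for the unmapped read pairs.
spades_cmd = metaspades_command(
    "unmapped_R1.fastq.gz", "unmapped_R2.fastq.gz", "metaspades_out"
)
print(shlex.join(spades_cmd))
```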

It would be for both - I'm expecting to detect some bacteria (mostly), viruses, and other eukaryotes that would be present in the environment. I have heard of SPAdes before, never used it though. I hope it's user friendly...

I think this is close to what you are looking for? https://academic.oup.com/bioinformatics/article/32/7/961/2240308

Hmmm I am not sure this will be applicable - I will probably have to do some de novo assembly since I will be using all the unmapped reads to then create their own contigs and figure out what species they might be. This was metagenomic DNA so we are expecting some other non-target species to have been sequenced.