I am trying to build a pipeline as a bash script for whole-genome sequencing data (in order to obtain whole/core genome MLST). It is not metagenomic.
But I am a little confused about the data trimming steps. I have seen a lot of contradictory advice.
Originally my workflow was: FastQC > ConFindr > Kraken2 > Trimmomatic (adapter + quality trimming) > FastQC > SPAdes. But I have sometimes seen people (mainly on Biostars) normalizing reads, though not always, with bbnorm or Trinity for example.
So, first, I would like to know whether read normalization should always be applied. My paired-end files always have the same number of reads in R1 and R2. The same is true after Trimmomatic, when I keep the paired R1 and R2 files after trimming.
If yes, should it come before or after the Trimmomatic step (adapter removal + quality trimming)?
And finally, if it is necessary, can bbnorm be used in a bash script?
Thank you in advance for your answers,
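On the last point, bbnorm ships as a shell wrapper (`bbnorm.sh`), so it can be called from a bash script like any other command. The following is only a minimal sketch, assuming normalization runs on the paired output of Trimmomatic; the file names, the `run` and `normalize_pair` helpers, and the `target=100 min=5` values are illustrative assumptions, not recommendations from this thread. By default it only prints the command (dry run); set `DRY_RUN=0` to execute it.

```bash
#!/usr/bin/env bash
set -eu

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Normalize one pair of trimmed read files with bbnorm.sh.
# Arguments: R1 file, R2 file, target depth (assumed default 100),
# minimum depth below which reads are discarded (assumed default 5).
normalize_pair() {
  local r1="$1" r2="$2" target="${3:-100}" min="${4:-5}"
  run bbnorm.sh in="$r1" in2="$r2" \
    out="${r1%.fastq.gz}.norm.fastq.gz" \
    out2="${r2%.fastq.gz}.norm.fastq.gz" \
    target="$target" min="$min"
}

# Hypothetical file names: replace with your own paired Trimmomatic output.
normalize_pair sample_R1.trimmed.fastq.gz sample_R2.trimmed.fastq.gz
```

Because both mates are passed together (`in=`/`in2=`), the pairs stay in sync in the output, so R1 and R2 keep the same number of reads.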
Why do you want to use bbnorm? SPAdes has an error-correction parameter; you can normalize your reads using that parameter. If you are talking about unequal numbers of paired-end reads, you can use makepair to equalize them. Also, Trinity is a de novo assembly tool for RNA-seq data. I do not think it is used for normalization of WGS reads.
Thank you for your reply! For SPAdes I just use the --careful option (for mismatches) with the R1 and R2 read files as input. For the error correction, do you mean the --isolate option? In the SPAdes tutorial the author says it is not compatible with the --careful option. So I don't know whether it would be better to use it in place of --careful.
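To make the two alternatives concrete: since `--careful` and `--isolate` cannot be combined, a script has to pick one per run. Below is a minimal sketch of how that choice could be wrapped in bash; the `run` and `assemble` helpers, the mode names, and the file and directory names are hypothetical, and only the `spades.py` flags `--careful`, `--isolate`, `-1`, `-2` and `-o` are taken from SPAdes itself. It prints the command by default; set `DRY_RUN=0` to execute it.

```bash
#!/usr/bin/env bash
set -eu

# Print the command in dry-run mode (the default), otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Assemble one pair of read files with exactly one of the two modes.
# Arguments: mode (careful|isolate), R1 file, R2 file, output directory.
assemble() {
  local mode="$1" r1="$2" r2="$3" outdir="$4" flag
  case "$mode" in
    careful) flag="--careful" ;;  # mismatch correction after assembly
    isolate) flag="--isolate" ;;  # mode for high-coverage isolate data
    *) echo "unknown mode: $mode (use careful or isolate)" >&2; return 1 ;;
  esac
  run spades.py "$flag" -1 "$r1" -2 "$r2" -o "$outdir"
}

# Hypothetical file names: replace with your own trimmed reads.
assemble careful sample_R1.trimmed.fastq.gz sample_R2.trimmed.fastq.gz spades_careful
```

Running both modes on the same sample into separate output directories and comparing the assemblies (for example, contig count and N50) is one way to decide which suits the data.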
Have you seen
If you are sequencing the entire genome, then you are not doing MLST.