Hi there!
I'm developing a pipeline for SNP calling between different genomes of the same species. However, I've found conflicting opinions about the trimming step.
My logic tells me that hard trimming is preferable because it should reduce false-positive SNP calls. However, many pipelines suggest that soft trimming, or no trimming at all, is better:
- https://www.nature.com/articles/hdy2016102
- https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/data-preprocessing/
- https://www.melbournebioinformatics.org.au/tutorials/tutorials/variant_calling_gatk1/files/VariantCallingUsingGATK4.pdf
If trimming is recommended, which type is better suited: soft or hard trimming? I use Trimmomatic with these options:
java -jar /mnt/home/soft/trimmomatic/programs/x86_64/0.39/trimmomatic-0.39.jar PE -threads 64 -phred33 $seq1 $seq2 $seq_tfp $seq_tfu $seq_trp $seq_tru ILLUMINACLIP:0_index/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
If hard trimming is necessary, I would add the HEADCROP:15 and/or TAILCROP:15 options.
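To make the comparison concrete, here is a minimal sketch of how I would assemble the two step lists. The function name `build_steps` and the variable `TRIM_MODE` are hypothetical, made up for this illustration. I am assuming HEADCROP should sit right after ILLUMINACLIP so that it removes the original first 15 bases before LEADING changes the read start; that placement is my guess, not a documented recommendation. The script only prints the step list and does not run Trimmomatic.

```shell
#!/usr/bin/env bash
# Sketch only: build the Trimmomatic step list for "soft" vs "hard" trimming.
# build_steps and TRIM_MODE are hypothetical names used for illustration.

build_steps() {
    local mode="$1"
    # Adapter clipping first, with the same settings as in my command above.
    local steps="ILLUMINACLIP:0_index/adapters/TruSeq3-PE.fa:2:30:10"
    if [ "$mode" = "hard" ]; then
        # Fixed-length crop of the first 15 bases, regardless of quality.
        # Assumption: placed before LEADING so it acts on the original read start.
        steps="$steps HEADCROP:15"
    fi
    # Quality-based ("soft") steps; MINLEN last so it filters on the final length.
    steps="$steps LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36"
    echo "$steps"
}

TRIM_MODE="${TRIM_MODE:-soft}"
echo "Trimmomatic steps ($TRIM_MODE): $(build_steps "$TRIM_MODE")"
```

Saved under a name of your choice and run with `TRIM_MODE=hard`, it prints the step list including HEADCROP:15. That list would replace the steps at the end of the command above.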
For context, my sequences are raw Illumina DNA reads. They show variation in the first 10-12 nucleotides, contain some adapter sequence, and are generally of good quality.
Thank you so much!