Good morning everyone,
This is my first question on Biostars, so please bear with me if it is not well structured.
I am currently processing raw Illumina paired-end reads in order to assemble 96 chloroplast genomes.
Whole-genome shotgun libraries were prepared using the NEBNext Ultra II FS DNA Library Prep Kit for Illumina combined with NEBNext Dual Index Primers Set II, following the manufacturer's protocol, with an input of 200 ng of DNA and a fragmentation time of 25 min, aiming to obtain fragments ranging from 150 to 350 bp.
We then sent the 96 libraries to be sequenced on an Illumina HiSeq X (150x150 bp) and received confirmation that the fragment sizes of the pooled library were within the expected range.
I then checked the quality of the raw reads with FastQC and everything seems alright, except for the presence of Illumina adapters. Here is a snapshot of the FastQC report for the forward reads of one sample:
After this I decided to run Trimmomatic and filter the reads by average quality, keeping only reads with an average quality of at least 20. Here is the command I used:
java -jar /home/bland/Programs/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 -trimlog Trimlog L01_F.fq L01_R.fq L01_PF9.fq L01_UF9.fq L01_PR9.fq L01_UR9.fq ILLUMINACLIP:adapterIL.fa:2:30:10 AVGQUAL:20
For the command above I used a custom adapter file, but I also ran the same command with the adapter file TruSeq3-PE-2.fa, which is provided with Trimmomatic, and the results were the same.
After running the command, most of the reads are retained, but around 80% are visibly shortened:
This means that if I add a MINLEN filter to the Trimmomatic command, say to retain only reads that are 130-150 bp long, I will keep only about 20% of my initial raw reads.
Example of command:
java -jar /home/bland/Programs/Trimmomatic-0.38/trimmomatic-0.38.jar PE -phred33 -trimlog Trimlog L01_F.fq L01_R.fq L01_PF9.fq L01_UF9.fq L01_PR9.fq L01_UR9.fq ILLUMINACLIP:adapterIL.fa:2:30:10 AVGQUAL:20 MINLEN:130
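A quick way to check how many reads a given MINLEN threshold would keep, without rerunning Trimmomatic, is to count read lengths directly in the trimmed output. The following is only a minimal sketch: it assumes a standard FASTQ file with exactly four lines per record, and the function name `frac_min_len` is made up for illustration.

```shell
# Count reads at or above a minimum length in a 4-line-per-record FASTQ file.
# Usage (hypothetical helper): frac_min_len FILE MINLEN
# Prints "kept/total", e.g. for the paired forward output: frac_min_len L01_PF9.fq 130
frac_min_len() {
    awk -v min="$2" '
        # Line 2 of every 4-line record holds the sequence
        NR % 4 == 2 { total++; if (length($0) >= min) kept++ }
        END { printf "%d/%d\n", kept, total }
    ' "$1"
}
```

Running it with a few different thresholds on the output of the AVGQUAL-only command should show roughly where the read lengths pile up.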
I isolated the short reads left after trimming with AVGQUAL as the only filter. There seems to be some contamination from Illumina primers within the reads: in some of them, additional bases appear both before and after the primer sequence, although this is not the case for every read in this group.
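To see where the adapter sits inside the reads, one option is to tabulate the position at which an adapter motif first appears. This is a rough sketch under several assumptions: the helper name `adapter_positions` is hypothetical, the file is four-line FASTQ, and the motif `AGATCGGAAGAGC` in the usage comment is the common start of the standard Illumina TruSeq-style adapter sequences, which may need to be replaced with the sequences in adapterIL.fa. It only finds exact matches, so adapters containing sequencing errors are missed.

```shell
# Tabulate the 1-based position where MOTIF first occurs in each read.
# Usage (hypothetical helper): adapter_positions FILE MOTIF
# e.g. on the raw forward reads: adapter_positions L01_F.fq AGATCGGAAGAGC
# Prints one "position count" pair per line, sorted by position.
adapter_positions() {
    awk -v motif="$2" '
        # index() returns 0 when the motif is absent, otherwise its start position
        NR % 4 == 2 { p = index($0, motif); if (p > 0) pos[p]++ }
        END { for (p in pos) print p, pos[p] }
    ' "$1" | sort -n
}
```

If the adapter starts at position p, the first p-1 bases are the insert, so many hits well before position 150 would be consistent with inserts shorter than the read length.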
Has anyone had a similar problem, and what could be the cause?
I am not a biologist by training, and this is the first time I have worked with this kind of data. I think I am missing something that could be happening during the sequencing process. Could primer dimers have formed, or could the insert size have been shorter than 150 bp?
Any help is appreciated, Beatrice