Hey, I have some paired-end RNA seq samples with which I want to perform DE analysis (aligning to the human genome). There are multiple samples but for simplicity, I will just take one of them. First, I performed FASTQC on the samples, the results of which are attached below.
Now, there are a total of 26912459 read pairs and all of them have a length of 161 bp (sequence length distribution of R2 not given since it's the same as R1). Since there were adapters (Illumina Universal Adapter) present, I decided to trim them using Trimmomatic.
Initially, I used the following code
trimmomatic PE -threads 16 sample_R1.fastq.gz sample_R2.fastq.gz -baseout sample_trim.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 MINLEN:30
This gave 4 files as output 2 paired files and 2 unpaired files. I ran FASTQC on all 4 files, the results of which are shown below.
Now, trimmomatic worked great and removed the adapters completely. Also, the sequences have an equal length of 161. But, now the number of sequences in the paired files is 13834556, which is ~ 51.5% of the original reads. The remaining reads seem to have gone to the file 'sample_trim_1up' which is the forward unpaired file containing the orphan reads from R1. FASTQC statistics of this file are shown below.
This file contains 13075592 reads which are ~48.6% of the original reads. Also, the sequence length distribution is extremely wide ( base statistics says it contains reads 30-153 bp in length). Also, adapters are completely removed from this file as well. Of note, the 'sample_trim_2UP' file is empty.
Since I was losing almost half of my reads, I decided to use the parameter keepBothReads as TRUE which (according to what I understood) stops the default behavior of removing palindromic R2 reads and hence orphaning the cognate R1 reads. The new code is as follows
trimmomatic PE -threads 16 sample_R1.fastq.gz sample_R2.fastq.gz -baseout sample_trimnew.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:8:TRUE MINLEN:30
I took parameter minAdapterlength as default 8 which the manual says is conservative and can be taken as low as 1. Now, once again I get 4 files 2 paired and 2 unpaired. This time, both unpaired files are empty. The fastQC for sample_trimnew_1P is shown below.
This file contains 26910148 reads which are 99.99% of the original reads (the output said 2311 reads were dropped). Here, again the adapters are removed and I get back almost all my reads but the sequence length distribution is varying ( base statistics say it contains 30-161 bp reads).
Here are my questions:
Which of the paired trimmomatic output files would be suitable for downstream alignment using STAR? I hope the 'trimnew' files using keepBothReads as TRUE can be used since I wont be losing any reads.
Why does trimming change of read length distribution of an initially uniform library?
Will a trimmed library of varying sizes (30-161) be suitable for STAR alignment? In particular, STAR uses a parameter called 'sjdbOverhang' during genomeGenerate ( to make the index) which should be 'read length - 1'. For an uneven library, what should this be?
Sorry for the overly long post, thanks in advance