I'm new to RNA-seq analysis and want to make sure I'm following best practice for pre-processing my whole-transcriptome reads. I've been reading up on the pros and cons of trimming RNA-seq data. Some sources recommend quality trimming at thresholds such as Q20-30, while others call this type of trimming "aggressive" and say it can significantly affect mapping and therefore the downstream analysis. Others recommend that if aggressive trimming is carried out, a minimum read length should be enforced (e.g. L36), and still others recommend gentle or no trimming for RNA-seq data. I understand there is no "best recipe" for trimming, as it depends on the data itself, so below I've given a general outline of the FastQC output for my 9 samples in the hope of getting some recommendations or pointers.
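For context on what those thresholds mean: a Phred score Q corresponds to a base-call error probability of 10^(-Q/10), so Q20 is a 1-in-100 chance of a wrong base and Q30 is 1 in 1,000. A minimal sketch of that conversion (the function name is just something I made up for illustration):

```python
def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


# Compare the thresholds mentioned above with the Q28 "green zone" boundary in FastQC.
for q in (20, 28, 30):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4f}")
```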
My R1 sequences have more warnings and fails than my R2 sequences. R1 generally fails or has warnings on per base sequence content, sequence duplication levels, overrepresented sequences and Kmer Content. There's the typical Illumina decline in read quality at the 3' end, and a dip in quality at the 5' end due to random hexamer priming (I'm guessing). In one sample I do have some contamination that has come up as an overrepresented sequence which, when BLASTed, brings up "Synthetic construct RNA control ERCC-00004"; that obviously needs to be removed. All other overrepresented sequences appear to be transcription-related. All of my samples fail on Kmer Content and either fail or give a warning on per base sequence content. R2 generally has the same issues as R1, except there are no overrepresented sequences and the per base sequence quality tends to remain in the "green zone", with Phred scores of 28+, despite the 3' and 5' dips. All sequences pass on "Adapter Content", but I can see on the chart that a small amount (about 2%) of Illumina universal adapter sequence is present from about 50 bp onwards.
Just to see what would happen without trimming, I mapped my reads to a reference genome using TopHat and got some pretty poor results back (70.1% overall read mapping rate and 50.2% concordant pair alignment rate). I understand rRNA contamination can cause problems, so I used SeqMonk to check the TopHat output and it didn't detect any rRNA sequences. I understand that running "CollectRnaSeqMetrics" from Picard tools would also be a good idea, but I'm struggling to find a way to convert the genome annotation file to the required refFlat format, as I'm working with a non-UCSC genome (A. thaliana).
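From what I've read, one possible route to refFlat for a non-UCSC genome is to convert the GTF to genePred extended format with UCSC's gtfToGenePred (using its -genePredExt option) and then move the gene name column to the front. I haven't confirmed this works for my annotation, so please correct me if it's wrong. Below is a sketch of that second step only, assuming the usual 15-column genePredExt layout in which column 12 (name2) holds the gene name. The function name and the example record are made up for illustration and are not real annotation:

```python
def genepredext_to_refflat(line: str) -> str:
    """Reorder one genePredExt record into refFlat column order.

    refFlat is the gene name followed by the first 10 genePred columns:
    name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd,
    exonCount, exonStarts, exonEnds.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 12:
        raise ValueError("expected at least 12 tab-separated genePredExt columns")
    gene_name = fields[11]  # name2 column (assumed to hold the gene identifier)
    return "\t".join([gene_name] + fields[:10])


# Made-up record, for illustration only.
example = "\t".join([
    "tx1", "Chr1", "+", "100", "900", "150", "850", "2",
    "100,600,", "300,900,", "0", "geneA", "cmpl", "cmpl", "0,1,",
])
print(genepredext_to_refflat(example))
```

Applied line by line to the gtfToGenePred output, this should give an 11-column file of the kind CollectRnaSeqMetrics expects as its refFlat input, if I've understood the formats correctly.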
Sequence info: Illumina HiSeq PE reads, 2 x 100 bp
I was thinking of using Trimmomatic for any trimming. I know I need to remove that contaminant sequence and the small amount of adaptor sequence but beyond that I'm a little unsure of the most appropriate course of action and would very much appreciate any advice.
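To make the question concrete, this is roughly the kind of gentle Trimmomatic paired-end call I had in mind, written as a small Python helper that only builds the command and does not run it. Everything specific here is an assumption on my part rather than a recommendation: the jar path, the file names, the TruSeq3-PE.fa adapter file (which may not match my library kit), and the SLIDINGWINDOW:4:15 and MINLEN:36 settings. It also does nothing about the ERCC contaminant.

```python
import shlex


def trimmomatic_pe_command(r1, r2, out_prefix,
                           jar="trimmomatic.jar",      # assumed jar location
                           adapters="TruSeq3-PE.fa",   # assumed adapter file
                           min_len=36):                # assumed minimum length
    """Build (but do not run) a Trimmomatic paired-end command as an argument list."""
    return [
        "java", "-jar", jar, "PE", "-phred33",
        r1, r2,
        # Paired and unpaired outputs for each mate.
        f"{out_prefix}_1P.fastq.gz", f"{out_prefix}_1U.fastq.gz",
        f"{out_prefix}_2P.fastq.gz", f"{out_prefix}_2U.fastq.gz",
        # Adapter clipping, then gentle sliding-window trimming, then length filter.
        f"ILLUMINACLIP:{adapters}:2:30:10",
        "SLIDINGWINDOW:4:15",
        f"MINLEN:{min_len}",
    ]


# Hypothetical file names, for illustration only.
cmd = trimmomatic_pe_command("sample_R1.fastq.gz", "sample_R2.fastq.gz", "sample_trimmed")
print(shlex.join(cmd))
```

If these steps or thresholds are unsuitable for RNA-seq data like mine, that's exactly the kind of correction I'm after.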
Thank you very much in advance,