After running FastQC, the results looked like this. How do I know whether my reads are good enough for the next stage (alignment)? Do I need to get every mark in the image green, and how should I use Trimmomatic to get good results?
I use this bash script:
# run trimmomatic to trim reads with poor quality
java -jar /usr/share/java/trimmomatic-0.39.jar SE -threads 4 -phred33 /home/tc/Project/Reads/ERR1880946.fastq.gz /home/tc/Project/trimm/ERR1880946_new.fq.gz ILLUMINACLIP:/usr/share/trimmomatic/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:10 SLIDINGWINDOW:4:15 MINLEN:33
echo "Trimmomatic finished running!"
And here is the result. Is my script right for my data or not?
What kind of sequencing library is this data from? Bulk RNA-seq? What protocol was used to generate the library? These details are important for knowing how to interpret your FastQC results.
For now I will assume single-end bulk RNA-seq as that seems the best guess from your trimmomatic command details. A couple of thoughts:
The pre-trimmomatic results aren't all that bad aside from the few cycles in which the quality score drops considerably.
I have rarely used LEADING or TRAILING when running trimmomatic. How about trying only ILLUMINACLIP:/usr/share/trimmomatic/TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:50, with -phred33 given before the input file, alongside the other options? This will mostly clean up the 3' end of the reads while keeping only the longer reads. Longer reads may map more accurately than shorter reads, and thus may offset the drop in quality at cycles 19-20 and 49-50.
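To see what a MINLEN cutoff would actually cost you, here is a minimal sketch that counts reads at or above a given length. It assumes a standard FASTQ with exactly four lines per record, and the function name count_reads_min_len is made up for illustration, not part of any tool.

```shell
# Count reads whose sequence length is >= the given minimum.
# Reads a FASTQ stream on stdin; assumes 4 lines per record (no wrapped sequences).
count_reads_min_len() {
    min_len="$1"
    # Line 2 of every 4-line record is the sequence.
    awk -v m="$min_len" 'NR % 4 == 2 && length($0) >= m { n++ } END { print n + 0 }'
}

# Example (path taken from the script above):
#   zcat /home/tc/Project/trimm/ERR1880946_new.fq.gz | count_reads_min_len 50
```

Running it with 33 and then 50 on the same trimmed file shows how many reads the stricter length filter would remove.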
How do I know whether my reads are good enough for the next stage (alignment), and do I need to get every mark in the image green?
No, the quality scores don't need to be completely in the green, particularly at the 3' end of the read, where quality "naturally" decreases.
Whether the results are good enough depends on the type of data you have and what analyses you want to do with it.
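Rather than eyeballing the coloured marks, you can list which FastQC modules were flagged. This is a rough sketch that assumes you have extracted summary.txt from the FastQC zip, and that it is tab-separated with the status in the first column and the module name in the second. The function name fastqc_flagged is hypothetical.

```shell
# Print every FastQC module whose status is not PASS (i.e. WARN or FAIL).
# Argument: path to a summary.txt extracted from the FastQC output zip.
fastqc_flagged() {
    awk -F '\t' '$1 != "PASS" { print $1 ": " $2 }' "$1"
}

# Example (zip and folder names are assumed from FastQC's usual naming):
#   unzip -p ERR1880946_fastqc.zip ERR1880946_fastqc/summary.txt > summary.txt
#   fastqc_flagged summary.txt
```

A WARN or FAIL is a prompt to look at that plot, not an automatic reason to discard the data; some modules commonly flag RNA-seq libraries.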
To my knowledge there is no single rule on trimming; it really depends on your requirements and questions. You don't give any additional context, so we cannot tell whether these parameters are appropriate. For example, are primers still attached, are these reads paired-end, and are the pairs meant to overlap?
Firstly, I recommend reading up on the error rates associated with Phred scores and understanding what each threshold means for your analyses. And whilst I could give you my recommendations here, I instead strongly suggest you first read papers with data similar to your study to see what QC they do, and use similar thresholds. Since these are NCBI accessions, what did the original study do?
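As a quick reference for what the thresholds mean: a Phred score Q corresponds to an error probability of 10^(-Q/10). Here is a small sketch of that conversion; the helper name phred_to_error is just for illustration.

```shell
# Convert a Phred quality score to the probability that the base call is wrong.
# P = 10^(-Q/10)
phred_to_error() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

phred_to_error 15   # 0.0316228, the window average used by SLIDINGWINDOW:4:15
phred_to_error 20   # 0.01
phred_to_error 30   # 0.001
```

So a window average of 15 tolerates roughly a 3% chance of error per base, while Q30 means about one error in a thousand.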