Hi everyone. When I first ran FastQC, I saw that the Phred scores in my data were not good, and the R2 scores in particular were terrible, so I ran fastp. Reading the fastp report, I noticed that about half of my reads had been discarded, and now I am worried about whether I used the right set of parameters. How will this affect my alignment and subsequent steps, and what will the impact on gene coverage be? Before filtering, read lengths ranged from 35-76 bp; after filtering the range is 50-76 bp. Note that I used the same fastp command for R1 and R2 of all samples. What should I do now: proceed further, or change my fastp parameters?
This is the fastp command I used:
fastp \
-i ${sample}_R1.fastq.gz \
-I ${sample}_R2.fastq.gz \
-o trimmed_fastq/${sample}_R1.trimmed.fastq.gz \
-O trimmed_fastq/${sample}_R2.trimmed.fastq.gz \
--adapter_sequence (provided) \
--adapter_sequence_r2 (a poly-G sequence) \
--detect_adapter_for_pe \
--correction \
--trim_poly_g \
--trim_poly_x \
--poly_g_min_len 10 \
--poly_x_min_len 10 \
--cut_front \
--cut_tail \
--cut_right \
--cut_window_size 5 \
--cut_mean_quality 20 \
--qualified_quality_phred 20 \
--unqualified_percent_limit 30 \
--n_base_limit 3 \
--length_required 50 \
--low_complexity_filter \
--complexity_threshold 30 \
--overrepresentation_analysis \
--thread 8 \
--html fastp_reports/${sample}_fastp.html \
--json fastp_reports/${sample}_fastp.json
If the data is of bad quality to begin with, it is not going to improve as a whole by passing it through a tool. You could play with the parameters and salvage some additional data, but you will lose the rest anyway.
If this data is not replaceable, then go on with what is left. If it is possible to redo the experiment, sometimes that is the best option.
Thanks for the reply. The experiment cannot be run again, so I need to proceed with this data. It would be helpful if you could give me some insight into the fastp command I used.
What kind of data is this? Changing default parameters for programs should be done based on an understanding of the characteristics of your data. Removing poly-Gs and so on is fine, but I am not sure why you are using all of the
cut_
parameters. As long as you have a good reference to align to (and you are not going to call SNPs), it may be fine to accept less-than-perfect Q scores, since the sequence may still be usable. If you are not sure why you are changing a particular parameter, stay with the program defaults (they should work for most data). You could post FastQC images of the before and after states if you want informed opinions.
I have RNA-seq data, and I am attaching the FastQC report images of my reads before and after running this fastp command.
I will refer you back to my comment above. Don't be fixated on the Q scores; RNA-seq data may still align fine and be usable. If there was an issue with the sequencing run (which led to the original drop in Q scores), then you should consult the sequencing center.
How many reads were there to begin with, and how many are left after the severe trimming?
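Those before/after counts are already recorded in the fastp JSON report, under summary.before_filtering.total_reads and summary.after_filtering.total_reads. A quick sketch of pulling them out; the mock report and its numbers below are made up for illustration and stand in for your real fastp_reports/${sample}_fastp.json:

```shell
# Mock fastp report (made-up numbers) standing in for a real
# fastp_reports/${sample}_fastp.json file.
cat > mock_fastp.json <<'JSON'
{"summary": {"before_filtering": {"total_reads": 20000000},
             "after_filtering":  {"total_reads": 10400000}}}
JSON

# Extract and compare the counts (field names assumed from fastp's JSON output).
python3 - mock_fastp.json <<'PY'
import json, sys

data = json.load(open(sys.argv[1]))
before = data["summary"]["before_filtering"]["total_reads"]
after = data["summary"]["after_filtering"]["total_reads"]
print(f"reads before filtering: {before}")
print(f"reads after filtering:  {after}")
print(f"retained: {100 * after / before:.1f}%")
PY
```

With the mock numbers above this prints "retained: 52.0%", which is roughly the situation you are describing (about half the reads lost).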
Hi. Concerning your read size: in your command you've specified to keep only reads with a minimum length of 50 bp via the option
--length_required 50
and that is quite strict, so if many of your reads end up shorter than 50 bp after trimming, they will be discarded (I'd suggest setting it to 20-30 bp). The --low_complexity_filter could also be unnecessary depending on the type of data you have and your main goal; you'd have to give more information for us to understand your situation. As for the
--cut_
options, that may be too much trimming: with those options your reads will lose a lot of bases and become short (less than 50 bp). So you need to know what was sequenced; it could be a gene panel, WES, WGS, etc.
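For comparison, a gentler run might look like the sketch below: fastp defaults plus paired-end adapter detection, poly-G trimming, and a relaxed length cutoff, with the cut_* sliding-window options dropped. This is only a suggestion to test side by side against your current output, not final parameters; the input/output paths mirror your original command and the "_relaxed" report names are placeholders.

```shell
# Sketch of a less aggressive fastp run: defaults + adapter detection +
# poly-G trimming, length cutoff relaxed to 30 bp. The guard makes this
# a no-op on machines without fastp installed.
if command -v fastp >/dev/null; then
    fastp \
        -i ${sample}_R1.fastq.gz \
        -I ${sample}_R2.fastq.gz \
        -o trimmed_fastq/${sample}_R1.trimmed.fastq.gz \
        -O trimmed_fastq/${sample}_R2.trimmed.fastq.gz \
        --detect_adapter_for_pe \
        --trim_poly_g \
        --length_required 30 \
        --thread 8 \
        --html fastp_reports/${sample}_relaxed.html \
        --json fastp_reports/${sample}_relaxed.json
fi
```

Comparing the two JSON reports would tell you how many reads the cut_* and strict length options alone are costing you.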
I have attached the Phred score image of my reads; your opinion would be appreciated.