What if the sequence length distribution differs sharply after I use fastp to trim my sequence data?
4.0 years ago · ljj_2016

First, I used FastQC to check the quality and found that the Per Base Sequence Content section was very poor, so I used fastp to do the trimming as follows:

fastp -i con10.1.fq.gz -I con10.2.fq.gz -o ./QC/con10.1.fq.gz -O ./QC/con10.2.fq.gz -h con10.html -j con10.json

but when I ran FastQC on the data after trimming, the sequence length distribution turned out to be very bad! The read length was supposed to be 150 bp, but there were two broad peaks in the sequence length plot, one near 60 bp and another near 149 bp. Is this normal, and can I still use this kind of data?
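In case it helps, one way to inspect the length distribution outside of FastQC is a quick one-liner (a minimal sketch, assuming standard 4-line gzipped FASTQ; file names match my command above):

zcat ./QC/con10.1.fq.gz | awk 'NR % 4 == 2 {print length($0)}' | sort -n | uniq -c

Each output line is a read count followed by a read length, so the two peaks show up directly in the counts.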

Tags: sequencing, ChIP-Seq, genome

First, Per Base Sequence Content shows the proportion of A, T, C, and G at each base position of your reads. I don't see why you went straight to trimming to fix it; you should first check whether you have something like adapter contamination, or a specific step in your experiment that could cause this kind of problem.
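For instance, you can check the raw (untrimmed) files for adapter signal with FastQC itself; a minimal sketch, with the output directory name just an example:

mkdir -p fastqc_raw && fastqc -o fastqc_raw con10.1.fq.gz con10.2.fq.gz

Then look at the Adapter Content and Overrepresented sequences sections of the reports.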

Second, once you have done trimming, it is normal to see reads shortened, so the length distribution will change.

I think you should first read the FastQC manual carefully to make sure you interpret each section correctly.


Thank you for your reply; my previous description may have been misleading. I did not directly "trim" to solve the problem; I just used fastp to filter my raw data. It can detect and remove adapter sequences, and it also filters out low-quality reads, reads with too many Ns, polyG tails, etc. I just called this step "trimming". But I don't understand why this step would change the length distribution so greatly. In my understanding, if the read length was supposed to be 150 bp but after filtering most reads turned out to be around 60 bp, wouldn't that mean this batch of sequencing is unreliable? I just don't know whether to give up on this data or continue with the downstream analysis. And I have already read the FastQC manual, but some questions still haunt me.


Where is the data from? The length distribution after trimming depends on a lot of things; here are two examples.

First, if most of your DNA fragments (before adapter ligation) are around 60 bp, the reads you get will run through the insert into the adapter:

original DNA fragment (~60 bp) + adapter sequence = 150 bp read

Trimming then discards the adapter part, so you keep only about 60 bp.
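You can roughly check this by counting how many raw reads contain the adapter start; a minimal sketch, assuming standard Illumina TruSeq adapters (the AGATCGGAAGAGC motif) and looking at the first 100,000 reads:

zcat con10.1.fq.gz | head -400000 | awk 'NR % 4 == 2' | grep -c AGATCGGAAGAGC

If a large fraction of those reads contain the motif, the ~60 bp peak after trimming is simply your insert size showing through.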

Second, base quality usually drops as the sequencing run progresses, which means the last bases of a read have lower quality than those at the beginning. I have never used fastp, but Trimmomatic (another trimming program) will discard the low-quality bases at the end of reads.

So if your sequencing quality is not very good, trimming can dramatically change the length distribution, since a large part of each read has to be discarded.
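If you want to see how another trimmer behaves on the same data, a typical Trimmomatic paired-end run looks something like this (a sketch; the adapter file path, window settings, and length cutoff are assumptions you should adapt to your setup):

trimmomatic PE con10.1.fq.gz con10.2.fq.gz \
    out.1.paired.fq.gz out.1.unpaired.fq.gz \
    out.2.paired.fq.gz out.2.unpaired.fq.gz \
    ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36

MINLEN:36 drops any read that ends up shorter than 36 bp after trimming, so very short leftovers are removed entirely instead of piling up in the length distribution.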
