Question: Why are there so many broken paired reads in my Illumina data, and how can I fix them?
seta (1.2k), Sweden, wrote 4.9 years ago:

Hi everybody,

My question is: what is the origin of broken paired reads? I am struggling with a de novo transcriptome assembly using the CLC Genomics Workbench software. First I trimmed the reads and removed duplicate reads, and then ran the assembly in the same software. The quality of my sequencing data (Illumina, 100 bp, paired-end) seems very good: about 97% of reads passed the trimming step, and only 1% of the total reads were duplicates. However, the assembly result is not satisfactory (about 330 million of 370 million reads were reported as broken paired reads!). I would highly appreciate it if you could let me know why there are so many broken reads and how I can fix this. Thanks in advance.

rna-seq next-gen assembly • 4.1k views
modified 4.9 years ago by SES (8.2k) • written 4.9 years ago by seta (1.2k)

If you're using a commercial product like CLC, then ask them questions like this. You are paying for support from them after all.

written 4.9 years ago by Devon Ryan (92k)
SES (8.2k), Vancouver, BC, wrote 4.9 years ago:

Broken pairs arise when one read of a pair is removed (because it is below a quality or length threshold) while its mate is not, leaving the two files out of sync. That said, it is not really possible to know the source of the issue without knowing how the trimming was done. If the trimming was done with CLC, then I agree with the above comment that it would be best for you to find a solution with the folks at CLC. That way you can keep all your analysis in the same place.

In case you aren't able to find the source of the issue (or a solution), you can use the Pairfq command pairfq makepairs (see the wiki page about that command for more information). The only caveat of this approach for you is that you would have to do some of your analysis in the CLC app, then go to the command line to fix your paired-end files, then go back to CLC for assembly... but it may solve your current issue.
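
To make the idea of re-pairing concrete, here is a minimal sketch of what a repair step does conceptually. It is not Pairfq's implementation; the function names (read_id, records, make_pairs) are hypothetical. It assumes four-line FASTQ records and read IDs that differ between mates only by a trailing /1 or /2 or by the comment after the first space. It holds one file's reads in memory, so it is only practical for small files.

```python
import itertools
import re


def read_id(header):
    """Return the mate-independent ID from a FASTQ header line."""
    token = header.split()[0]          # drop the comment after the first space
    if token.startswith("@"):
        token = token[1:]
    return re.sub(r"/[12]$", "", token)  # drop an old-style /1 or /2 suffix


def records(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line FASTQ stream."""
    lines = (line.rstrip("\n") for line in handle if line.strip())
    while True:
        rec = tuple(itertools.islice(lines, 4))
        if not rec:
            return
        if len(rec) != 4:
            raise ValueError("truncated FASTQ record")
        yield rec


def make_pairs(fwd, rev):
    """Split two FASTQ streams into synchronized pairs and singletons.

    Returns (pairs, fwd_singles, rev_singles), where pairs is a list of
    (forward_record, reverse_record) tuples in reverse-file order.
    """
    fwd_by_id = {read_id(rec[0]): rec for rec in records(fwd)}
    pairs, rev_singles = [], []
    for rec in records(rev):
        mate = fwd_by_id.pop(read_id(rec[0]), None)
        if mate is None:
            rev_singles.append(rec)
        else:
            pairs.append((mate, rec))
    # Whatever is left in the dictionary never found a mate.
    fwd_singles = list(fwd_by_id.values())
    return pairs, fwd_singles, rev_singles
```

Writing the pairs to two new files in the same order, and the singletons to separate files, gives the assembler input whose records line up again.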

written 4.9 years ago by SES (8.2k)

Thanks a lot for your response. It seems unlikely to me that one read of a pair was removed because of bad quality; as I said, my read quality was good enough. In your opinion, is it possible that it is because of an incorrect insert distance? I'll try your command on the trimmed reads; I have almost no problem with Linux.

written 4.9 years ago by seta (1.2k)

Why do you say it is unlikely that one read of a pair is removed after quality trimming? Almost all Illumina data I have seen is very high quality, but some reads will be removed (less than 1% usually). If you look at your stats after trimming, do the files have the same number of reads? If not, then this could be an issue if you used these reads in an assembly.
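
If the trimming report does not give per-file counts, they are easy to check directly. The following is a minimal sketch, assuming standard four-line FASTQ records (no wrapped sequence lines); the function names open_fastq and count_reads are illustrative and do not come from any tool mentioned in this thread.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_reads(handle):
    """Count records in a FASTQ stream that uses 4 lines per record."""
    n_lines = sum(1 for line in handle if line.strip())  # ignore blank lines
    if n_lines % 4 != 0:
        raise ValueError(f"line count {n_lines} is not a multiple of 4")
    return n_lines // 4
```

Calling count_reads(open_fastq(path)) on the forward and reverse files and comparing the two numbers answers the question above. Equal counts are necessary for the files to be in sync, but they do not prove it.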

I think it would help if you could clarify what you mean by "broken" reads. I understand in general terms what you mean, but how are you defining "broken" and what is the threshold for falling into this category?

written 4.9 years ago by SES (8.2k)

Hi SES, thanks a lot for your comments. No, they do not have the same number of reads: one has 189,616,838 and the other has 185,818,436. Do you mean this slight difference can cause such a big problem? No threshold is defined for broken reads; the software just reports the number of broken reads. It is the same general meaning that you understood. So how can I solve this problem? Just by using the tool you proposed (Pairfq)?

written 4.9 years ago by seta (1.2k)

A difference of 1, or even 0 if an equal number of reads are asymmetrically removed from each file, is enough to cause big problems. When dealing with paired ends, never use any tool that can't process both files at once (with the exception of tools like FastQC).
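
To illustrate the "even 0" case, here is a minimal sketch of a position-by-position sync check. It assumes four-line FASTQ records and mate IDs that differ only by a trailing /1 or /2 or by the comment after the first space; the function names are hypothetical.

```python
import re
from itertools import zip_longest


def read_id(header):
    """Return the mate-independent ID from a FASTQ header line."""
    token = header.split()[0]
    if token.startswith("@"):
        token = token[1:]
    return re.sub(r"/[12]$", "", token)


def headers(handle):
    """Yield only the header line of each 4-line FASTQ record."""
    lines = (line for line in handle if line.strip())
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield line


def first_mismatch(fwd, rev):
    """Return the 1-based record number where the files stop matching.

    Returns None if every record in one file has its mate at the same
    position in the other file.
    """
    paired = zip_longest(headers(fwd), headers(rev))
    for n, (h1, h2) in enumerate(paired, start=1):
        if h1 is None or h2 is None or read_id(h1) != read_id(h2):
            return n
    return None
```

For example, if trimming drops read r3 from the forward file and read r2 from the reverse file, both files still hold the same number of records, yet first_mismatch reports record 2: from that position on, reads are lined up against the wrong mates.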

written 4.9 years ago by Devon Ryan (92k)

Thanks for clarifying. I have two adapter sequences for each read, which I tried to remove using the CLC Genomics Workbench software. Could you please let me know how I can remove the two adapter sequences from both reads of a pair simultaneously?

written 4.9 years ago by seta (1.2k)

Well, I'd first complain to CLC, since they should provide a trimmer that can handle data like that if they don't already. Secondly, common trimmers are skewer, Trimmomatic and BBDuk (from BBMap).

However, you can just repair the files you have. BBMap comes with a script (repair.sh, I think) to do that conveniently.

written 4.9 years ago by Devon Ryan (92k)

See this thread: Combining The Paired Reads From Illumina Run

written 4.9 years ago by Ashutosh Pandey (11k)

Thanks Ashutosh, but I would like to know whether combining paired reads is common. Also, SES suggested Pairfq to solve the problem. I have not tried either of them, nor have I seen them in a paper; would you please let me know how reliable they are? Is there any documentation about this issue? Thanks so much.

written 4.9 years ago by seta (1.2k)

I referred to the documentation for Pairfq previously, and I would say it is more reliable than other solutions since it is the only tool that has tests. The other solutions are quite hackish and leave your data modified, so I wouldn't say they are reliable in any way. Also, I personally found the shell commands hard to modify for general use, and repeating the commands and then cleaning up afterwards is quite tedious. The BBMap solution requires the input to be interleaved (last I tried it). None of them are published.

written 4.9 years ago by SES (8.2k)

I wouldn't follow the advice in that thread... the awk suggestion specifically. It uses way too much memory, writes a bunch of intermediate files, involves multiple commands, modifies the read ID, only works on FASTQ, doesn't write out the singleton reads...

written 4.9 years ago by SES (8.2k)

Thanks for all your help, but I'm a bit confused: what is the best solution to get rid of this problem?

written 4.9 years ago by seta (1.2k)

With the number of reads you have, I would suggest installing pairfq and then running pairfq makepairs with the --index option. That will keep the memory usage very low, but it will take longer. I don't know of another solution that works with that many reads.
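
The memory-versus-time trade-off of an index can be sketched as follows. This is only an illustration of the general idea: how Pairfq builds its index is not described in this thread, and SQLite is used here purely as a stand-in. The function name pair_with_disk_index and the db_path parameter are hypothetical; pass a file path for db_path to keep the index on disk (the default ":memory:" is only convenient for small tests). Four-line FASTQ records are assumed.

```python
import itertools
import re
import sqlite3


def read_id(header):
    """Return the mate-independent ID from a FASTQ header line."""
    token = header.split()[0]
    if token.startswith("@"):
        token = token[1:]
    return re.sub(r"/[12]$", "", token)


def records(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line FASTQ stream."""
    lines = (line.rstrip("\n") for line in handle if line.strip())
    while True:
        rec = tuple(itertools.islice(lines, 4))
        if not rec:
            return
        if len(rec) != 4:
            raise ValueError("truncated FASTQ record")
        yield rec


def pair_with_disk_index(fwd, rev, db_path=":memory:"):
    """Yield (kind, fwd_rec, rev_rec) with kind in
    {'pair', 'fwd_single', 'rev_single'}; the missing mate is None."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE fwd (id TEXT PRIMARY KEY, rec TEXT, paired INTEGER DEFAULT 0)"
    )
    # Index every forward read by ID; the database holds the data, not a dict.
    db.executemany(
        "INSERT OR REPLACE INTO fwd (id, rec) VALUES (?, ?)",
        ((read_id(r[0]), "\n".join(r)) for r in records(fwd)),
    )
    db.commit()
    # Stream the reverse file and look up each mate in the index.
    for rec in records(rev):
        rid = read_id(rec[0])
        row = db.execute("SELECT rec FROM fwd WHERE id = ?", (rid,)).fetchone()
        if row is None:
            yield ("rev_single", None, rec)
        else:
            db.execute("UPDATE fwd SET paired = 1 WHERE id = ?", (rid,))
            yield ("pair", tuple(row[0].split("\n")), rec)
    # Forward reads that were never matched are singletons.
    for (text,) in db.execute("SELECT rec FROM fwd WHERE paired = 0"):
        yield ("fwd_single", tuple(text.split("\n")), None)
    db.close()
```

Each lookup goes through the database instead of an in-memory table, which is why this style of approach uses little memory but runs more slowly.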

modified 4.9 years ago • written 4.9 years ago by SES (8.2k)

Many thanks SES, I'll try it. 

written 4.9 years ago by seta (1.2k)

Follow what SES recommended. You can also search this forum using "orphan reads" or "unordered fastq". There are several posts that have discussed this and there are numerous solutions proposed. Just pick any one that has been liked and accepted by other users.

written 4.9 years ago by Ashutosh Pandey (11k)

What do you mean by broken reads? There are orphan reads at the FASTQ level, where one read of a pair was discarded because it couldn't pass QC (length and base quality). This produces a pair of FASTQ files that don't have the same number of reads and whose reads are no longer in the same order.

Or this has to do with the insert size, as seta suggested, i.e. in the final assembly the reads from a pair were placed farther apart than expected based on the insert size.

 

modified 4.9 years ago • written 4.9 years ago by Ashutosh Pandey (11k)

Are you asking me or the OP? Anyway, the first situation you mention is clearly what I was referring to, though this could lead to assembly artifacts.

written 4.9 years ago by SES (8.2k)