Why are there so many broken paired reads in my Illumina data, and how can I fix them?
9.4 years ago
seta ★ 1.9k

Hi everybody,

My question is: what is the origin of broken paired reads? I'm struggling with a de novo transcriptome assembly using CLC Genomics Workbench. First, I trimmed the reads and removed duplicate reads, then ran the assembly, all in this software. The quality of my sequencing data (Illumina 100 bp, paired-end) seems very good: about 97% of reads passed the trimming step, and only 1% of the total reads were duplicates. However, my assembly result is not satisfactory (about 330 million of 370 million reads were reported as broken paired reads!). I would highly appreciate it if you could let me know why there are so many broken reads and how I can fix this. Thanks in advance.

Assembly next-gen RNA-Seq • 6.7k views

If you're using a commercial product like CLC, then ask them questions like this. You are paying for support from them after all.

9.4 years ago
SES 8.6k

Broken pairs arise when one read of a pair is removed (because it is below a quality or length threshold) while its mate is not, leaving the two files out of sync. That said, it is not really possible to know the source of the issue without knowing how the trimming was done. If the trimming was done with CLC, then I agree with the above comment that it would be best to find a solution with the folks at CLC. That way you can keep all your analysis in the same place.

In case you aren't able to find the source of the issue (or a solution), you can use the Pairfq command pairfq makepairs (see the wiki page about that command for more information). The only caveat of this approach is that you would have to do some of your analysis in the CLC app, then go to the command line to fix your paired-end files, then go back to CLC for assembly... but it may solve your current issue.
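
The idea behind re-pairing can be sketched in a few lines of Python. This is only an illustration of the general technique, not how Pairfq itself works. The names `read_fastq`, `mate_id` and `make_pairs` are made up for this sketch. It assumes 4-line FASTQ records whose headers end in `/1` and `/2` or carry the mate number after a space, and it holds one whole file in memory, so it would not scale to hundreds of millions of reads.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield 4-line FASTQ records (assumes sequences are not line-wrapped)."""
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping the same iterator four times groups the lines in fours.
    yield from zip(stripped, stripped, stripped, stripped)


def mate_id(header: str) -> str:
    """Name shared by both mates: drop the comment and a trailing /1 or /2."""
    return re.sub(r"/[12]$", "", header.split()[0])


def make_pairs(
    fwd: Iterable[Record], rev: Iterable[Record]
) -> Tuple[List[Tuple[Record, Record]], List[Record], List[Record]]:
    """Split reads into synchronized pairs and the singletons from each file."""
    # Assumption: the whole reverse file fits in memory.
    rev_by_id: Dict[str, Record] = {mate_id(rec[0]): rec for rec in rev}
    paired: List[Tuple[Record, Record]] = []
    fwd_singles: List[Record] = []
    for rec in fwd:
        mate = rev_by_id.pop(mate_id(rec[0]), None)
        if mate is None:
            fwd_singles.append(rec)   # mate was removed during trimming
        else:
            paired.append((rec, mate))
    rev_singles = list(rev_by_id.values())  # reverse reads never matched
    return paired, fwd_singles, rev_singles
```

The paired records would then be written back to two files in the same order, with the singletons kept in separate files rather than discarded.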

Thanks a lot for your response. It seems unlikely to me that one read of a pair was removed because of bad quality; as I said, my read quality was good. In your opinion, could the cause be an incorrect insert distance? I'll try your command on the trimmed reads; I have almost no problems with Linux.

Why do you say it is unlikely that a pair is removed after quality trimming? Almost all Illumina data I have seen is very high quality, but some reads will be removed (less than 1% usually). If you look at your stats after trimming, do the files have the same number of reads? If not, then this could be an issue if you used these reads in an assembly.

I think it would help if you could clarify what you mean by "broken" reads. I understand in general terms what you mean, but how are you defining "broken" and what is the threshold for falling into this category?
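
Checking whether the two trimmed files hold the same number of reads takes only a line count. Here is a minimal sketch, assuming plain 4-line FASTQ records with no wrapped sequences and no trailing blank lines; `count_reads` is a name chosen for this example.

```python
from typing import Iterable


def count_reads(lines: Iterable[str]) -> int:
    """Count records in a 4-line FASTQ stream."""
    n_lines = sum(1 for _ in lines)
    if n_lines % 4 != 0:
        # A line count that is not a multiple of 4 suggests truncation
        # or a format other than 4-line FASTQ.
        raise ValueError(f"{n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

It can be called once per file, for example `with open(path) as handle: n = count_reads(handle)`, and the two results compared. Equal counts are necessary but not sufficient for the files to be in sync.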

Hi SES, thanks a lot for your comments. No, the numbers are not the same: one file has 189,616,838 reads and the other has 185,818,436. Do you mean this slight difference can cause such a big problem? No threshold for defining broken reads is given; the software just reports the number of broken reads. It has the same general meaning that you understood. So how can I solve this problem? Just by using the tool you proposed (Pairfq)?

A difference of 1, or even 0 if an equal number of reads are asymmetrically removed from each file, is enough to cause big problems. When dealing with paired ends, never use any tool that can't process both files at once (with the exception of tools like FastQC).
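
The "difference of 0" case can be shown with a small sketch: both files lose one read, so the counts match, yet the mates no longer line up by position. The helper names and the `/1`, `/2` header convention are assumptions made for this example.

```python
import re
from itertools import zip_longest
from typing import Iterable, Optional


def mate_id(header: str) -> str:
    """Name shared by both mates: drop the comment and a trailing /1 or /2."""
    return re.sub(r"/[12]$", "", header.split()[0])


def first_desync(fwd_headers: Iterable[str], rev_headers: Iterable[str]) -> Optional[int]:
    """Return the 0-based index of the first mismatched record, or None if in sync."""
    for index, (fwd, rev) in enumerate(zip_longest(fwd_headers, rev_headers)):
        if fwd is None or rev is None or mate_id(fwd) != mate_id(rev):
            return index
    return None


# Four original pairs; r2 was dropped from file 1 and r3 from file 2.
fwd = ["@r1/1", "@r3/1", "@r4/1"]
rev = ["@r1/2", "@r2/2", "@r4/2"]
print(len(fwd) == len(rev))    # True: the counts agree
print(first_desync(fwd, rev))  # 1: record 1 pairs r3 with r2
```

A tool that pairs reads by position would treat r3 and r2 as mates here, even though the read counts look fine.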

Thanks for clarifying. I have two adapter sequences for each read, which I tried to remove using CLC Genomics Workbench. Could you please let me know how I can remove two adapter sequences from paired reads simultaneously?

Well, I'd first complain to CLC, since they should provide a trimmer that can handle data like that if they don't already. Secondly, common trimmers are skewer, Trimmomatic and BBDuk (from BBMap).

However, you can just repair the files you have. BBMap comes with a script (repair.sh, I think) to do that conveniently.
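
What it means for a trimmer to handle both files at once can be sketched as below. This is a simplified illustration and not how skewer, Trimmomatic or BBDuk are implemented; `filter_pairs` and the `min_len` threshold of 36 are hypothetical choices for the example.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def filter_pairs(
    pairs: Iterable[Tuple[Record, Record]], min_len: int = 36
) -> Tuple[List[Tuple[Record, Record]], List[Record]]:
    """Keep a pair only if both mates pass; route lone survivors to singletons."""
    kept: List[Tuple[Record, Record]] = []
    singles: List[Record] = []
    for fwd, rev in pairs:
        fwd_ok = len(fwd[1]) >= min_len
        rev_ok = len(rev[1]) >= min_len
        if fwd_ok and rev_ok:
            kept.append((fwd, rev))   # both files stay in sync
        elif fwd_ok:
            singles.append(fwd)       # mate failed, so this read is an orphan
        elif rev_ok:
            singles.append(rev)
        # If neither mate passes, the pair is dropped entirely.
    return kept, singles
```

Because the decision is made per pair rather than per file, the two output files always contain the same reads in the same order.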

Thanks Ashutosh, but I would like to know whether combining paired reads is common. Also, SES suggested Pairfq to solve the problem. I have not tried either of them, nor have I seen them in a paper; would you please let me know how reliable they are? Is there any documentation about this issue? Thanks so much.

I referred to the documentation for Pairfq previously, and I would say it is more reliable than other solutions since it is the only tool that has tests. The other solutions are quite hackish and leave your data modified, so I wouldn't say they are reliable in any way. Also, I personally found the shell commands hard to modify for general use, and the repetition of commands and then cleaning up afterwards is quite tedious. The BBMap solution requires the input to be interleaved (last I tried it). None of them are published.

I wouldn't follow the advice in that thread... the awk suggestion specifically. It uses way too much memory, writes a bunch of intermediate files, involves multiple commands, modifies the read ID, only works on FASTQ, doesn't write out the singleton reads...

Thanks for all your help, but I'm a bit confused: what is the best solution to get rid of this problem?

With the number of reads you have, I would suggest installing pairfq and then running pairfq makepairs with the --index option. That will keep the memory usage very low, but it will take longer. I don't know of another solution that works with that many reads.
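
The memory-for-time trade-off of an on-disk index can be illustrated with Python's built-in `sqlite3` module. This is a sketch of the general idea only; it makes no claim about how Pairfq's `--index` option is implemented, and `make_pairs_indexed`, `mate_id` and the table layout are invented for the example. It assumes read names are unique within each file.

```python
import re
import sqlite3
from typing import Iterable, Iterator, Optional, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def mate_id(header: str) -> str:
    """Name shared by both mates: drop the comment and a trailing /1 or /2."""
    return re.sub(r"/[12]$", "", header.split()[0])


def make_pairs_indexed(
    fwd: Iterable[Record], rev: Iterable[Record], db_path: str = ":memory:"
) -> Iterator[Tuple[str, Optional[Record], Optional[Record]]]:
    """Yield ("pair" | "fwd_single" | "rev_single", forward, reverse) tuples.

    Pass a new file path as db_path to keep the index on disk instead of
    in RAM; the default ":memory:" is only convenient for small tests.
    """
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE rev (id TEXT PRIMARY KEY, header TEXT, seq TEXT, qual TEXT)")
    # The separator line is not stored; it is re-emitted as a bare "+".
    con.executemany(
        "INSERT INTO rev VALUES (?, ?, ?, ?)",
        ((mate_id(h), h, s, q) for h, s, _, q in rev),
    )
    con.commit()
    for rec in fwd:
        key = mate_id(rec[0])
        row = con.execute(
            "SELECT header, seq, qual FROM rev WHERE id = ?", (key,)
        ).fetchone()
        if row is None:
            yield "fwd_single", rec, None
        else:
            # Delete matched mates so only orphans remain at the end.
            con.execute("DELETE FROM rev WHERE id = ?", (key,))
            yield "pair", rec, (row[0], row[1], "+", row[2])
    for h, s, q in con.execute("SELECT header, seq, qual FROM rev ORDER BY rowid"):
        yield "rev_single", None, (h, s, "+", q)
    con.close()
```

Each forward read costs one database lookup instead of one dictionary lookup, which is why this kind of approach is slower but needs far less memory.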

Many thanks SES, I'll try it.

Follow what SES recommended. You can also search this forum using "orphan reads" or "unordered fastq". There are several posts that have discussed this and there are numerous solutions proposed. Just pick any one that has been liked and accepted by other users.

What do you mean by broken reads? There is something like orphan reads at the FASTQ level, where one read of a pair was discarded because it couldn't pass QC (length and base quality). This produces a pair of FASTQ files that don't have the same number of reads and whose reads are no longer in the same order.

Or it has to do with the insert size, as seta suggested, i.e. in the final assembly the reads from a pair were placed farther apart than expected based on the insert size.

Are you asking me or the OP? Anyway, the first situation you mention is clearly what I was referring to, though this could lead to assembly artifacts.
