Question: Combining The Paired Reads From Illumina Run
gravatar for Surya Saha
9.8 years ago by
Surya Saha260
Surya Saha260 wrote:


I have two fastq files with the forward(/1) and reverse(/2) paired reads. The reads are not in same order in either file, some pairs are absent/missing and the files are 8 GB each with abt 30 mill reads each.

I am trying to pull out all the paired reads for which both fwd and rev exist and running out of memory using a Bioperl script. Are there any C or C++ based efficient tools out there for doing this? Any algorithm ideas where I don't need to read in both read files into memory and can just parse through them.



paired bioperl scripting • 17k views
ADD COMMENTlink modified 7.6 years ago by Eric Normandeau10k • written 9.8 years ago by Surya Saha260
gravatar for Eric Normandeau
7.6 years ago by
Quebec, Canada
Eric Normandeau10k wrote:

(2013-06-09 EDIT: After a comment from a user, I have fixed a bug in the script. The version on GitHub is up-to-date)

This script that I wrote combines two fastq files that have been trimmed and contain orphans. It should be using MUCH less memory than the AWK script for big files:

It reads one sequence from file1, searches in hash2 is the corresponding sequence has been read yet. If yes, it writes both sequences to files, if not, it adds the sequence to hash1. Then, a sequence is read from file2, with the same pattern until the end of both files.

Here would be an example call to it:

python  fastq1  fastq2

NOTE: If it can be assumed that sequences in both files are ordered, the script could be made much more memory efficient at the expense of a bit more computation. For example, if we read sequence X in fastq1 and that it is also in fastq2, it could be assumed that all sequences that have been added to hash1 before can be flushed. This would lead to a very low memory footprint.

ADD COMMENTlink modified 7.6 years ago • written 7.6 years ago by Eric Normandeau10k

Thanks for great script! Very fast implementation. However, I would like to add that for me it works only in python 2.7.3 environment.

ADD REPLYlink written 7.3 years ago by Biomonika (Noolean)3.1k

My pleasure :) Do you mean it does not work with Python 3 (which it is not supposed to) or that it does not work in older environments (like Python 2.6)?

ADD REPLYlink modified 7.3 years ago • written 7.3 years ago by Eric Normandeau10k

Works a treat,

Thanks Eric!

ADD REPLYlink written 5.8 years ago by Jonathan Crowther200

My pleasure to see that this script is still being used by new people.

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by Eric Normandeau10k

Thanks for the script

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by dmenning0

This script is executing fine on my perfect results when I execute it on my laptop. But when I try to execute it on the server, it generates blank files. I don't know why. The operating system of my server is Debian. Do have any Idea like how can I fix this??

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by s.singh20

The script was written for Python 2.7. There is a possibility it will not work if you are using Python 3.x the server. You can test this with which python.

ADD REPLYlink written 4.2 years ago by Eric Normandeau10k

It doesn't work for me.

$ python --version
Python 2.7.13

$ which python

I run python fastq1 fastq2. My fastq1 is 4.6 MB and fastq2 is 4.5MB. The command run quickly. Two output files (_pairs_R1.fastq and _pairs_R2.fastq) do not have sequences (0 bytes). The xx_singles.fastq is 9.2 MB.

For more information:

The header of 1st sequence in fastq1:

@M03670:6:000000000-BCN4V:1:1101:16480:1645_SUB_CMP reverse_score=80.0; status=partial; direction=reverse; tail_quality=39.0; reverse_match=ccacctatcacataatcatg; seq_length_ori=151; seq_length=121; sample=BI_0_1_1; experiment=eDNA; location=BI_0; mid_quality=38.5267175573; head_quality=33.3; avg_quality=38.2119205298; grab=BI_0_1; reverse_primer=ccacctatcacayaatcatg;   1:N:0:1

The header of 1st sequence in fastq2:

@M03670:6:000000000-BCN4V:1:1101:16480:1645_SUB_CMP reverse_score=76.0; status=partial; direction=reverse; tail_quality=38.5; reverse_match=aagggcaccacaagaacgc; seq_length_ori=151; seq_length=122; sample=BI_0_1_1; experiment=eDNA; location=BI_0; mid_quality=38.3893129771; head_quality=34.6; avg_quality=38.1456953642; grab=BI_0_1; reverse_primer=aagggcaccacaagaacgc;   2:N:0:1
ADD REPLYlink modified 16 months ago by _r_am32k • written 3.3 years ago by hexphe10

As I replied to you on GitHub: If you launch the command with " " at the end, it will tell the script to use a space as a separator. Your name format is a bit strange so possibly you'll need to test some other options. Please report if this solves your problem. If not, please contact me by email with sample files (~100 sequences per file) so we can find a solution.

ADD REPLYlink written 3.3 years ago by Eric Normandeau10k

The new script you sent to me by email works when I use " " as a separator. Thank you very much for your help!

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by hexphe10
gravatar for lh3
9.8 years ago by
United States
lh332k wrote:

I have not tried, but something following this line:

awk '{printf substr($0,1,length-2);getline;printf "\t"$0;getline;getline;print "\t"$0}' read1.fq | sort -S 8G -T. > read1.txt
awk '{printf substr($0,1,length-2);getline;printf "\t"$0;getline;getline;print "\t"$0}' read2.fq | sort -S 8G -T. > read2.txt
join read1.txt read2.txt | awk '{print $1"\n"$2"\n+\n"$3 > "r1.fq";print $1"\n"$4"\n+\n"$5 > "r2.fq"}'

Basically, you convert fastq to TAB delimited format, sort them by read names, join sorted files together and then convert back to fastq.

EDIT: strip "/[12]" in read names.

ADD COMMENTlink modified 9.8 years ago • written 9.8 years ago by lh332k

"join" discards orphans. If you want to keep them, you can output orphan lines with "join -a". Writing your own script to join is also easy given two sorted files.

ADD REPLYlink written 9.8 years ago by lh332k

how does this deal with orphans?

ADD REPLYlink written 9.8 years ago by Jeremy Leipzig19k

I have a quite similar problem. Currently, I sort the file and reads one line by one line to do pairing, but that is time and memory consuming.

ADD REPLYlink written 9.7 years ago by Gangcai230
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 978 users visited in the last hour