Combining The Paired Reads From Illumina Run
2
9
Entering edit mode
11.1 years ago
Surya Saha ▴ 260

Hi,

I have two fastq files with the forward(/1) and reverse(/2) paired reads. The reads are not in same order in either file, some pairs are absent/missing and the files are 8 GB each with abt 30 mill reads each.

I am trying to pull out all the paired reads for which both fwd and rev exist and running out of memory using a Bioperl script. Are there any C or C++ based efficient tools out there for doing this? Any algorithm ideas where I don't need to read in both read files into memory and can just parse through them.

Thanks!

-S.

paired bioperl scripting • 19k views
15
Entering edit mode
9.0 years ago

(2013-06-09 EDIT: After a comment from a user, I have fixed a bug in the script. The version on GitHub is up-to-date)

This script that I wrote combines two fastq files that have been trimmed and contain orphans. It should be using MUCH less memory than the AWK script for big files:

https://github.com/enormandeau/Scripts/blob/master/fastqCombinePairedEnd.py

It reads one sequence from file1, searches in hash2 is the corresponding sequence has been read yet. If yes, it writes both sequences to files, if not, it adds the sequence to hash1. Then, a sequence is read from file2, with the same pattern until the end of both files.

Here would be an example call to it:

python fastqCombinePairedEnd.py  fastq1  fastq2


NOTE: If it can be assumed that sequences in both files are ordered, the script could be made much more memory efficient at the expense of a bit more computation. For example, if we read sequence X in fastq1 and that it is also in fastq2, it could be assumed that all sequences that have been added to hash1 before can be flushed. This would lead to a very low memory footprint.

0
Entering edit mode

Thanks for great script! Very fast implementation. However, I would like to add that for me it works only in python 2.7.3 environment.

0
Entering edit mode

My pleasure :) Do you mean it does not work with Python 3 (which it is not supposed to) or that it does not work in older environments (like Python 2.6)?

0
Entering edit mode

Works a treat,

Thanks Eric!

1
Entering edit mode

My pleasure to see that this script is still being used by new people.

0
Entering edit mode

Thanks for the script

0
Entering edit mode

This script is executing fine on my laptop...gives perfect results when I execute it on my laptop. But when I try to execute it on the server, it generates blank files. I don't know why. The operating system of my server is Debian. Do have any Idea like how can I fix this??

1
Entering edit mode

The script was written for Python 2.7. There is a possibility it will not work if you are using Python 3.x the server. You can test this with which python.

0
Entering edit mode

It doesn't work for me.

$python --version Python 2.7.13$ which python
/Library/Frameworks/Python.framework/Versions/2.7/bin/python


I run python fastqCombinePairedEnd.py fastq1 fastq2. My fastq1 is 4.6 MB and fastq2 is 4.5MB. The command run quickly. Two output files (_pairs_R1.fastq and _pairs_R2.fastq) do not have sequences (0 bytes). The xx_singles.fastq is 9.2 MB.

The header of 1st sequence in fastq1:

@M03670:6:000000000-BCN4V:1:1101:16480:1645_SUB_CMP reverse_score=80.0; status=partial; direction=reverse; tail_quality=39.0; reverse_match=ccacctatcacataatcatg; seq_length_ori=151; seq_length=121; sample=BI_0_1_1; experiment=eDNA; location=BI_0; mid_quality=38.5267175573; head_quality=33.3; avg_quality=38.2119205298; grab=BI_0_1; reverse_primer=ccacctatcacayaatcatg;   1:N:0:1


The header of 1st sequence in fastq2:

@M03670:6:000000000-BCN4V:1:1101:16480:1645_SUB_CMP reverse_score=76.0; status=partial; direction=reverse; tail_quality=38.5; reverse_match=aagggcaccacaagaacgc; seq_length_ori=151; seq_length=122; sample=BI_0_1_1; experiment=eDNA; location=BI_0; mid_quality=38.3893129771; head_quality=34.6; avg_quality=38.1456953642; grab=BI_0_1; reverse_primer=aagggcaccacaagaacgc;   2:N:0:1

0
Entering edit mode

As I replied to you on GitHub: If you launch the command with " " at the end, it will tell the script to use a space as a separator. Your name format is a bit strange so possibly you'll need to test some other options. Please report if this solves your problem. If not, please contact me by email with sample files (~100 sequences per file) so we can find a solution.

1
Entering edit mode

The new script you sent to me by email works when I use " " as a separator. Thank you very much for your help!

7
Entering edit mode
11.1 years ago
lh3 33k

I have not tried, but something following this line:

awk '{printf substr($0,1,length-2);getline;printf "\t"$0;getline;getline;print "\t"$0}' read1.fq | sort -S 8G -T. > read1.txt awk '{printf substr($0,1,length-2);getline;printf "\t"$0;getline;getline;print "\t"$0}' read2.fq | sort -S 8G -T. > read2.txt
join read1.txt read2.txt | awk '{print $1"\n"$2"\n+\n"$3 > "r1.fq";print$1"\n"$4"\n+\n"$5 > "r2.fq"}'


Basically, you convert fastq to TAB delimited format, sort them by read names, join sorted files together and then convert back to fastq.

EDIT: strip "/[12]" in read names.

1
Entering edit mode

"join" discards orphans. If you want to keep them, you can output orphan lines with "join -a". Writing your own script to join is also easy given two sorted files.

0
Entering edit mode

how does this deal with orphans?

0
Entering edit mode

I have a quite similar problem. Currently, I sort the file and reads one line by one line to do pairing, but that is time and memory consuming.