(2013-06-09 EDIT: After a comment from a user, I have fixed a bug in the script. The version on GitHub is up to date.)
This script I wrote combines two fastq files that have been trimmed and contain orphans. It should use MUCH less memory than the AWK script for big files:
https://github.com/enormandeau/Scripts/blob/master/fastqCombinePairedEnd.py
It reads one sequence from file1 and checks in hash2 whether the corresponding sequence has already been read. If yes, it writes both sequences to the output files; if not, it adds the sequence to hash1. It then reads one sequence from file2 and applies the same pattern, alternating between the two files until both are exhausted.
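For illustration, here is a minimal sketch of that two-dictionary approach (not the actual script; see the GitHub link for that). It assumes 4-line FASTQ records and that a pair is identified by the header up to the first whitespace; older Illumina names ending in /1 and /2 would need that suffix stripped first. The names read_fastq, seen1 and seen2 are mine for the example:

    import itertools

    def read_fastq(handle):
        """Yield (name, record) tuples, one 4-line FASTQ record at a time."""
        while True:
            lines = list(itertools.islice(handle, 4))
            if not lines:
                break
            # Pair name: header line up to the first whitespace
            yield lines[0].split()[0], "".join(lines)

    def combine(fastq1, fastq2, out1, out2):
        with open(fastq1) as f1, open(fastq2) as f2, \
             open(out1, "w") as o1, open(out2, "w") as o2:
            seen1, seen2 = {}, {}  # reads still waiting for their mate
            for r1, r2 in itertools.zip_longest(read_fastq(f1), read_fastq(f2)):
                if r1 is not None:
                    name, rec = r1
                    if name in seen2:      # mate already buffered from file2
                        o1.write(rec)
                        o2.write(seen2.pop(name))
                    else:
                        seen1[name] = rec
                if r2 is not None:
                    name, rec = r2
                    if name in seen1:      # mate already buffered from file1
                        o2.write(rec)
                        o1.write(seen1.pop(name))
                    else:
                        seen2[name] = rec
            # Whatever remains in seen1/seen2 at the end has no mate (orphans)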
Here is an example call:
python fastqCombinePairedEnd.py fastq1 fastq2
NOTE: If the sequences in both files can be assumed to be in the same relative order, the script could be made much more memory efficient at the expense of a bit more computation. For example, when a sequence X read from fastq1 is found in hash2, every sequence added to hash1 before that point must be an orphan (under the ordering assumption, its mate in fastq2 would already have appeared and matched it), so hash1 can be flushed, as sketched below. This would lead to a very low memory footprint.
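A hedged sketch of that flush, reusing the two-dictionary layout from the sketch above; flush_seen1 and orphans1 are illustrative names, not part of the published script:

    def flush_seen1(seen1, orphans1):
        """Under the ordering assumption, every read still buffered from
        fastq1 at the moment a match is found can only be an orphan:
        write it to the orphan file and drop it from memory."""
        for record in seen1.values():
            orphans1.write(record)
        seen1.clear()
        # The symmetric flush applies to seen2 when a read from fastq2
        # finds its mate in seen1.

Calling this right after each successful match keeps the buffers close to empty, which is where the very low memory footprint comes from; the extra computation is the repeated scan-and-write over the buffer.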