Faster processing of paired end FASTQ files
Entering edit mode
2.8 years ago
volatic • 0

Hi all!

I was wondering whether anyone would have any insights into how to make the following python script faster. I need to access a set of paired end FASTQ files, and sort the reads into three distinct files depending on which adaptor sequence is present (or whether none of them is present). However, my current implementation takes over 5h+ for around 20 million reads... Would anyone have any hints into how to make it more efficient? I'm already using the low level FastqGeneralIterator which is supposed to be faster...

Thanks in advance!!

import gzip
import sys
from Bio import SeqIO
from Bio import Align
from Bio.SeqIO.QualityIO import FastqGeneralIterator

# Arguments
file1 = sys.argv[1]
file2 = sys.argv[2]
adaptor_root = sys.argv[3]
adaptor_five = sys.argv[4]
adaptor_three = sys.argv[5]
id = sys.argv[6]
cwd_path = sys.argv[7]
temp_path = sys.argv[8]

output_five_file_1 = temp_path + id + "_five_1.fastq.gz"
output_five_file_2 = temp_path + id + "_five_2.fastq.gz"
output_three_file_1 = temp_path + id + "_three_1.fastq.gz"
output_three_file_2 = temp_path + id + "_three_2.fastq.gz"
output_else_file_1 = temp_path + id + "_else_1.fastq.gz"
output_else_file_2 = temp_path + id + "_else_2.fastq.gz"

output_five_1 =, 'wt', 1)
output_five_2 =, 'wt', 1)
output_three_1 =, 'wt', 1)
output_three_2 =, 'wt', 1)
output_else_1 =, 'wt', 1)
output_else_2 =, 'wt', 1)

#Setup aligner with match 1, mismatch 0, local
aligner = Align.PairwiseAligner()
aligner.mode = 'local'
aligner.open_gap_score = 0

with, "rt") as handle1:
    with, "rt") as handle2:
        for (title1, seq1, qual1), (title2, seq2, qual2) in zip(FastqGeneralIterator(handle1), FastqGeneralIterator(handle2)):
            root_alignment = aligner.align(seq2[0:30], adaptor_root)
            if root_alignment.score > 20:
                five_alignment = aligner.align(seq2[0:30], adaptor_five).score
                three_alignment = aligner.align(seq2[0:30], adaptor_three).score
                # if a 5 prime alignment is likelier, then output it there, else output to 3 prime file
                if five_alignment > three_alignment:
                    output_five_1.write("@%s\n%s\n+\n%s\n" % (title1, seq1, qual1))
                    j = 30 + len(seq2[30:151]) - len(seq2[30:151].lstrip('G'))
                    output_five_2.write("@%s\n%s\n+\n%s\n" % (title2, seq2[j:151], qual2[j:151]))
                elif three_alignment > five_alignment:
                    output_three_1.write("@%s\n%s\n+\n%s\n" % (title1, seq1, qual1))
                    output_three_2.write("@%s\n%s\n+\n%s\n" % (title2, seq2[30:151], qual2[30:151]))
                output_else_1.write("@%s\n%s\n+\n%s\n" % (title1, seq1, qual1))
                output_else_2.write("@%s\n%s\n+\n%s\n" % (title2, seq2, qual2))

sequence rna-seq alignment • 1.1k views
Entering edit mode

Not an answer for your specific question but if you are looking to get this done fast I suggest using filtering mode of from BBMap suite with literal=sequence option. You can find a guide here.

Entering edit mode

Sounds like a problem you can run faster using parallelization.

Unrelated: I'd suggest taking a look at the argparse module for parsing command line arguments.

Entering edit mode
2.8 years ago
swbarnes2 9.7k

Aligning to see if the adapter is there will be slow. There might not be a way around that. It might be faster to use a regex to look for exact matches, and align only the ones that fail the perfect regex match.

Are you sure that there is no out of the box software that can do this for you? I'm pretty sure that pairwise alignment is not the right algorithm for the job.


Login before adding your answer.

Traffic: 3118 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6