Question: Faster processing of paired end FASTQ files
gravatar for volatic
2.3 years ago by
volatic0 wrote:

Hi all!

I was wondering whether anyone would have any insights into how to make the following python script faster. I need to access a set of paired end FASTQ files, and sort the reads into three distinct files depending on which adaptor sequence is present (or whether none of them is present). However, my current implementation takes over 5h+ for around 20 million reads... Would anyone have any hints into how to make it more efficient? I'm already using the low level FastqGeneralIterator which is supposed to be faster...

Thanks in advance!!

import gzip
import sys
from Bio import SeqIO
from Bio import Align
from Bio.SeqIO.QualityIO import FastqGeneralIterator

# Arguments
file1 = sys.argv[1]
file2 = sys.argv[2]
adaptor_root = sys.argv[3]
adaptor_five = sys.argv[4]
adaptor_three = sys.argv[5]
id = sys.argv[6]
cwd_path = sys.argv[7]
temp_path = sys.argv[8]

output_five_file_1 = temp_path + id + "_five_1.fastq.gz"
output_five_file_2 = temp_path + id + "_five_2.fastq.gz"
output_three_file_1 = temp_path + id + "_three_1.fastq.gz"
output_three_file_2 = temp_path + id + "_three_2.fastq.gz"
output_else_file_1 = temp_path + id + "_else_1.fastq.gz"
output_else_file_2 = temp_path + id + "_else_2.fastq.gz"

output_five_1 =, 'wt', 1)
output_five_2 =, 'wt', 1)
output_three_1 =, 'wt', 1)
output_three_2 =, 'wt', 1)
output_else_1 =, 'wt', 1)
output_else_2 =, 'wt', 1)

#Setup aligner with match 1, mismatch 0, local
aligner = Align.PairwiseAligner()
aligner.mode = 'local'
aligner.open_gap_score = 0

with, "rt") as handle1:
    with, "rt") as handle2:
        for (title1, seq1, qual1), (title2, seq2, qual2) in zip(FastqGeneralIterator(handle1), FastqGeneralIterator(handle2)):
            root_alignment = aligner.align(seq2[0:30], adaptor_root)
            if root_alignment.score > 20:
                five_alignment = aligner.align(seq2[0:30], adaptor_five).score
                three_alignment = aligner.align(seq2[0:30], adaptor_three).score
                # if a 5 prime alignment is likelier, then output it there, else output to 3 prime file
                if five_alignment > three_alignment:
                    output_five_1.write("@%s\n%s\n+\n%s\n" % (title1, seq1, qual1))
                    j = 30 + len(seq2[30:151]) - len(seq2[30:151].lstrip('G'))
                    output_five_2.write("@%s\n%s\n+\n%s\n" % (title2, seq2[j:151], qual2[j:151]))
                elif three_alignment > five_alignment:
                    output_three_1.write("@%s\n%s\n+\n%s\n" % (title1, seq1, qual1))
                    output_three_2.write("@%s\n%s\n+\n%s\n" % (title2, seq2[30:151], qual2[30:151]))
                output_else_1.write("@%s\n%s\n+\n%s\n" % (title1, seq1, qual1))
                output_else_2.write("@%s\n%s\n+\n%s\n" % (title2, seq2, qual2))

rna-seq alignment sequence • 920 views
ADD COMMENTlink modified 2.3 years ago by swbarnes28.9k • written 2.3 years ago by volatic0

Not an answer for your specific question but if you are looking to get this done fast I suggest using filtering mode of from BBMap suite with literal=sequence option. You can find a guide here.

ADD REPLYlink written 2.3 years ago by genomax91k

Sounds like a problem you can run faster using parallelization.

Unrelated: I'd suggest taking a look at the argparse module for parsing command line arguments.

ADD REPLYlink written 2.3 years ago by WouterDeCoster44k
gravatar for swbarnes2
2.3 years ago by
United States
swbarnes28.9k wrote:

Aligning to see if the adapter is there will be slow. There might not be a way around that. It might be faster to use a regex to look for exact matches, and align only the ones that fail the perfect regex match.

Are you sure that there is no out of the box software that can do this for you? I'm pretty sure that pairwise alignment is not the right algorithm for the job.

ADD COMMENTlink written 2.3 years ago by swbarnes28.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1441 users visited in the last hour