Extracting A Subset Of Sequences From A Fastq File (Biopython Speed)
5
7
Entering edit mode
13.0 years ago
Darked89 4.2k

Initially my problem was to extract all entries from a FASTQ file with names not present in a FASTA file. Using biopython I wrote:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

corrected_fn   = "my_input_fasta.fas"
uncorrected_fn = "my_input_fastq.ftq"
output_fn      = "differences_fastq.ftq"

corrected_names = []
for line in open(corrected_fn):
if line[0] == ">":

input_fastq_fn = uncorrected_fn
corrected_names.sort()

handle = open(output_fn, "w")
for title, seq, qual in FastqGeneralIterator(open(input_fastq_fn)) :
if title not in corrected_names:
handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()


Problem is, it is very slow. On 2Ghz workstation starting from a local disc it can take two days per pair of files:

• 4870868 seqs in FASTQ
• 4299464 seqs in FASTA

Removing title from corrected_names speeds up things a bit (this version I used for running).

Am I doing something obviously silly or simply FastqGeneralIterator is not a best construct to use here? While I like Python best, I am open to answers in Perl/Ruby.

Slicing and dicing FASTQ files based on lists seems to be fairly common task.

Edit: Python 2.6.4, biopython 1.53, Linux Fedora 8.

Edit 2:

fastq biopython next-gen sequencing python • 13k views
0
Entering edit mode

input_fastq_fn is not defined (in the 4th-to-last line)

0
Entering edit mode

@giovanni: while cleaning up script for posting I cut one line to many (input_fastq_fn = uncorrected_fn) see also Edit 2 in the post

0
Entering edit mode

well, you don't need the input_fastq_fn variable anyway, you can use uncorrected_fn directly.

6
Entering edit mode
13.0 years ago

I can suggest a modification that should improve the speed issue, though I don't know by how much. Instead of making a list of your corrected_names, you could make a set and look if the values exists. Searching through a huge set for values will be much faster than sorting a list and searching through it. This is because a set is optimized for search speed.

It is, in addition, much more pythonic :)

from Bio.SeqIO.QualityIO import FastqGeneralIterator

corrected_fn   = "my_input_fasta.fas"
uncorrected_fn = "my_input_fastq.ftq"
output_fn      = "differences_fastq.ftq"

corrected_names = set() # Use a set instead of a list
for line in open(corrected_fn):
if line[0] == ">":

# corrected_names.sort() # No need, a set is orderless and optimized for search

handle = open(output_fn, "w")
for title, seq, qual in FastqGeneralIterator(open(input_fastq_fn)) :
if title not in corrected_names:
handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()


Tell me if this improves things! (Note: I didn't test this as a whole)

Cheers

2
Entering edit mode

A set() might be even more pythonic. Replacing corrected_names[read_name] = 0 with correct_names.add(read_name])

1
Entering edit mode

As Will suggests, using a set rather than a dict would be more elegant and will waste less memory. Both the set and the dict will uses hashes to make the membership test MUCH faster than a list.

0
Entering edit mode

Yes, it works ant it is a dramatic speed improvement. Resulting files are identical to ones produced by my slow-as-hell variant :-). Thank you a lot!

0
Entering edit mode

One more thing: the old version was writing ca 100 sequences / minute (after creating the list, which was fairy fast). The new one writes almost 11 MILLIONS in less than 3 mins.

0
Entering edit mode

Nice :) Python dictionaries are often a better choice than lists, mostly if execution time is limiting and you have to do multiple searches in the data structure.

0
Entering edit mode

This is one useful thing I learned about Python thanks to you!

3
Entering edit mode
13.0 years ago
Paulo Nuin ★ 3.7k

A smallish change would be from

if line[0] == ">":


to

if line.startswith('>'):

0
Entering edit mode

Sure is more elegant!

3
Entering edit mode
13.0 years ago
brentp 24k

it's also worth noting that there are a couple good options for random access in fasta sequences. with a branch in biopython:

#fi = SeqIO.indexed_dict(corrected_fn, corrected_fn + ".idx", "fastq")
fi = SeqIO.indexed_db(corrected_fn + ".idx", corrected_fn, "fastq")
handle = open(output_fn, "w")
for title, seq, qual in FastqGeneralIterator(open(input_fastq_fn)):
if title not in fi: continue
handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()


and also check out [?]screed[?] which does something very similar. these will both have overhead when first indexing the file but should be quite fast after that.

0
Entering edit mode

That code ended up being called the Bio.SeqIO.index_db function, and will be in Biopython 1.57 due this month.

0
Entering edit mode

changed, thanks. and glad that's going in, it'll be quite useful.

0
Entering edit mode

Still not changed. The first line should be

fi = SeqIO.index_db(corrected_fn, corrected_fn + ".idx", "fastq")

0
Entering edit mode

The first line should be:

fi = SeqIO.index_db(corrected_fn + ".idx", corrected_fn, "fastq")

0
Entering edit mode

@Nick, indeed. updated.

3
Entering edit mode
13.0 years ago

You can do it with Unix command line tools instead of python: it should be faster because these tools are written in C directly. First, use grep on the first file to get the list of all the ids:

$: grep '>' my_input_fasta.fas|sed 's/>//g'|sort > my_input_fasta.ids$: grep '>' my_input_fastq.ftq|sed 's/>//g'|sort > my_input_fastq.ids


Then, use comm to get the differences:

$: comm fas_ids ftq_ids seq1 seq2 seq3 seq4  The first column tells you the ids belonging to fas_ids, the second the ones belonging to ftq_ids, and the third the ones in common. You can use the -1, -2, -3 options to get only the ids belonging to one file, e.g.: $: comm fas_ids ftq_ids  -23
seq2


will give you the ids present in fas_ids but not in the other.

2
Entering edit mode

WARNING: You can get false positives using this:

grep '^@' my_input_fastq.ftq


This is because quality lines can start with an @ character.

1
Entering edit mode

cool. i didn't know about comm

0
Entering edit mode
grep '^>' my_input_fasta.fas


will be even faster.

For FASTQ I will have to use:

grep '^@' my_input_fastq.ftq


Thanks for comm tip. I run away from (f)grep after getting a bit wrong results (names vary in length and can be substrings of other names)

0
Entering edit mode

Also originally I wanted to get sequences for some reason omitted from the output by Correction for SOAPdenovo. So getting names alone will not do. BTW, Eric's solution is more than fast enough.

0
Entering edit mode

@Peter: you are right that Illumina fastq standart contains @. Somehow quality strings in my files do not have @ or A, but even for a sequence with a string of Ns have B-quality instead. In general, yes, grep is way more dangerous than i.e. proper fastq format parsers.

0
Entering edit mode

it is not grep that gives you false positives, it is the regular expression you write. If that is wrong, it doesn't matter whether you match it with grep or with python's re. Anyway, sorry for the error in the post, I never used fastq files..

0
Entering edit mode
11.6 years ago
Yifangt ▴ 10

My question is related, but not quite the same: I need to extract the substring of each sequence from a FASTQ file a sub-string of the sequence and the corresponding quality score, the rest untouched. My Illumina reads consists of 101bp long. I want remove the first 10bp and last 25bp then only the middle 66bp left.

The reason I need do this is my DNA sequence is methylated and the first 10 and last 25bp seem not having good quality in general so that I want get rid of them. My challenge is to remove both quality score and sequence correspondingly.

Seems quite easy, but I need to learn python above perl. Any tools there for me? Thanks a lot!

Yifang

0
Entering edit mode

Welcome to BioStar. You should post this as a new question, instead of as an answer to an existing question.