Extracting A Subset Of Sequences From A Fastq File (Biopython Speed)
5
7
Entering edit mode
14.0 years ago
Darked89 4.6k

Initially my problem was to extract all entries from a FASTQ file with names not present in a FASTA file. Using biopython I wrote:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

corrected_fn   = "my_input_fasta.fas"
uncorrected_fn = "my_input_fastq.ftq"
output_fn      = "differences_fastq.ftq"

corrected_names = []
for line in open(corrected_fn):
        if line[0] == ">":
                read_name = line.split()[0][1:] 
                corrected_names.append(read_name)

input_fastq_fn = uncorrected_fn
corrected_names.sort()

handle = open(output_fn, "w")
for title, seq, qual in FastqGeneralIterator(open(input_fastq_fn)) :
        if title not in corrected_names:
                handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()

Problem is, it is very slow. On 2Ghz workstation starting from a local disc it can take two days per pair of files:

  • 4870868 seqs in FASTQ
  • 4299464 seqs in FASTA

Removing title from corrected_names speeds up things a bit (this version I used for running).

Am I doing something obviously silly or simply FastqGeneralIterator is not a best construct to use here? While I like Python best, I am open to answers in Perl/Ruby.

Slicing and dicing FASTQ files based on lists seems to be fairly common task.

Edit: Python 2.6.4, biopython 1.53, Linux Fedora 8.

Edit 2:

biopython python fastq next-gen-sequencing • 15k views
ADD COMMENT
0
Entering edit mode

input_fastq_fn is not defined (in the 4th-to-last line)

ADD REPLY
0
Entering edit mode

@giovanni: while cleaning up script for posting I cut one line to many (input_fastq_fn = uncorrected_fn) see also Edit 2 in the post

ADD REPLY
0
Entering edit mode

well, you don't need the input_fastq_fn variable anyway, you can use uncorrected_fn directly.

ADD REPLY
6
Entering edit mode
14.0 years ago

I can suggest a modification that should improve the speed issue, though I don't know by how much. Instead of making a list of your corrected_names, you could make a set and look if the values exists. Searching through a huge set for values will be much faster than sorting a list and searching through it. This is because a set is optimized for search speed.

It is, in addition, much more pythonic :)

from Bio.SeqIO.QualityIO import FastqGeneralIterator

corrected_fn   = "my_input_fasta.fas"
uncorrected_fn = "my_input_fastq.ftq"
output_fn      = "differences_fastq.ftq"

corrected_names = set() # Use a set instead of a list
for line in open(corrected_fn):
        if line[0] == ">":
                read_name = line.split()[0][1:] 
                corrected_names.add(read_name) # Add value to set

# corrected_names.sort() # No need, a set is orderless and optimized for search

handle = open(output_fn, "w")
for title, seq, qual in FastqGeneralIterator(open(input_fastq_fn)) :
        if title not in corrected_names:
                handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()

Tell me if this improves things! (Note: I didn't test this as a whole)

Cheers

ADD COMMENT
2
Entering edit mode

A set() might be even more pythonic. Replacing corrected_names[read_name] = 0 with correct_names.add(read_name])

ADD REPLY
1
Entering edit mode

As Will suggests, using a set rather than a dict would be more elegant and will waste less memory. Both the set and the dict will uses hashes to make the membership test MUCH faster than a list.

ADD REPLY
0
Entering edit mode

Yes, it works ant it is a dramatic speed improvement. Resulting files are identical to ones produced by my slow-as-hell variant :-). Thank you a lot!

ADD REPLY
0
Entering edit mode

One more thing: the old version was writing ca 100 sequences / minute (after creating the list, which was fairy fast). The new one writes almost 11 MILLIONS in less than 3 mins.

ADD REPLY
0
Entering edit mode

Nice :) Python dictionaries are often a better choice than lists, mostly if execution time is limiting and you have to do multiple searches in the data structure.

ADD REPLY
0
Entering edit mode

This is one useful thing I learned about Python thanks to you!

ADD REPLY
3
Entering edit mode
14.0 years ago
Paulo Nuin ★ 3.7k

A smallish change would be from

if line[0] == ">":

to

if line.startswith('>'):
ADD COMMENT
0
Entering edit mode

Sure is more elegant!

ADD REPLY
3
Entering edit mode
14.0 years ago
brentp 24k

it's also worth noting that there are a couple good options for random access in fasta sequences. with a branch in biopython:

#fi = SeqIO.indexed_dict(corrected_fn, corrected_fn + ".idx", "fastq")
fi = SeqIO.indexed_db(corrected_fn + ".idx", corrected_fn, "fastq")
handle = open(output_fn, "w")
for title, seq, qual in FastqGeneralIterator(open(input_fastq_fn)):
        if title not in fi: continue
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()

and also check out [?]screed[?] which does something very similar. these will both have overhead when first indexing the file but should be quite fast after that.

ADD COMMENT
0
Entering edit mode

That code ended up being called the Bio.SeqIO.index_db function, and will be in Biopython 1.57 due this month.

ADD REPLY
0
Entering edit mode

changed, thanks. and glad that's going in, it'll be quite useful.

ADD REPLY
0
Entering edit mode

Still not changed. The first line should be

fi = SeqIO.index_db(corrected_fn, corrected_fn + ".idx", "fastq")
ADD REPLY
0
Entering edit mode

The first line should be:

fi = SeqIO.index_db(corrected_fn + ".idx", corrected_fn, "fastq")
ADD REPLY
0
Entering edit mode

@Nick, indeed. updated.

ADD REPLY
3
Entering edit mode
14.0 years ago

You can do it with Unix command line tools instead of python: it should be faster because these tools are written in C directly. First, use grep on the first file to get the list of all the ids:

$: grep '>' my_input_fasta.fas|sed 's/>//g'|sort > my_input_fasta.ids
$: grep '>' my_input_fastq.ftq|sed 's/>//g'|sort > my_input_fastq.ids

Then, use comm to get the differences:

$: comm fas_ids ftq_ids 
       seq1
seq2
    seq3
    seq4

The first column tells you the ids belonging to fas_ids, the second the ones belonging to ftq_ids, and the third the ones in common. You can use the -1, -2, -3 options to get only the ids belonging to one file, e.g.:

$: comm fas_ids ftq_ids  -23
seq2

will give you the ids present in fas_ids but not in the other.

ADD COMMENT
2
Entering edit mode

WARNING: You can get false positives using this:

grep '^@' my_input_fastq.ftq

This is because quality lines can start with an @ character.

ADD REPLY
1
Entering edit mode

cool. i didn't know about comm

ADD REPLY
0
Entering edit mode
grep '^>' my_input_fasta.fas

will be even faster.

For FASTQ I will have to use:

grep '^@' my_input_fastq.ftq

Thanks for comm tip. I run away from (f)grep after getting a bit wrong results (names vary in length and can be substrings of other names)

ADD REPLY
0
Entering edit mode

Also originally I wanted to get sequences for some reason omitted from the output by Correction for SOAPdenovo. So getting names alone will not do. BTW, Eric's solution is more than fast enough.

ADD REPLY
0
Entering edit mode

@Peter: you are right that Illumina fastq standart contains @. Somehow quality strings in my files do not have @ or A, but even for a sequence with a string of Ns have B-quality instead. In general, yes, grep is way more dangerous than i.e. proper fastq format parsers.

ADD REPLY
0
Entering edit mode

it is not grep that gives you false positives, it is the regular expression you write. If that is wrong, it doesn't matter whether you match it with grep or with python's re. Anyway, sorry for the error in the post, I never used fastq files..

ADD REPLY
0
Entering edit mode
12.7 years ago
Yifangt ▴ 10

My question is related, but not quite the same: I need to extract the substring of each sequence from a FASTQ file a sub-string of the sequence and the corresponding quality score, the rest untouched. My Illumina reads consists of 101bp long. I want remove the first 10bp and last 25bp then only the middle 66bp left.

The reason I need do this is my DNA sequence is methylated and the first 10 and last 25bp seem not having good quality in general so that I want get rid of them. My challenge is to remove both quality score and sequence correspondingly.

Seems quite easy, but I need to learn python above perl. Any tools there for me? Thanks a lot!

Yifang

ADD COMMENT
0
Entering edit mode

Welcome to BioStar. You should post this as a new question, instead of as an answer to an existing question.

ADD REPLY

Login before adding your answer.

Traffic: 1907 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6