There's a nice discussion about how to efficiently retrieve a selection of reads from a fastq file here: How To Efficiently Parse A Huge Fastq File?
The efficient solutions proposed in that discussion seem to be based on parsing the fastq file read by read and testing whether each read belongs to the set of selected reads.
I tried to implement an index-based solution using biopython (see SeqIO.index in http://biopython.org/wiki/SeqIO#Sequence_Input) as follows:
import sys
from Bio import SeqIO

strip = str.rstrip
in_fastq = sys.argv[1]

# Build an in-memory index of the fastq file (read name -> file offset).
record_dict = SeqIO.index(in_fastq, "fastq")

# For each requested read name, fetch the record and write it out.
with open(sys.argv[2], "r") as in_names:
    for line in in_names:
        SeqIO.write(record_dict[strip(line)], sys.stdout, "fastq")
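The script takes the fastq file as its first argument and a file with one read name per line as its second argument, and writes the selected records to stdout, e.g. (the script name here is arbitrary):

python extract_by_index.py reads.fastq read_names.txt > selected.fastq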
In the example I tested (extracting a list of about 300,000 reads from a fastq file containing about 3,000,000 reads), this was much slower than the other type of solution.
Are there cases where an index-based solution might be appropriate?
Maybe a persistent index would be useful when the list of reads is not known beforehand and reads have to be extracted on demand. What would be a good way to make the index persistent? Creating a read database?
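For instance, would Biopython's SeqIO.index_db (which stores the index in an SQLite database on disk, so it only has to be built once and can be reopened by later runs) be the way to go? Something like this untested sketch (file names and read name are placeholders):

import sys
from Bio import SeqIO

# Build (first run) or reopen (subsequent runs) an SQLite-backed index;
# the "reads.idx" file persists on disk between runs.
record_dict = SeqIO.index_db("reads.idx", "reads.fastq", "fastq")

# Reads can then be fetched on demand by name, without re-parsing the fastq.
SeqIO.write(record_dict["SOME_READ_NAME"], sys.stdout, "fastq")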
The fastest way to test whether members of a very large set belong to a small set is to do the expensive work on the small set, by hashing it (or applying some other relatively costly operation) once, and then perform a cheap hash lookup for each member of the large set, not to do an expensive operation on the large set. This is true inside and outside of bioinformatics. Indexing is only useful when the large set needs to be accessed many times or randomly; in other words, it is never useful if the procedure can be implemented in a single pass.
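To make that concrete for the fastq case, here is a rough sketch of the single-pass, set-based approach (file names are placeholders; FastqGeneralIterator is Biopython's low-level fastq parser, which skips building SeqRecord objects):

import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator

# Expensive step on the small set: hash all wanted read names once.
with open("read_names.txt") as in_names:
    wanted = set(line.rstrip() for line in in_names)

# Cheap step on the large set: a single pass over the fastq,
# with an O(1) set lookup per record.
with open("reads.fastq") as in_fastq:
    for title, seq, qual in FastqGeneralIterator(in_fastq):
        if title.split(None, 1)[0] in wanted:
            sys.stdout.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))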
IGV, a genome visualizer, is a good example of something that cannot be implemented in one pass, since it's interactive; indexing is thus very useful.