Question: Extract Reads In Fastq Using A Bed File.
gravatar for Kasthuri
7.4 years ago by
United States
Kasthuri270 wrote:

I need to extract reads from a fastq file using genomic coordinates in a bed file. The output should again be a fastq file. Is there any easier way to do this other than converting the fastq to sam using Picard?


bed fastq • 5.4k views
ADD COMMENTlink modified 3.0 years ago by michael.ante3.6k • written 7.4 years ago by Kasthuri270
gravatar for nilshomer
7.4 years ago by
nilshomer60 wrote:

A fastq file does not contain coordinate information. First, you must map/align the data to your reference genome (say with bwa), and then you can extract the sequences/reads using the coordinates in the BED file.

ADD COMMENTlink written 7.4 years ago by nilshomer60

Thanks a lot. Will try that.

ADD REPLYlink written 7.4 years ago by Kasthuri270
gravatar for steve
3.0 years ago by
United States
steve2.6k wrote:

This is an old question, but I recently had this same problem, and here is what I came up with. I've uploaded the scripts I used here

I had:

  • .fastq.gz files from which I wanted to extract a subset of reads

  • a .bed file which contained coordinates for which I wanted reads from the .fastq.gz files

  • .bam alignments for each set of .fastq.gz files

I also wanted to make sure I was extracting the exact entries from the original .fastq.gz files, instead of converting back & forth between file formats. I decided to use BioPython's SeqIO, and pysam Python packages to make the scripting less complicated. Also using Python's gzip library to read and write to/from the .gz files directly.


First thing I did was create a .tsv formatted samplesheet that listed the samples, and the corresponding .bam and .fastq.gz files, shown here


Here is an overview of the script I set up for this.

Import Paths & Regions

Load packages and import file paths from samplesheet and target regions

import os
import sys
import csv
import gzip
from Bio import SeqIO
import pysam

# file with sample IDs, paths to .fastq and .bam files
samples_file = "sample_files.tsv"
# file with genomic regions to use
target_file = "targets.bed"
# directory to save the output fastq files
output_dir= "output_fastq"

# read in the samples
samples = []
with open(samples_file) as f:
    reader = csv.DictReader(f, delimiter = '\t')
    for sample in reader:

# split the fastq and bam entries
for sample in samples:
    sample['fastq'] = sample['fastq'].split(',')
    sample['bam'] = sample['bam'].split(',')
    # add 'qname' for later
    sample['qname'] = []
    # make fastq output file paths
    sample['output_fastq'] = {}
    for fastq in sample['fastq']:
        sample['output_fastq'][fastq] = os.path.join(output_dir, os.path.basename(fastq)) # .strip('.gz')

# read in the target regions
targets = []
with open(target_file) as f:
    for line in f:
        chrom, start, stop = line.split()
        targets.append((chrom, int(start), int(stop)))

Find .bam Read ID's

Get the ID's of the sequences from the alignment file that match the regions we want

# get the read IDs ('qname') for the targets from each bam per sample
for chrom, start, stop in targets:
    for sample in samples:
        for bam_file in sample['bam']:
            # load the bam
            bam = pysam.AlignmentFile(bam_file, "rb")
            for read in bam.fetch(chrom, start, stop):

# get only the unique qnames per sample
for sample in samples:
    sample['qname'] = sorted(set(sample['qname']))

Find Matching .fastq.gz Reads

Find the sequences in the .fastq.gz files that match the ID's we found in the .bam files for the desired regions.

# get the fastq reads that match
for sample in samples:
    for fastq in sample['fastq']:
        output_fastq = sample['output_fastq'][fastq]
        with as gz_in,, 'wb') as gz_out:
            input_seq_iterator = SeqIO.parse(gz_in, "fastq")
            seq_iterator = (record for record in input_seq_iterator if in sample['qname'])
            SeqIO.write(seq_iterator, gz_out, "fastq")


The full script for this is located here. Since this runs each sample sequentially, it is very slow (took 10hrs to extract ~10-20k reads from the listed .fastq files), I also wrote a version here that only operates on one sample at a time, so you can run multiple samples in parrallel such as on an HPC cluster using a script like this one

Hope this helps!

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by steve2.6k
gravatar for Alex Reynolds
7.4 years ago by
Alex Reynolds31k
Seattle, WA USA
Alex Reynolds31k wrote:

You can convert read sequences to FASTA, align to your genome with BLAT, and take the resulting PSL file and convert it to BED to do set operations on.

The resulting BED file can be used as a lookup table for read IDs, for use in filtering FASTQ records.

ADD COMMENTlink written 7.4 years ago by Alex Reynolds31k

Thanks Alex. I think this will be a good solution as I need to extract exome regions from genome which is already mapped by illumina (paired-end). I don't think illumina uses bwa which I guess is a better aligner. I wanted to realign using bwa and to do that I am converting the aligned file to fastq. I don't want to realign the whole genome as I need only exonic regions. And hence this question.

ADD REPLYlink written 7.4 years ago by Kasthuri270
gravatar for swbarnes2
3.0 years ago by
United States
swbarnes29.2k wrote:

Your fastq has no inherent mapping information in it. FastqToSam will make you an unmapped bam; bam format, with no mapping positions.

You need to map first, to know what reads are in the desired region, pull out those reads, then convert those back to fastq. BEDTools can get you the intersect between a .bed file and a .bam. Pipe that output to Picard or samtools, which can convert a bam to a fastq on the fly.

ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by swbarnes29.2k
gravatar for michael.ante
3.0 years ago by
michael.ante3.6k wrote:

I have a rather different approach in mind, but not fully tested.

Using your BED file, you can extract the genomic sequences of interest as a fasta file (e.g. looping over the bed file and using samtools faidx). With BBSplit, you can extract all those fastq entries mapping to these regions. This should perform a bit faster than the map-against-all approach.

ADD COMMENTlink written 3.0 years ago by michael.ante3.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1231 users visited in the last hour