Question: Extract Reads In Fastq Using A Bed File.
1
gravatar for Kasthuri
6.0 years ago by
Kasthuri240
United States
Kasthuri240 wrote:

I need to extract reads from a fastq file using genomic coordinates in a bed file. The output should again be a fastq file. Is there any easier way to do this other than converting the fastq to sam using Picard?

Thanks

bed fastq • 4.1k views
ADD COMMENTlink modified 19 months ago by michael.ante3.3k • written 6.0 years ago by Kasthuri240
4
gravatar for nilshomer
6.0 years ago by
nilshomer60
nilshomer60 wrote:

A fastq file does not contain coordinate information. First, you must map/align the data to your reference genome (say with bwa), and then you can extract the sequences/reads using the coordinates in the BED file.

ADD COMMENTlink written 6.0 years ago by nilshomer60

Thanks a lot. Will try that.

ADD REPLYlink written 6.0 years ago by Kasthuri240
1
gravatar for steve
19 months ago by
steve2.0k
United States
steve2.0k wrote:

This is an old question, but I recently had this same problem, and here is what I came up with. I've uploaded the scripts I used here

I had:

  • .fastq.gz files from which I wanted to extract a subset of reads

  • a .bed file which contained coordinates for which I wanted reads from the .fastq.gz files

  • .bam alignments for each set of .fastq.gz files

I also wanted to make sure I was extracting the exact entries from the original .fastq.gz files, instead of converting back & forth between file formats. I decided to use BioPython's SeqIO, and pysam Python packages to make the scripting less complicated. Also using Python's gzip library to read and write to/from the .gz files directly.

Setup

First thing I did was create a .tsv formatted samplesheet that listed the samples, and the corresponding .bam and .fastq.gz files, shown here

Scripting

Here is an overview of the script I set up for this.

Import Paths & Regions

Load packages and import file paths from samplesheet and target regions

import os
import sys
import csv
import gzip
from Bio import SeqIO
import pysam


# file with sample IDs, paths to .fastq and .bam files
samples_file = "sample_files.tsv"
# file with genomic regions to use
target_file = "targets.bed"
# directory to save the output fastq files
output_dir= "output_fastq"


# read in the samples
samples = []
with open(samples_file) as f:
    reader = csv.DictReader(f, delimiter = '\t')
    for sample in reader:
        samples.append(sample)

# split the fastq and bam entries
for sample in samples:
    sample['fastq'] = sample['fastq'].split(',')
    sample['bam'] = sample['bam'].split(',')
    # add 'qname' for later
    sample['qname'] = []
    # make fastq output file paths
    sample['output_fastq'] = {}
    for fastq in sample['fastq']:
        sample['output_fastq'][fastq] = os.path.join(output_dir, os.path.basename(fastq)) # .strip('.gz')

# read in the target regions
targets = []
with open(target_file) as f:
    for line in f:
        chrom, start, stop = line.split()
        targets.append((chrom, int(start), int(stop)))

Find .bam Read ID's

Get the ID's of the sequences from the alignment file that match the regions we want

# get the read IDs ('qname') for the targets from each bam per sample
for chrom, start, stop in targets:
    for sample in samples:
        for bam_file in sample['bam']:
            # load the bam
            bam = pysam.AlignmentFile(bam_file, "rb")
            for read in bam.fetch(chrom, start, stop):
                sample['qname'].append(read.qname)
            bam.close()


# get only the unique qnames per sample
for sample in samples:
    sample['qname'] = sorted(set(sample['qname']))

Find Matching .fastq.gz Reads

Find the sequences in the .fastq.gz files that match the ID's we found in the .bam files for the desired regions.

# get the fastq reads that match
for sample in samples:
    for fastq in sample['fastq']:
        output_fastq = sample['output_fastq'][fastq]
        with gzip.open(fastq) as gz_in, gzip.open(output_fastq, 'wb') as gz_out:
            input_seq_iterator = SeqIO.parse(gz_in, "fastq")
            seq_iterator = (record for record in input_seq_iterator if record.id in sample['qname'])
            SeqIO.write(seq_iterator, gz_out, "fastq")

Conclusion

The full script for this is located here. Since this runs each sample sequentially, it is very slow (took 10hrs to extract ~10-20k reads from the listed .fastq files), I also wrote a version here that only operates on one sample at a time, so you can run multiple samples in parrallel such as on an HPC cluster using a script like this one

Hope this helps!

ADD COMMENTlink modified 19 months ago • written 19 months ago by steve2.0k
0
gravatar for Alex Reynolds
6.0 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

You can convert read sequences to FASTA, align to your genome with BLAT, and take the resulting PSL file and convert it to BED to do set operations on.

The resulting BED file can be used as a lookup table for read IDs, for use in filtering FASTQ records.

ADD COMMENTlink written 6.0 years ago by Alex Reynolds28k

Thanks Alex. I think this will be a good solution as I need to extract exome regions from genome which is already mapped by illumina (paired-end). I don't think illumina uses bwa which I guess is a better aligner. I wanted to realign using bwa and to do that I am converting the aligned file to fastq. I don't want to realign the whole genome as I need only exonic regions. And hence this question.

ADD REPLYlink written 6.0 years ago by Kasthuri240
0
gravatar for swbarnes2
19 months ago by
swbarnes25.9k
United States
swbarnes25.9k wrote:

Your fastq has no inherent mapping information in it. FastqToSam will make you an unmapped bam; bam format, with no mapping positions.

You need to map first, to know what reads are in the desired region, pull out those reads, then convert those back to fastq. BEDTools can get you the intersect between a .bed file and a .bam. Pipe that output to Picard or samtools, which can convert a bam to a fastq on the fly.

ADD COMMENTlink modified 19 months ago • written 19 months ago by swbarnes25.9k
0
gravatar for michael.ante
19 months ago by
michael.ante3.3k
Austria/Vienna
michael.ante3.3k wrote:

I have a rather different approach in mind, but not fully tested.

Using your BED file, you can extract the genomic sequences of interest as a fasta file (e.g. looping over the bed file and using samtools faidx). With BBSplit, you can extract all those fastq entries mapping to these regions. This should perform a bit faster than the map-against-all approach.

ADD COMMENTlink written 19 months ago by michael.ante3.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1473 users visited in the last hour