Question

Sorting Fastq Files After Trimming (Orphans And Pe)

6

11.3 years ago

neal.platt ▴ 240

I have a bunch of Illumina PE data that has been run through fastx trimmer and clipper. I am ready to map these reads, but I need to create 2 files for the paired-end reads (the left- and right-hand reads in separate files) and a third file with the orphaned reads. Of course, the paired-end files need to have the reads in the same order.

This has to be a common problem, but I can't seem to find a tool that parses fastq files in this way (I swear I searched the Biostar forum).

Any help would be greatly appreciated.

NP

illumina paired-end mapping • 11k views

ADD COMMENT • link updated 11.3 years ago by Ryan Thompson ★ 3.6k • written 11.3 years ago by neal.platt ▴ 240

0

What do the headers of the two reads that make up a pair look like?

ADD REPLY • link 11.3 years ago by Biomonika (Noolean) 3.2k

0

The header for Read1:
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090 1:N:0:TTAGGC

The header for Read 2:
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090 2:N:0:TTAGGC

I do have a script to convert the headers to:
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090#0/1
@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090#0/2
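
For readers without such a script, here is a minimal Python sketch of the conversion shown above. It assumes headers follow the `@name read:filter:control:index` layout of the two examples; the function name `to_old_style` is made up for illustration, and the `#0` is simply copied from the converted headers above rather than derived from the barcode.

```python
import re

# Matches headers shaped like "@<name> <read>:<filter>:<control>:<index>",
# i.e. the layout of the two example headers in this post (an assumption).
CASAVA_HEADER = re.compile(r"^(@\S+) ([12]):[YN]:\d+:\S*$")


def to_old_style(header: str) -> str:
    """Turn '@name 1:N:0:TTAGGC' into '@name#0/1'; leave other headers as-is."""
    header = header.rstrip("\n")
    match = CASAVA_HEADER.match(header)
    if match is None:
        return header  # not in the expected layout, so do not touch it
    name, read = match.groups()
    return f"{name}#0/{read}"


print(to_old_style("@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090 1:N:0:TTAGGC"))
print(to_old_style("@D3NH4HQ1:150:C1FTFACXX:7:1101:1605:2090 2:N:0:TTAGGC"))
```

Only every fourth line of a FASTQ file (the header line) should be passed through such a function, since quality strings can also start with `@`.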

Thanks,

NP

ADD REPLY • link 11.3 years ago by neal.platt ▴ 240

score 3 · Answer 1 · 2012-12-24

First of all, processing the two ends of a pair separately is bad practice that we should always avoid. That is why you have not seen related questions here. Mappers can usually perform basic filtering themselves, so there is no need to throw reads away beforehand. When we do have to drop reads, we should mark the reads to be deleted (e.g. trim them down to 1 bp) and then merge the two ends. In that case, we just need to read the two files sequentially, which is easy and fast.

In your case, you can compare the original fastq and the read-dropped fastq to mark dropped reads. If you do not have the original fastq, use sort as follows:

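# Flatten each record to one "name sequence quality" line (name = header up to the first space),
# sort both ends by name, then join them on the name.
mkfifo tmp
awk 'NR%4==1{n=$1}NR%4==2{s=$1}NR%4==0{print n,s,$1}' r1.fq | sort -k1,1 -S 2G > tmp &
awk 'NR%4==1{n=$1}NR%4==2{s=$1}NR%4==0{print n,s,$1}' r2.fq | sort -k1,1 -S 2G | join -a1 -a2 tmp - | awk 'NF==5{print $1"\n"$2"\n+\n"$3 >"x1.fq";print $1"\n"$4"\n+\n"$5 >"x2.fq"}NF==3{print $1"\n"$2"\n+\n"$3>"orphan.fq"}'
rm tmp  # remove the named pipe once the join has finished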

This procedure needs little memory: each sort is limited to a 2 GB buffer by -S 2G. You may also use Alex's script, but make sure your machine has enough RAM to hold all of your sequences. Again, I am showing a solution, but we should avoid this complication in the first place.
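
The first suggestion in this answer, comparing the trimmed files against the original FASTQ, could look roughly like the Python sketch below. This is one possible reading of that suggestion, not the answerer's code. It assumes that the trimmed files keep their reads in the original order and that both mates share the header text before the first space; the names `read_fastq`, `read_name` and `classify` are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # (header, sequence, plus line, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Group an iterable of lines into 4-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines)
    # Zipping one iterator with itself consumes four lines per record;
    # an incomplete trailing record is silently dropped.
    return zip(stripped, stripped, stripped, stripped)


def read_name(header: str) -> str:
    """Name shared by both mates: everything before the first space."""
    return header.split(" ", 1)[0]


def classify(original_names, trimmed1, trimmed2):
    """Yield ("pair", rec1, rec2) or ("orphan", rec, None) in original order.

    Only one pending record per trimmed file is held in memory at a time.
    """
    it1, it2 = iter(trimmed1), iter(trimmed2)
    rec1, rec2 = next(it1, None), next(it2, None)
    for name in original_names:
        has1 = rec1 is not None and read_name(rec1[0]) == name
        has2 = rec2 is not None and read_name(rec2[0]) == name
        if has1 and has2:
            yield "pair", rec1, rec2
        elif has1 or has2:
            yield "orphan", rec1 if has1 else rec2, None
        # Advance whichever trimmed file contributed a record for this name.
        if has1:
            rec1 = next(it1, None)
        if has2:
            rec2 = next(it2, None)


# Toy data: read "b" lost its second mate, read "c" lost its first mate.
original = ["@a", "@b", "@c", "@d"]
r1 = [("@a 1:N:0:X", "ACGT", "+", "IIII"), ("@b 1:N:0:X", "AC", "+", "II"),
      ("@d 1:N:0:X", "A", "+", "I")]
r2 = [("@a 2:N:0:X", "TTTT", "+", "IIII"), ("@c 2:N:0:X", "GG", "+", "II"),
      ("@d 2:N:0:X", "C", "+", "I")]
for kind, first, second in classify(original, r1, r2):
    print(kind, first[0], second[0] if second else "")
```

With real data, `original_names` would be generated lazily from the untrimmed file, for example `(read_name(rec[0]) for rec in read_fastq(open("original_1.fq")))`, where `original_1.fq` is a placeholder file name.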

score 0 · Answer 2 · 2012-12-23

This seems like a pretty straightforward scripting exercise. This script should work with your read naming scheme, although its use of a hash table may be a bit memory-intensive. How well it works will depend on the size of your input and on your system memory. To use it, run something like:

$ ./splitFq.pl < convertedReads.fq

This will create two files firstPair.fq and secondPair.fq, with reads in identical order (if read data are present for a given header). So if there are reads present in the input for headers A/1, A/2, B/1, C/1, C/2 (in whatever order: C, A, B, or B, C, A, etc.), then the output for the first file will be read data for A, B, C and the output for the second file will be read data for A, C — in that order. If header pairs match, then you will have identical ordering in both files.

Note the renaming of headers in the two output files, where the last two characters are stripped.

#!/usr/bin/env perl

use strict;
use warnings;

my $header;
my $pair;
my $sequence;
my $secondHeader;
my $quality;
my $lineIdx = 0;
my $fqRef;

while (<>) {
    chomp;
    if ($lineIdx == 0) {
        $header = $_;
        $pair = substr $header, -1, 1; # get read pair number, either '1' or '2'
        $header = substr $header, 0, length($header) - 2; # strip last two characters to make an ID that points to both read pairs
    }
    elsif ($lineIdx == 1) {
        $sequence = $_;
    }
    elsif ($lineIdx == 2) {
        $secondHeader = $_;
    }
    elsif ($lineIdx == 3) {
        $quality = $_;
    }
    $lineIdx++;
    if ($lineIdx == 4) {
        $lineIdx = 0;
        $fqRef->{$header}->{$pair}->{sequence} = $sequence;
        $fqRef->{$header}->{$pair}->{secondHeader} = $secondHeader;
        $fqRef->{$header}->{$pair}->{quality} = $quality;
    }
}

open FIRSTPAIR, ">", "firstPair.fq" or die "Cannot open firstPair.fq: $!";
open SECONDPAIR, ">", "secondPair.fq" or die "Cannot open secondPair.fq: $!";
foreach $header (sort keys %{$fqRef}) {
    if (defined $fqRef->{$header}->{1}->{sequence}) {
        $sequence = $fqRef->{$header}->{1}->{sequence};
        $secondHeader = $fqRef->{$header}->{1}->{secondHeader};
        $quality = $fqRef->{$header}->{1}->{quality};
        my $firstFq = "$header\n$sequence\n$secondHeader\n$quality";
        print FIRSTPAIR "$firstFq\n";
    }
    if (defined $fqRef->{$header}->{2}->{sequence}) {
        $sequence = $fqRef->{$header}->{2}->{sequence};
        $secondHeader = $fqRef->{$header}->{2}->{secondHeader};
        $quality = $fqRef->{$header}->{2}->{quality};
        my $secondFq = "$header\n$sequence\n$secondHeader\n$quality";
        print SECONDPAIR "$secondFq\n";
    }
}
close SECONDPAIR;
close FIRSTPAIR;
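
Since the original question also asks for a file of orphans, here is a hedged Python sketch of the same hash-table idea with a third output. It is not a port of the Perl script: it keeps the `/1` and `/2` suffixes instead of stripping them, and it returns lists rather than writing files. It assumes headers end in `/1` or `/2` as in the converted naming scheme; `split_pairs` is a made-up name. Like the script, it holds every read in memory.

```python
from collections import defaultdict


def split_pairs(records):
    """Split (header, seq, plus, qual) records into mate-1, mate-2 and orphans.

    Headers are assumed to end in '/1' or '/2'. The two paired lists come
    back sorted by read name, so they are in identical order.
    """
    by_name = defaultdict(dict)
    for header, seq, plus, qual in records:
        name, mate = header[:-2], header[-1]  # strip '/1' or '/2' to get the shared ID
        by_name[name][mate] = (seq, plus, qual)

    first, second, orphans = [], [], []
    for name in sorted(by_name):
        mates = by_name[name]
        if "1" in mates and "2" in mates:
            first.append((name + "/1",) + mates["1"])
            second.append((name + "/2",) + mates["2"])
        else:
            # Only one mate survived trimming: send it to the orphan list.
            for mate, rec in mates.items():
                orphans.append((f"{name}/{mate}",) + rec)
    return first, second, orphans


# Reads arrive in arbitrary order; B has lost its second mate.
reads = [
    ("@C#0/2", "GG", "+", "II"), ("@A#0/1", "AC", "+", "II"),
    ("@B#0/1", "TT", "+", "II"), ("@C#0/1", "CA", "+", "II"),
    ("@A#0/2", "GT", "+", "II"),
]
first, second, orphans = split_pairs(reads)
print([r[0] for r in first], [r[0] for r in second], [r[0] for r in orphans])
```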

score 0 · Answer 3 · 2013-01-03

I've done something like this previously: take two fastq files for the /1 and /2 read pairs, do quality trimming etc., then interleave them into a single fastq file and spit out the orphans into their own file. My approach (http://bigsa.org.au/node/92) uses cdbfasta and awk:

1. Use cdbfasta to index a concatenated fastq file of the two paired fastq files.
2. Use cdbyank to pull out the read IDs from the index.
3. Parse the read IDs with awk, in 2 passes, to identify duplicate and orphaned read names, assuming "/" delimits the template name from the read pair suffix.
4. Use cdbyank to pull out the fastq sequences for the paired and orphaned reads from the concatenated fastq file in step 1, using the read IDs from step 3.
5. If you want deinterleaved pairs, run the interleaved fastq file through my deinterleaving script.
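
The two-pass ID parsing in the steps above can be illustrated with a short Python stand-in. This is not the linked awk script, only a sketch of the idea: it assumes read IDs of the form `template/1` and `template/2`, treats a template name seen exactly twice as a pair and one seen once as an orphan, and uses the made-up name `classify_ids`.

```python
from collections import Counter


def template(read_id: str) -> str:
    """Template name: the read ID without its '/1' or '/2' suffix."""
    return read_id.rsplit("/", 1)[0]


def classify_ids(read_ids):
    """Return (paired_ids, orphan_ids) from a list of read IDs."""
    read_ids = list(read_ids)
    # Pass 1: count how often each template name occurs.
    counts = Counter(template(rid) for rid in read_ids)
    # Pass 2: an ID is paired if its template was seen twice, orphaned if once.
    # Sorting puts the two mates of each pair next to each other (interleaved).
    paired = sorted(rid for rid in read_ids if counts[template(rid)] == 2)
    orphans = sorted(rid for rid in read_ids if counts[template(rid)] == 1)
    return paired, orphans


ids = ["r2/1", "r1/1", "r3/2", "r1/2", "r3/1"]
paired, orphans = classify_ids(ids)
print(paired)
print(orphans)
```

The resulting ID lists are what would then be handed back to cdbyank to retrieve the actual records.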

There are lots of ways to do this, but these scripts might give you a head start!

score 0 · Answer 4 · 2013-01-04

Perhaps you should interleave the read pairs into a single file, then do whatever trimming you require, and finally split the remaining interleaved pairs back out into separate files. This avoids the need to match up the reads afterward, because they will always be adjacent to each other.
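
As a rough illustration of that workflow, here is a Python sketch of the interleave and split steps. It assumes the trimming step in between preserves record order, and that mates share either the header text before the first space or a name ending in `/1` and `/2`. The function names are made up for illustration. The split step checks adjacent names, so a read whose mate was dropped ends up in the orphan list instead of shifting the pairing.

```python
def read_name(header):
    """Name shared by both mates (drops ' 1:N:0:...' or a trailing '/1', '/2')."""
    name = header.split(" ", 1)[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def interleave(reads1, reads2):
    """Alternate mates from two equally ordered inputs: r1[0], r2[0], r1[1], ..."""
    for rec1, rec2 in zip(reads1, reads2):
        yield rec1
        yield rec2


def split_interleaved(records):
    """Split an interleaved stream into mate-1, mate-2 and orphan lists."""
    first, second, orphans = [], [], []
    pending = None  # a record still waiting for its mate
    for rec in records:
        if pending is not None and read_name(pending[0]) == read_name(rec[0]):
            first.append(pending)
            second.append(rec)
            pending = None
        else:
            if pending is not None:
                orphans.append(pending)  # its mate never showed up
            pending = rec
    if pending is not None:
        orphans.append(pending)
    return first, second, orphans


r1 = [("@a/1", "AC", "+", "II"), ("@b/1", "GG", "+", "II"), ("@c/1", "TT", "+", "II")]
r2 = [("@a/2", "GT", "+", "II"), ("@b/2", "CC", "+", "II"), ("@c/2", "AA", "+", "II")]
mixed = list(interleave(r1, r2))
trimmed = [rec for rec in mixed if rec[0] != "@b/2"]  # pretend trimming dropped b/2
first, second, orphans = split_interleaved(trimmed)
print([r[0] for r in first], [r[0] for r in second], [r[0] for r in orphans])
```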

Also, depending on your needs, SeqPrep might be a good fit for you. It does some kinds of trimming while preserving read pairs.