Question: Split Fastq Files With A Pattern
1
gravatar for Assa Yeroslaviz
5.4 years ago by
Assa Yeroslaviz1.1k
Munich
Assa Yeroslaviz1.1k wrote:

Hi everybody,

I would like to split a fastq file into two separate fastq files on a specific position. For example, I have these reads:

@HWI-ST225:523:D1AY5ACXX:8:1101:1566:2149 1:N:0:GTGGCC
GCATACCCTCCCTGTCTCAGTTGCTGTTGAAAGAAGAAATCCGCGATATCTTATCCAACCCGCGATATCTTATCCAACGAAGCCAAAACCCTCGCAGTCTG
+
??@DDDDDHHHHHIGIHGHICHEGAHC<HHII9?FB3?F4C<FDD>0?9D*99*??D@;AA9=?'''3@@>@A@>::?B?2?-?CC3<A??8?B@B#####
@HWI-ST225:523:D1AY5ACXX:8:1101:2000:2191 1:N:0:GTGGCC
TTGCTTGTTGCGTGTCTCAGGCGGAAAAACGCCAAAATCAACCGCGATATACATTCCAACCCGCGATATACATTCCAACGTCAGCTCTGAGCTGCTGATCT
+
@@CFFFFFHHHHFGFIIJJJIIIHHIJJIIGGGGIEHICHAEDEHHHHADDB9?BEDDDBBDDDDBBB>DDC@@C@CCDBBD05>CCDACDDDDDDC:3>>

If you look a bit closer, you can see that there is a pair of linkers in each of the reads. The linkers are:

A - CCGCGATAT CTTA TCCAAC

B - CCGCGATAT ACAT TCCAAC

The linkers differ in only four bases in the middle. I would like to split the fastq file into two, or trim each time either sides of the fastq file exactly between these two linkers (but still keep the quality values for the reads).

Does anyone know a way to do so?

Thanks Assa

fastq split • 2.6k views
ADD COMMENTlink modified 5.1 years ago by SRKR160 • written 5.4 years ago by Assa Yeroslaviz1.1k
2
gravatar for JC
5.4 years ago by
JC7.0k
Mexico
JC7.0k wrote:

One solution in Perl:

#!/usr/bin/perl

use strict;
use warnings;

my $linkA = 'CCGCGATATCTTATCCAAC';
my $linkB = 'CCGCGATATACATTCCAAC';

open L, ">left.fq"  or die;
open R, ">right.fq" or die;
my $nl = 0;
my ($id, $seq, $sep, $qual);
while (<>) {
    chomp;
    $nl++;
    if ($nl == 1) { 
        $id   = $_; 
    }
    elsif ($nl == 2) { 
        $seq  = $_; 
    }
    elsif ($nl == 3) { 
        $sep  = $_; 
    }
    elsif ($nl == 4) { 
        $qual = $_;
        splitRead($id, $seq, $sep, $qual);
        $nl = 0;
    }
}
close L;
close R;

sub splitRead {
    my ($id, $seq, $sep, $qual) = @_;
    my ($lseq, $rseq, $lqual, $rqual);
    if ($seq =~ m/$linkA$linkA/) {
        ($lseq, $rseq) = split (/$linkA$linkA/, $seq);
        $lqual = substr ($qual, 0, length $lseq);
        $rqual = substr ($qual, (2 * length $linkA) + length $lseq);
    }
    elsif ($seq =~ m/$linkB$linkB/) {
        ($lseq, $rseq) = split (/$linkB$linkB/, $seq);
        $lqual = substr ($qual, 0, length $lseq);
        $rqual = substr ($qual, (2 * length $linkB) + length $lseq);
    }
    else {
        warn "ERROR: read doesn't contain the linkers:\n$id $seq $sep $qual\n";
        return;
    }
    print L "$id\n$lseq\n$sep\n$lqual\n";
    print R "$id\n$rseq\n$sep\n$rqual\n";
    return;
}

Then, you run as:

perl splitReads.pl < file.fastq
ADD COMMENTlink written 5.4 years ago by JC7.0k

thanks for the help. Unfortunately I was wrong in my description. The combination of linkAlinkA is not correct. What I need is revcomp(linkA)linkA and rhe same for linker B. The reverse complement function I have already found:

sub reverse_complement {
        my $dna = shift;
    # reverse the DNA sequence
        my $revcomp = reverse($dna);    
    # complement the reversed DNA sequence
        $revcomp =~ tr/ACGTacgt/TGCAtgca/;
        return $revcomp;
}

To make the script works better I would also like to add two things (if possible)

  1. Is it possible to not only split the reads in the middle, but also trim them? I would like to have for the left linker also 20 bases before the linker and for the right linker I would like to have the 20 bases after the linker sequences.

  2. Is it possible to add an option for possible mismatches? Sometimes due to sequencing errors the linker sequences is not as exact as given in the pattern for linker A and B. Is it possible to add an option for adding mismatches?

Thanks Assa

ADD REPLYlink modified 5.4 years ago • written 5.4 years ago by Assa Yeroslaviz1.1k
1
gravatar for SRKR
5.1 years ago by
SRKR160
Visakhapatnam
SRKR160 wrote:

You can use simple regular expression to get it done. This is the regular expression you need to search for

(@[^\n]{1,}\n)([ATGC]{1,})CCGCGATAT[ATGC]{4}TCCAAC([ATGC]{1,})([\s]{1,}\+[\s]{1,}[^\n]{1,}\n)

Then simply replace it with

\1\2\4\1\3\4

This will remove your linker sequence completely and separate the sequences into two with same detail lines. The result for your query will be the following:

@HWI-ST225:523:D1AY5ACXX:8:1101:1566:2149 1:N:0:GTGGCC
GCATACCCTCCCTGTCTCAGTTGCTGTTGAAAGAAGAAATCCGCGATATCTTATCCAAC
+
??@DDDDDHHHHHIGIHGHICHEGAHC<HHII9?FB3?F4C<FDD>0?9D*99*??D@;AA9=?'''3@@>@A@>::?B?2?-?CC3<A??8?B@B#####
@HWI-ST225:523:D1AY5ACXX:8:1101:1566:2149 1:N:0:GTGGCC
GAAGCCAAAACCCTCGCAGTCTG
+
??@DDDDDHHHHHIGIHGHICHEGAHC<HHII9?FB3?F4C<FDD>0?9D*99*??D@;AA9=?'''3@@>@A@>::?B?2?-?CC3<A??8?B@B#####
@HWI-ST225:523:D1AY5ACXX:8:1101:2000:2191 1:N:0:GTGGCC
TTGCTTGTTGCGTGTCTCAGGCGGAAAAACGCCAAAATCAACCGCGATATACATTCCAAC
+
@@CFFFFFHHHHFGFIIJJJIIIHHIJJIIGGGGIEHICHAEDEHHHHADDB9?BEDDDBBDDDDBBB>DDC@@C@CCDBBD05>CCDACDDDDDDC:3>>
@HWI-ST225:523:D1AY5ACXX:8:1101:2000:2191 1:N:0:GTGGCC
GTCAGCTCTGAGCTGCTGATCT
+
@@CFFFFFHHHHFGFIIJJJIIIHHIJJIIGGGGIEHICHAEDEHHHHADDB9?BEDDDBBDDDDBBB>DDC@@C@CCDBBD05>CCDACDDDDDDC:3>>

You can do this regular expression replacement in any of the languages. I use PHP. Hope it's useful.

ADD COMMENTlink written 5.1 years ago by SRKR160

Hi, thanks for that, But I do need the linker sequence for further down-stream analysis. But I can still do it the same way, just without loosing the linkers.

ADD REPLYlink written 5.1 years ago by Assa Yeroslaviz1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1724 users visited in the last hour