Question

Fastq Extract Under-Represented Sequences

11.3 years ago

daniel.soronellas ▴ 330

Hi everyone,

I have a FASTQ sample in which almost 57% of the reads are over-represented sequences related to one linker, which suggests that something went wrong during library preparation. Beyond that, I want to retrieve the remaining ~40% of reads to run a FastQC analysis and check whether library preparation with the other linkers went properly.

So, do you know of any tool or script that could extract the sequences that do not correspond to the linker causing this over-representation?

The linker that went wrong has the following sequence: TCGTAT.............TTG

Thanks,

fastqc • 1.7k views

updated 11.3 years ago by JC 13k • written 11.3 years ago by daniel.soronellas ▴ 330

score 0 · Answer 1 · 2012-12-19

If the linker is expected to always be the first bases of your read sequence, you can use a Perl script like this:

#!/usr/bin/perl

use strict;
use warnings;

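my $nl = 0;                              # line counter within the current 4-line record
my $match = 'TCGTAT.............TTG';    # linker pattern; each '.' matches any base
my ($read, $seq) = ('', '');

while (<>) {
    $nl++;
    $read .= $_;                         # accumulate the full record
    if ($nl == 2) {
        $seq = $_;                       # line 2 of each record is the sequence
    }
    elsif ($nl == 4) {
        # print the record only if the sequence does not start with the linker
        print $read unless ($seq =~ m/^$match/);
        $read = '';
        $nl = 0;
    }
}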

Save it as filter.pl, then you can run:

perl filter.pl < file.fq > filtered.fq
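
As a quick sanity check after filtering, something like the following could be used to compare read counts before and after. This is only a sketch: `count_reads`, `count_linker_reads`, and the `LINKER` variable are hypothetical names, and it assumes every record spans exactly four lines and that the linker sits at the start of the read.

```shell
# Default pattern is the linker from the question; each '.' matches any base.
# (Assumption: the linker is anchored at the start of the read.)
LINKER='^TCGTAT.............TTG'

# Count FASTQ records, assuming exactly 4 lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Count records whose sequence line (line 2 of each record) matches the pattern.
# An optional second argument overrides the default linker pattern.
count_linker_reads() {
    awk -v pat="${2:-$LINKER}" \
        'NR % 4 == 2 && $0 ~ pat { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_reads file.fq` minus `count_linker_reads file.fq` should equal `count_reads filtered.fq`, and `count_linker_reads filtered.fq` should report 0.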