Fastq Extract Under-Represented Sequences
11.3 years ago

Hi everyone,

I have a FASTQ sample in which almost 57% of the reads are over-represented sequences related to one linker, which suggests that something went wrong during library preparation. Beyond that, I want to retrieve the remaining reads (roughly 40%) and run FastQC on them, to check whether library preparation with the other linkers went properly.

So, do you know of any tool or script that could extract the sequences that do not correspond to the linker causing this over-representation?

The linker that went wrong has the following sequence: TCGTAT.............TTG

Thanks,

fastqc • 1.7k views
11.3 years ago
JC 13k

If the linker is expected to always be at the start of the read sequence, you can use a Perl script like this:

#!/usr/bin/perl

use strict;
use warnings;

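my $nl = 0;                             # line counter within the current record
my $match = 'TCGTAT.............TTG';   # linker pattern; each '.' matches any base
my ($read, $seq) = ('', '');

while (<>) {
    $nl++;
    $read .= $_;                        # accumulate the 4 lines of the record
    if ($nl == 2) {
        $seq = $_;                      # line 2 of each record is the sequence
    }
    elsif ($nl == 4) {
        # print the record unless its sequence starts with the linker
        print $read unless ($seq =~ m/^$match/);
        $read = '';
        $nl = 0;
    }
}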

Save it as filter.pl, then you can run:

perl filter.pl < file.fq > filtered.fq
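
If you would rather do this in Python, here is a rough sketch of the same idea. It is not a tested pipeline: it assumes strict 4-line FASTQ records (no wrapped sequence lines), and the function name `filter_fastq` and the `anchored` switch are my own additions for illustration. With `anchored=False` it also drops reads that carry the linker anywhere in the sequence, not only at the start. It returns the read counts so you can compare the removed fraction with the ~57% FastQC reported. The two toy reads at the bottom are made up.

```python
import io
import re

# Linker pattern from the question; each '.' matches any single base.
LINKER = re.compile(r"TCGTAT.............TTG")


def filter_fastq(src, dst, anchored=True):
    """Copy FASTQ records without the linker from src to dst.

    Assumes strict 4-line records (no wrapped sequence lines).
    Returns (total_reads, kept_reads).
    """
    # match() only tests the start of the read; search() scans the whole read.
    find = LINKER.match if anchored else LINKER.search
    total = kept = 0
    while True:
        record = [src.readline() for _ in range(4)]
        if not record[3]:  # end of input, or a truncated final record
            break
        total += 1
        if not find(record[1]):  # record[1] is the sequence line
            dst.writelines(record)
            kept += 1
    return total, kept


# Toy input, made up for illustration: read1 starts with the linker, read2 does not.
with_linker = "TCGTAT" + "ACGTACGTACGTA" + "TTG" + "GGCC"
without_linker = "GGCCTTAA" * 3 + "GG"
demo_in = io.StringIO(
    f"@read1\n{with_linker}\n+\n{'I' * 26}\n"
    f"@read2\n{without_linker}\n+\n{'I' * 26}\n"
)
demo_out = io.StringIO()
total, kept = filter_fastq(demo_in, demo_out)
print(f"kept {kept} of {total} reads")
```

On real data you would pass open file handles instead of the `StringIO` objects, for example `open("file.fq")` as `src` and `open("filtered.fq", "w")` as `dst`, and then run FastQC on filtered.fq.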