Trim Certain Nucleotides Off Fastq Reads
3
0
Entering edit mode
11.4 years ago
bbio ▴ 90

Let's say I have a fastq file with NGS reads and I want to trim off all trailing Cs, such that ACGTCCC becomes ACGT but ACGTCCG stays the same. I do not want to trim by quality or anything like that and the amount of Cs may differ from read to read. Are there any tools out there that are capable of doing this? Also, are there any potential problems I might run into when I then align these reads to a genome using BWA?

fastq next-gen trimming • 4.2k views
ADD COMMENT
0
Entering edit mode

If the poly-C tails are product of errors in sequencing (and maybe it's true), it's better to trim your reads before mapping.

ADD REPLY
2
Entering edit mode
11.4 years ago

I don't know of a tool but it is something that one could write in python/perl/ruby, here is an attempt to do it looks like it would take 10 seconds per million of reads:

import re, string, itertools

# splitting pattern
patt = re.compile("C+$")

# open stream
stream = file("data.fq")

# strip ending new lines
stream = itertools.imap(string.strip, stream)

# loop over the file four lines at a time
for id in stream:
    seq  = stream.next()
    tmp  = stream.next()
    qual = stream.next()

    # attempt to split the string at pattern
    seq = re.split(patt, seq)[0]

    # resize quality string to match sequence
    qual = qual[:len(seq)]

    # output the trimmed fastq
    print id
    print seq
    print tmp
    print qual
ADD COMMENT
0
Entering edit mode

benchmark with 1 million reads, 50b long, my Perl (v5.10.0) code vs your Python (v2.6.1) code

 $ time perl trim.pl < data.fq > out1
 real 0m10.211s
 user 0m6.857s
 sys 0m0.668s
 $ time python trim.py > out2
 real 0m15.798s
 user 0m13.552s
 sys 0m0.607s

Data used: https://dl.dropbox.com/u/386663/data.fq.gz (46 MB)

ADD REPLY
0
Entering edit mode

Tried to optimize this alas as I discovered python has innate disadvantage at the IO level. It seems that a simple code that strips each line and outputs it again runs almost three times slower in python than perl.

import sys
for line in sys.stdin:
    line = line.strip()
    print line

vs the equivalent:

while (<>) {
   chomp;
   print "$_\n";
}

will yield

$ time python ia2.py < data.fq  > a

real    0m4.222s
user    0m4.149s
sys     0m0.030s

$ time perl jc1.pl < data.fq  > b

real    0m1.554s
user    0m1.404s
sys     0m0.124s

so that's a starting disadvantage I cannot optimize nor can I overcome.

ADD REPLY
0
Entering edit mode

Great, thank you both very much! I probably should have just written it myself in the first place, I was just wondering if there was a proper tool for this out there already.

ADD REPLY
0
Entering edit mode

Well, Perl won this one, last time you and Matted crushed it.

ADD REPLY
2
Entering edit mode
11.4 years ago
JC 13k

Perl option:

#!/usr/bin/perl

use strict;
use warnings;

my $nl  = 0;
my $pos = 0;
my $len = 0;
while (<>) {
    chomp;
    $nl++;
    if ($nl == 2) {
        if (m/C+$/) {
            $pos = $-[0];
            $len = $+[0] - $-[0];
            substr($_, $pos, $len) = '';
        }
        else {
            $len = 0;
        }
    }
    elsif ($nl == 4) {
        if ($len > 0) {
            substr($_, $pos, $len) = '';
        }
        $nl = 0;
   }
   print "$_\n";
}
ADD COMMENT
1
Entering edit mode
8.4 years ago

Using BBDuk:

bbduk.sh in=reads.fq out=trimmed.fq ktrim=r k=31 mink=1 rcomp=f literal=CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

real    0m1.464s
user    0m5.039s
sys     0m0.781s
ADD COMMENT

Login before adding your answer.

Traffic: 1467 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6