Has anyone written a script to trim fastq files? I read in another post somewhere that these scripts are a dime a dozen. So, could we share them to prevent people from eternally reinventing the wheel?
Here's mine so far, which is a modified version of a modified version of Richard Mott's trimming algorithm used in CLC Genomics Workbench:
def mott(dna, **kwargs): ''' Trims a sequence using modified Richard Mott algorithm ''' try: limit = kwargs['limit'] except KeyError: limit = 0.05 tseq =  S = 0 pos = 0 for q, n in zip(dna.qual, dna.seq): pos += 1 #if q == 'B': continue Q = ord(q) - 64 p = 10**(Q/-10.0) s = S S += limit - p if S < 0: S = 0 #print '%-3s %-3s %-3s %-3s %4.3f %6.3f %7.3f' % (pos, n, q, Q, p, limit - p, S), if s > S: break else: tseq.append(n) dna.seq = ''.join(tseq) dna.qual = dna.qual[0:len(dna.seq)] return dna
I should mention that
dna is an object I've created which has some properties:
seq is the sequence and
qual is the quality scores (their ascii representation). This algorithm only works for Illumina 1.5 PHRED scores.
I'm using it in a pipeline for metagenome analysis of Illumina data. I'm also writing a median sliding window algorithm to see if that works better.