Illumina Trimming Algorithm
13.7 years ago
Science_Robot ★ 1.1k

Has anyone written a script to trim fastq files? I read in another post somewhere that these scripts are a dime a dozen. So, could we share them to prevent people from eternally reinventing the wheel?

Here's mine so far, which is a modified version of a modified version of Richard Mott's trimming algorithm used in CLC Genomics Workbench:

def mott(dna, **kwargs):
    ''' Trim a sequence using a modified Richard Mott algorithm.
        Expects Illumina 1.5 qualities (Phred+64 ASCII offset). '''
    limit = kwargs.get('limit', 0.05)
    tseq = []
    S = 0.0                          # running sum of (limit - error probability), floored at 0
    for q, n in zip(dna.qual, dna.seq):
        Q = ord(q) - 64              # Phred quality from the Illumina 1.5 encoding
        p = 10 ** (Q / -10.0)        # probability that this base call is wrong
        s = S                        # remember the previous running sum
        S += limit - p
        if S < 0:
            S = 0
        if s > S:                    # the sum has started to fall: trim here
            break
        tseq.append(n)
    dna.seq = ''.join(tseq)
    dna.qual = dna.qual[:len(dna.seq)]
    return dna

I should mention that dna is an object I've created with two relevant properties: seq is the sequence and qual is the quality string (the ASCII representation of the scores). Because of the hard-coded offset of 64, this algorithm only works for Illumina 1.5 Phred scores.
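
For anyone who wants to run the snippet, here is a rough sketch of what such an object and a matching FASTQ reader could look like (these are illustrative placeholders, not my actual classes):

class FastqRecord(object):
    ''' Minimal stand-in for the dna object above: just a name, seq and qual. '''
    def __init__(self, name, seq, qual):
        self.name = name
        self.seq = seq
        self.qual = qual

def read_fastq(path):
    ''' Yield FastqRecord objects from a plain 4-line-per-record FASTQ file. '''
    with open(path) as handle:
        while True:
            header = handle.readline().rstrip()
            if not header:
                break
            seq = handle.readline().rstrip()
            handle.readline()                   # the '+' separator line
            qual = handle.readline().rstrip()
            yield FastqRecord(header.lstrip('@'), seq, qual)

Trimming a file is then just a matter of looping over read_fastq() and calling mott() on each record.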

I'm using it in a pipeline for metagenome analysis of Illumina data. I'm also writing a median sliding window algorithm to see if that works better.
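
For what it's worth, the sliding-window version I have in mind looks roughly like this (an untested sketch; the window size and cutoff are arbitrary defaults, and it assumes the same Phred+64 encoding as above):

def median_window_trim(dna, window=10, cutoff=20, offset=64):
    ''' Sketch: cut the read at the first window whose median quality
        drops below the cutoff. '''
    quals = [ord(q) - offset for q in dna.qual]
    end = len(quals)
    for start in range(len(quals) - window + 1):
        win = sorted(quals[start:start + window])
        if win[window // 2] < cutoff:       # median of this window
            end = start                     # trim from this window onwards
            break
    dna.seq = dna.seq[:end]
    dna.qual = dna.qual[:end]
    return dna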

python trimming illumina

what's wrong with the stuff in fastx toolkit?


I'd also suggest that whilst the code golf is fine, and posting code to ask for help/suggestions is fine, dumping code snippets here when there are a dozen implementations elsewhere is not what BioStar should be used for. Stick it on github.


I don't know who Richard Mott is, but it would be helpful if you described the algorithm in English, e.g. "this trims all base pairs to the right of the first position where quality drops more than 'limit'".


It would be better to choose existing software for Illumina trimming rather than reinvent the wheel. Considering up-to-date precision and speed benchmarks, I'd recommend Atria.

13.0 years ago
Vince Buffalo ▴ 470

Our group (the UC Davis Bioinformatics Core) has some internal tools we've developed. The first is Scythe, an adapter trimmer. It uses the quality information in a FASTQ entry and a prior to decide whether a 3' substring is adapter. Very basically, it takes a naïve Bayesian approach to classifying 3'-end contaminants only. Because these are the poorest-quality bases and the most likely to be contaminated (especially as reads get longer and longer), Scythe is designed specifically to remove these contaminants; removing other contaminants can be done with other tools. The prior can be set to different thresholds; I recommend experimenting with it to get a sense of the 3'-end adapter contaminant rate. If you're doing an assembly, you may want very, very strict trimming, i.e. if the adapter contamination seems high and the 3'-end adapter begins with GATC, removing all 3'-end GAT, GA, and GATC substrings (as well as all the longer, more likely matches). We find this works well in our projects, and feedback is welcome.
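
To make the "naïve Bayesian approach" a little more concrete: the comparison is essentially between an "adapter" model, where mismatches are explained by the base qualities, and a "random sequence" model, weighted by the prior. A toy sketch of that comparison (an illustration of the idea only, not Scythe's actual code; the function name and the Phred+33 offset here are just for illustration):

import math

def adapter_posterior(read_tail, tail_quals, adapter, prior=0.05, offset=33):
    ''' Toy illustration: posterior probability that a 3'-end substring is
        adapter, given its base qualities and a prior contamination rate. '''
    log_adapter = math.log(prior)           # P(adapter) * P(data | adapter)
    log_random = math.log(1.0 - prior)      # P(random)  * P(data | random)
    for base, q, a in zip(read_tail, tail_quals, adapter):
        p_err = min(10 ** ((ord(q) - offset) / -10.0), 0.75)  # cap so the logs stay finite
        if base == a:
            log_adapter += math.log(1.0 - p_err)   # match expected under the adapter model
        else:
            log_adapter += math.log(p_err / 3.0)   # mismatch blamed on a sequencing error
        log_random += math.log(0.25)               # any base equally likely in random sequence
    p_adapter = math.exp(log_adapter)
    p_random = math.exp(log_random)
    return p_adapter / (p_adapter + p_random)

If the posterior comes out above 0.5, the tail would be called contaminant and clipped.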

Sickle is a sliding-window quality trimmer, designed to be used after Scythe. Unlike cutadapt and some other tools, our pipelines remove adapter contaminants before quality trimming, as removing poor-quality bases first throws away useful information that could help identify a 3'-end adapter contaminant. Thus, our quality control workflow is: first look at preliminary quality checks via my qrqc package, then run Scythe to remove adapters, then do quality trimming with Sickle, then do another qrqc run to see how the sequences have changed.

Sometimes Scythe seems a bit greedy, but on further inspection it's almost always "too" greedy with poor-quality sequences, which would be removed by Sickle anyway.

Nik (a member of our group) also wrote Sabre, a barcode demultiplexing and trimming tool (I think this is still alpha, though).

Any feedback would be appreciated!


Is there a page with the whole workflow for paired-end data, in code or pseudocode? In other words, the code for scythe, then the code needed for sickle (and maybe sabre), followed by an assembler, e.g. VelvetOptimiser? Also, some files that are mentioned, like the adapter file, are not well described, even though I have a good hint that it is simply a fasta file of the adapter sequences.

13.7 years ago
Neilfws 49k

A quick Google search for "fastq trim" throws up a long list of tools.

"A dime a dozen" sounds about right :-) So by all means, share here, but people should be aware that there are plenty of solutions "out there".


"Trimming" is a catch-all term for several types of operation, including removal of low quality bases, unwanted sequence (e.g adapter/primer/vector/barcodes) and homopolymers (e.g. poly A), or a combination of these.

A universal trimmer would need parameters for different quality encodings, trim position limits and sequence matching algorithms, not to mention being very fast and reliable. I think this helps to explain the explosion of "local" solutions.

Expect more of them if people start to move to BAM format for unaligned reads.


Fastx has uneven documentation. The FASTQ/A trimmer (fastx_trimmer) is not quality based. The fastq_quality_trimmer is quality based, but is not actually mentioned on that page.

12.3 years ago
lh3 33k

As audyyy's code specifically does quality trimming, I will comment on the quality trimming only.

Phred uses the so-called modified Richard Mott algorithm; the Phred documentation gives the implementation details. It trims both ends. The BWA trimming algorithm is largely learned from Phred, with modifications to trim from one end only. For quality-based trimming, the Phred or the BWA algorithm is preferred. Both of them have a single user-defined parameter, they are fast (a one-pass scan of the quality string), and they are fairly easy to implement (the core algorithm is about 10 lines of code).
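
For reference, the one-ended BWA-style trim in Python looks roughly like this (a sketch of the idea only, not the C code; a Phred+33 offset and a threshold of 20 are assumed here):

def bwa_style_trim(seq, qual, threshold=20, offset=33):
    ''' Scan from the 3' end, accumulate (threshold - q), and cut at the
        position where the running sum is maximal; stop once it goes negative. '''
    best, running, cut = 0, 0, len(seq)
    for i in range(len(seq) - 1, -1, -1):
        running += threshold - (ord(qual[i]) - offset)
        if running < 0:                 # reached a high-quality region: stop scanning
            break
        if running > best:
            best, cut = running, i
    return seq[:cut], qual[:cut]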

Fastx trims bases with quality below a threshold. Although this algorithm may be okay for Illumina data, it is not as robust as the Phred or the BWA algorithm. There are also a variety of window-based trimming methods. The problem with these is that they have more parameters and are thus less flexible. My feeling is that they are also harder to implement if we want to achieve the speed and quality of the Phred or BWA algorithm (although I have not tried).

I have just added quality-based trimming to seqtk. It uses the Phred algorithm to trim both ends. If, after trimming, the read length is below 30bp (by default), it instead tries to find the 30bp window with the highest sum of quality.
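
The fallback step is just a maximum-sum window over the quality string. A rough Python sketch of the idea (not the seqtk C implementation; a Phred+33 offset is assumed):

def best_window(qual, size=30, offset=33):
    ''' Return (start, end) of the length-`size` window with the highest
        total quality, using a rolling sum. '''
    scores = [ord(q) - offset for q in qual]
    if len(scores) <= size:
        return 0, len(scores)
    window_sum = sum(scores[:size])
    best_sum, best_start = window_sum, 0
    for start in range(1, len(scores) - size + 1):
        window_sum += scores[start + size - 1] - scores[start - 1]
        if window_sum > best_sum:
            best_sum, best_start = window_sum, start
    return best_start, best_start + size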

Seqtk is much faster than the existing stand-alone trimmers. It takes 46 CPU seconds to trim a fastq with 100 million lines, while fastq_quality_trimmer takes more than 500 CPU seconds, over 10 times slower. A trimmer in Python/Perl would be even slower, as scripting languages are usually very inefficient at looping over each character in a string. You may try.

12.3 years ago
Lee Katz ★ 3.1k

I wrote my own Perl subroutine. The arguments for the subroutine are the shuffled fastq file and a hash reference with options. The options are: numcpus, min_quality (for trimming only), bases_to_trim (max bases that will be trimmed on either side), min_avg_quality, and min_avg_quality_window (window of bases to average together). There are defaults for each of these options. It returns a reference to an array consisting of fastq paired entries (8 lines per array element).
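
To give a rough idea of how those options fit together, here is a small Python sketch of the same kinds of checks (this is not my Perl code, and the actual subroutine behaves somewhat differently; the defaults below are arbitrary):

def trim_and_filter(seq, qual, min_quality=20, bases_to_trim=20,
                    min_avg_quality=25, min_avg_quality_window=10, offset=33):
    ''' Rough sketch: clip low-quality bases from either end, at most
        bases_to_trim per side, then reject the read if any window of
        min_avg_quality_window bases averages below min_avg_quality. '''
    scores = [ord(q) - offset for q in qual]
    start, end = 0, len(scores)
    while start < bases_to_trim and start < end and scores[start] < min_quality:
        start += 1
    while len(scores) - end < bases_to_trim and end > start and scores[end - 1] < min_quality:
        end -= 1
    kept = scores[start:end]
    for i in range(max(1, len(kept) - min_avg_quality_window + 1)):
        window = kept[i:i + min_avg_quality_window]
        if window and sum(window) < min_avg_quality * len(window):
            return None                       # read fails the average-quality filter
    return seq[start:end], qual[start:end]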

The subroutine relies on two helper subroutines that you can write on your own: logmsg, which prints out a status message, and LKUtils::getNumCPUs(), which you can skip if you supply the number of CPUs yourself.

I can't say that this is the best piece of work, and it lacks a way to take care of singletons like the UC Davis software does. However, it looks like it is working, and it is reasonably fast because of multithreading. Taking care of singletons in a large dataset could be a minor issue. Because I am not the best programmer and have only tested it a little, any feedback is appreciated.
