Question: Illumina Trimming Algorithm
Science_Robot (Gainesville, FL) wrote, 10.4 years ago:

Has anyone written a script to trim fastq files? I read in another post somewhere that these scripts are a dime a dozen. So, could we share them to prevent people from eternally reinventing the wheel?

Here's mine so far, which is a modified version of a modified version of Richard Mott's trimming algorithm used in CLC Genomics Workbench:

def mott(dna, **kwargs):
    ''' Trims a sequence using modified Richard Mott algorithm '''
    try:
        limit = kwargs['limit']
    except KeyError:
        limit = 0.05
    tseq = []
    S = 0
    pos = 0
    for q, n in zip(dna.qual, dna.seq):
        pos += 1
        #if q == 'B': continue
        Q = ord(q) - 64
        p = 10**(Q/-10.0)
        s = S
        S += limit - p
        if S < 0: S = 0
        #print '%-3s %-3s %-3s %-3s %4.3f %6.3f %7.3f' % (pos, n, q, Q, p, limit - p, S),
        if s > S:
            break
        tseq.append(n)
    dna.seq = ''.join(tseq)
    dna.qual = dna.qual[0:len(dna.seq)]
    return dna

I should mention that dna is an object I've created with two properties: seq is the sequence and qual is the quality scores (their ASCII representation). This algorithm only works for Illumina 1.5 PHRED scores (ASCII offset 64).
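For readers unfamiliar with the encoding, here is a small illustrative sketch of what the `ord(q) - 64` and `10**(Q/-10.0)` lines are doing when they decode a quality character (the function names are mine, just for illustration):

```python
# Illustrative decoding of one ASCII quality character.
# 'h' is Q40 under the Illumina 1.3/1.5 offset of 64,
# but Q71 under the Sanger/Illumina 1.8+ offset of 33.
def phred_from_char(c, offset=64):
    '''Convert a quality character to a Phred score.'''
    return ord(c) - offset

def error_prob(q):
    '''Phred score -> probability the base call is wrong.'''
    return 10 ** (q / -10.0)

print(phred_from_char('h'))             # 40
print(phred_from_char('h', offset=33))  # 71
print(error_prob(40))                   # 0.0001
```

This is also why the code above is tied to Illumina 1.5: running it on offset-33 data would inflate every quality score by 31.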

I'm using it in a pipeline for metagenome analysis of Illumina data. I'm also writing a median sliding window algorithm to see if that works better.
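For what it's worth, a rough sketch of the median sliding-window idea (the window size, cutoff, and function name here are arbitrary illustrative choices, not the OP's actual code):

```python
from statistics import median

def median_window_trim(quals, window=5, cutoff=20):
    '''Sketch: scan windows left to right and cut the read at the
    start of the first window whose median quality falls below the
    cutoff. Returns the number of bases to keep from the 5' end.'''
    for i in range(len(quals) - window + 1):
        if median(quals[i:i + window]) < cutoff:
            return i
    return len(quals)

# Quality decays toward the 3' end; the first low-median window
# starts at position 4, so 4 bases are kept.
quals = [38, 40, 39, 37, 35, 30, 12, 10, 8, 2]
print(median_window_trim(quals))  # 4
```

A median (rather than mean) window is less sensitive to a single aberrant base inside an otherwise good window, at the cost of being a bit slower to compute.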

illumina trimming python • 15k views

What's wrong with the stuff in the FASTX-Toolkit?

— brentp, 10.4 years ago

I'd also suggest that whilst the code golf is fine, and posting code to ask for help/suggestions is fine, dumping code snippets here when there are a dozen implementations elsewhere is not what BioStar should be used for. Stick it on GitHub.

— Daniel Swan, 10.4 years ago

I don't know who Richard Mott is, but it would be helpful if you described the algorithm in English, e.g. "this trims all base pairs to the right of the first position where quality drops more than 'limit'".

— Jeremy Leipzig, 10.4 years ago
Vince Buffalo (Davis, CA) wrote, 9.6 years ago:

Our group (the UC Davis Bioinformatics Core) has some internal tools we've developed. The first is Scythe, an adapter trimmer. It uses the quality information in a FASTQ entry and a prior to decide whether a 3' substring is adapter; very basically, it takes a naïve Bayesian approach to classifying 3'-end contaminants only. Because these are the poorest-quality bases and the most likely to be contaminated (especially as reads get longer and longer), Scythe is designed specifically to remove these contaminants; removing other contaminants can be done with other tools. The prior can be set to different thresholds; I recommend using less to get a sense of the 3'-end adapter contaminant rate. If you're doing an assembly, you may want very, very strict trimming, i.e. if the adapter contamination seems high and the 3'-end adapter begins with GATC, removing all 3'-end GA, GAT, and GATC substrings (as well as all the longer, more likely matches). We find this works well in our projects, and feedback is welcome.
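Roughly, the naïve Bayes idea might be sketched like this (this is my paraphrase of the general approach, not Scythe's actual model or code; the prior, adapter sequence, and function name are all illustrative):

```python
def contaminant_posterior(read_tail, adapter, quals, prior=0.05):
    '''Sketch of a naive-Bayes test for 3'-end adapter contamination.
    Under the "contaminant" model, a position matches the adapter
    unless the base call is wrong (error probability from its Phred
    quality); under the "random sequence" model, each base matches
    with probability 1/4. Illustrative only, not Scythe's real model.'''
    like_c = 1.0  # P(observed match pattern | contaminant)
    like_r = 1.0  # P(observed match pattern | random sequence)
    for base, abase, q in zip(read_tail, adapter, quals):
        p_err = 10 ** (q / -10.0)
        if base == abase:
            like_c *= 1.0 - p_err
            like_r *= 0.25
        else:
            like_c *= p_err / 3.0  # miscall to one of 3 other bases
            like_r *= 0.75
    num = like_c * prior
    return num / (num + like_r * (1.0 - prior))

# A perfect 8 bp match at Q30 looks like adapter even with a 5% prior:
print(contaminant_posterior('GATCGGAA', 'GATCGGAA', [30] * 8) > 0.5)  # True
```

The appeal of using quality here is that a "mismatch" at a Q2 base carries almost no evidence against contamination, whereas a mismatch at Q40 does.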

Sickle is a sliding-window quality trimmer, designed to be used after Scythe. Unlike cutadapt and some other tools, our pipelines remove adapter contaminants before quality trimming, since trimming poor-quality bases first throws away information that could be used to identify a 3'-end adapter contaminant. Thus, our quality control workflow is: first run preliminary quality checks via my qrqc package, then run Scythe to remove adapters, then quality-trim with Sickle, then do another qrqc run to see how the sequences have changed.

Sometimes Scythe seems a bit greedy, but upon further inspection it's almost always "too" greedy with poor quality sequences, which would be removed by Sickle anyways.

Nik (a member of our group) also wrote Sabre, a barcode demultiplexing and trimming tool (I think this is still alpha, though).

Any feedback would be appreciated!


Is there a page with the whole workflow for paired end data in code or pseudocode? In other words, the code for scythe, then the code needed for sickle, (and maybe sabre), followed by an assembler, eg VelvetOptimiser? Also some files that are mentioned like the adapter file are not well described, even though I have a good hint that it is a simple fasta file of the adapter sequences.

— Lee Katz, 8.9 years ago
Neilfws (Sydney, Australia) wrote, 10.4 years ago:

A quick Google search for "fastq trim" throws up plenty of results.

"A dime a dozen" sounds about right :-) So by all means, share here, but people should be aware that there are plenty of solutions "out there".


"Trimming" is a catch-all term for several types of operation, including removal of low quality bases, unwanted sequence (e.g adapter/primer/vector/barcodes) and homopolymers (e.g. poly A), or a combination of these.

A universal trimmer would need parameters for different quality encodings, trim position limits and sequence matching algorithms, not to mention being very fast and reliable. I think this helps to explain the explosion of "local" solutions.

Expect more of them if people start to move to BAM format for unaligned reads.
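To make "parameters for different quality encodings" concrete, a crude heuristic sketch (illustrative, not production code) for guessing the ASCII offset of a batch of quality strings:

```python
def guess_offset(quals):
    '''Crude heuristic: any character with ord < 59 (below ';') can
    only occur under the Sanger/Illumina 1.8+ encoding (offset 33);
    otherwise assume the older Illumina offset of 64. Real data may
    need many reads before a low character appears, so this can
    misclassify small or uniformly high-quality samples.'''
    lo = min(ord(c) for s in quals for c in s)
    return 33 if lo < 59 else 64

print(guess_offset(['IIII#III']))  # '#' is ord 35 -> 33
print(guess_offset(['hhhhBBhh']))  # lowest is 'B' (66) -> 64
```

A universal trimmer would need this sort of detection (or an explicit flag) before any quality arithmetic is meaningful.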

— biobot, 10.4 years ago

Fastx has uneven documentation. The FastQ/A trimmer (fastx_trimmer) is not quality-based. The fastq_quality_trimmer is quality-based, but is not actually mentioned on this page.

— Jeremy Leipzig, 10.4 years ago
lh3 (United States) wrote, 8.9 years ago:

As audyyy's code specifically does quality trimming, I will comment on quality trimming only.

Phred uses the so-called modified Richard Mott algorithm. The Phred documentation gives the implementation details. It trims both ends. The BWA trimming algorithm is largely learned from Phred, with modifications to trim from one end only. For quality-based trimming, the Phred or the BWA algorithm is preferred. Both have a single user-defined parameter. They are fast (a one-pass scan of the quality string) and fairly easy to implement (the core algorithm fits in 10 lines of code).
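As a sketch of how short that core is, here is my paraphrase of the BWA-style 3'-end trim in Python (details such as tie-breaking may differ from BWA's actual source):

```python
def bwa_trim_length(quals, threshold=20):
    '''Scan from the 3' end, accumulating (threshold - q); the cut
    point is the position that maximizes the running sum, i.e. the
    boundary of the longest/worst low-quality tail. Returns how many
    bases to keep from the 5' end. Single parameter: threshold.'''
    best, keep = 0, len(quals)
    s = 0
    for i in range(len(quals) - 1, -1, -1):
        s += threshold - quals[i]
        if s < 0:
            break  # quality has recovered; nothing left to trim
        if s > best:
            best, keep = s, i
    return keep

# Ten good Q35 bases followed by a Q2 tail: the tail is trimmed.
print(bwa_trim_length([35] * 10 + [2] * 5))  # 10
```

Like the Mott code above, it is a one-pass scan of the quality string with a single user-defined parameter.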

Fastx trims bases with quality below a threshold. Although for Illumina this algorithm may be okay, it is not as robust as the Phred or the BWA algorithm. There are also a variety of window-based trimming methods. The problem with these algorithms is that they have more parameters and are thus less flexible. My feeling is they are also harder to implement if we want to achieve the speed and quality of the Phred or BWA algorithm (although I have not tried).

I have just added quality-based trimming to seqtk. It uses the Phred algorithm to trim both ends. If, after trimming, the read length is below 30 bp (the default), it tries to find a 30 bp window with the highest sum of quality.
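That fallback step is essentially a fixed-size maximum-sum window; a sketch of the idea (a rolling-sum illustration, not seqtk's code):

```python
def best_window(quals, size=30):
    '''Return (start, end) of the length-`size` window with the
    highest sum of qualities, using a rolling sum so the whole
    read is scanned in one pass.'''
    if len(quals) <= size:
        return 0, len(quals)
    s = sum(quals[:size])
    best, start = s, 0
    for i in range(size, len(quals)):
        s += quals[i] - quals[i - size]  # slide the window right
        if s > best:
            best, start = s, i - size + 1
    return start, start + size

# Low-quality flanks around a 30 bp high-quality core:
quals = [2] * 10 + [40] * 30 + [2] * 10
print(best_window(quals))  # (10, 40)
```

The rolling sum keeps this O(n), so the fallback does not hurt the one-pass speed much.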

Seqtk is much faster than the existing stand-alone trimmers. It takes 46 CPU seconds to trim a fastq with 100 million lines, while fastx_quality_trimmer takes more than 500 CPU seconds - over 10 times slower. A trimmer in Python/Perl would be even slower, as scripting languages are usually very inefficient at looping over each character in a string. You may try.

Lee Katz (Atlanta, GA) wrote, 8.9 years ago:

I wrote my own Perl subroutine. The arguments for the subroutine are the shuffled fastq file and a hash reference with options. The options are: numcpus, min_quality (for trimming only), bases_to_trim (max bases that will be trimmed on either side), min_avg_quality, and min_avg_quality_window (window of bases to average together). There are defaults for each of these options. It returns a reference to an array consisting of fastq paired entries (8 lines per array element).
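A sketch in Python of what the per-end trimming options might do (the option names come from the post; the exact semantics of the Perl subroutine are my guess):

```python
def trim_ends(quals, min_quality=20, bases_to_trim=20):
    '''Sketch: chew back at most `bases_to_trim` bases from each end
    while the base quality is below `min_quality`. Returns the
    (start, end) slice of the read to keep. Semantics guessed from
    the option names, not taken from the actual Perl code.'''
    left = 0
    while left < min(bases_to_trim, len(quals)) and quals[left] < min_quality:
        left += 1
    right = len(quals)
    while len(quals) - right < bases_to_trim and right > left \
            and quals[right - 1] < min_quality:
        right -= 1
    return left, right

# Low-quality bases on both ends, a good middle:
print(trim_ends([5, 10, 30, 35, 35, 30, 8, 4]))  # (2, 6)
```

The min_avg_quality/min_avg_quality_window options described above would then act as a windowed filter on whatever survives this end trimming.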

The subroutine relies on two helper subroutines that you can write on your own: logmsg, which prints a status message, and LKUtils::getNumCPUs(), which doesn't matter if you supply the number of CPUs yourself.

So, I can't say that this is the best piece of work, and it lacks a way to take care of singletons like the UC Davis software does. However, it seems to be working, and it is reasonably fast because of multithreading. Taking care of singletons in a large dataset could be a minor issue. Because I am not the best programmer ever, and because I have only tested it a little, any feedback is appreciated.

Powered by Biostar version 2.3.0