Question: Paired-End Sequence Preprocessing Effects On Downstream Processes
gravatar for Travis
8.9 years ago by
Travis2.8k wrote:

Hi all,

I am wondering if pre-processing of paired end reads (for example to trim low quality ends & filter resulting sequences beneath a given length threshold) will have knock on-effects on the downstream processes?

I am currently using BWA for alignment of paired-end Illumina reads followed by SAMTools mpileup. Someone recommended I include some pre-processing steps and I am worried that this may cause difficulties due to pairs of differing length or one member of a pair being removed. Another worry is that it could affect the removal of duplicates due to duplicate pairs being altered.

Can anyone share their thoughts/experience in the area and perhaps help in directing my considerations?

ADD COMMENTlink written 8.9 years ago by Travis2.8k
gravatar for brentp
8.9 years ago by
Salt Lake City, UT
brentp23k wrote:

Usually, the disparate read lengths between a pair will not matter. However, most aligners expect the reads to be in the same order so if reads are removed from one file and not another, there will be a problem.

You can filter the ends independently and then discard/separate reads that do not appear in both. Here is a simple and naive implementation of a script that uses fastx_toolkit to trim paired end reads by quality and adaptors and keep only those that remain paired.

ADD COMMENTlink written 8.9 years ago by brentp23k

@Brent: I have paired-end exome data. I have done fastx_* clipping but getting single end reads, do you consider single-end reads after QC for further analysis or discard it ?

ADD REPLYlink written 8.9 years ago by Khader Shameer18k

Thanks for the input. I am thinking that removal of duplicates will still be affected though? I currently perform this with Picard MarkDuplicates on my aligned BAM.

ADD REPLYlink written 8.9 years ago by Travis2.8k

@Travis anything downstream will be affected. Not sure how Picard finds duplicates by genomic location so it may still mark as duplicate reads of differing lengths mapping to the same location. Any one know the whether that's the case?

ADD REPLYlink written 8.9 years ago by brentp23k

@Khader: I'd also like to know what to do with those. I guess that depends on the application. It's certainly easier to not consider the single-end reads, but you might get more power....

ADD REPLYlink written 8.9 years ago by brentp23k

I will be in a similar situation to Khader soon. Does anyone know if BWA disregards the single end reads at the sampe stage?

ADD REPLYlink written 8.9 years ago by Travis2.8k
gravatar for Mitch Bekritsky
8.9 years ago by
Mitch Bekritsky1.2k
London, England
Mitch Bekritsky1.2k wrote:

I regularly map mate pairs of differing lengths using BWA and have no problem. When reads should be clipped entirely in my pre-alignment processing, I leave the header in, put an N in the sequence, a B in the quality, and move on.

It would look like this:


This keeps everything in files with mate pairs in the proper order and ensures that BWA won't try to map low quality reads.

w.r.t PCR duplicates, when I look for them on my own, both reads in the mate pair must have the same start site (implicitly also the same insert size). Requiring both mates to have the same stop site would not work if you're trimming reads, as two PCR duplicates can have different qualities at the 3' end and be trimmed to different lengths.

I rely on the fact that the chances of 2 reads in a mate pair having the same exact start positions is pretty low. There is certainly some percent of the time that my assumption is wrong, but with a frequency that doesn't seem to affect downstream analysis.

Here's a perl script that should do the N/B trick:


use strict;
use warnings;
use Getopt::Long;

my $fastq;
my $threshold;

    "fastq=s"   => \$fastq,
    "threshold:i"   => \$threshold,

$threshold ||= 0;

my $outfile = $1 if $fastq =~ m/(.*)\.(fastq|txt)/;
$outfile .= ".mod.txt";

open IN, '<', $fastq or die "Could not open $fastq: $!\n";
open OUT, '>', $outfile;

my $hdr;
my $seq;
my $qual;

while(my $line = <IN>)
    chomp $line;
    $hdr = $1 if $line =~ m/@(.*)/;

    $line = <IN>;
    chomp $line;
    $seq = $line;

    $line = <IN>;
    $line = <IN>;
    chomp $line;
    $qual = $line;

    ##Do trimming, other pre-processing

    $seq = "";

    print OUT "\@$hdr\n";
    if(length($seq) <= $threshold)
            print OUT "N\n+$hdr\nB\n";
            print OUT "$seq\n+$hdr\n$qual\n";

close IN;
close OUT;

print STDERR "Processed reads can be found in $outfile\n";
ADD COMMENTlink modified 8.9 years ago • written 8.9 years ago by Mitch Bekritsky1.2k

Here you go brentp. My original code was in C++ and did a lot of other pre-processing, so I stripped it down to just the N/B trick and switched it to perl.

You should be able to do any trimming you want in the while loop, then if the read is too short, it'll print a placeholder.

The only requirement for this script to do that is that the sequence must have a length shorter then $threshold. Feel free to change that to whatever you prefer.

ADD REPLYlink written 8.9 years ago by Mitch Bekritsky1.2k

@Travis: depending on what pre-processing you want to do, you may want to just go straight to BWA. If all you're looking to do is trim low quality sequence, then BWA can do that with the -q flag. I do a lot of pre-processing on my files, but I'm tailoring mine to a particular project in my lab.

ADD REPLYlink written 8.9 years ago by Mitch Bekritsky1.2k

Good idea. Do you have a script that does this?

ADD REPLYlink written 8.9 years ago by brentp23k

Upvoted for code. Though "##Do trimming, other pre-processing" is much of the rub.

ADD REPLYlink written 8.9 years ago by brentp23k

It would seem that preprocessing introduces some need for downstream DIY then... I am wondering also then, how acceptable is it to skip the preprocessing and just go straight to the BWA stage? Any thoughts?

ADD REPLYlink written 8.9 years ago by Travis2.8k

Aha...I misunderstood...I thought you already did the trimming and other pre-processing and were looking for how to maintain the order between two files. What kind of pre-processing are you interested in?

ADD REPLYlink written 8.9 years ago by Mitch Bekritsky1.2k

Thanks Mitch. Herein lies my problem. I don't have a real project yet and at the minute I'm trying to teach myself analysis and cover all bases. Not an easy task :) Your comments have been great.

ADD REPLYlink written 8.9 years ago by Travis2.8k

It's my pleasure to help! I will say that when I pre-process my reads, most of what I do that's not project-specific is just trimming BWA-style. Which (I think) means that once the concentration of low-quality sequence calls gets too high, it trims the rest of the read. Of course, BWA does this for you, so if that's all you'd like to do, you're in luck--less work for you ;)

ADD REPLYlink written 8.9 years ago by Mitch Bekritsky1.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2251 users visited in the last hour