Paired-End Sequence Preprocessing Effects On Downstream Processes
2
3
Entering edit mode
11.7 years ago
Travis ★ 2.8k

Hi all,

I am wondering if pre-processing of paired end reads (for example to trim low quality ends & filter resulting sequences beneath a given length threshold) will have knock on-effects on the downstream processes?

I am currently using BWA for alignment of paired-end Illumina reads followed by SAMTools mpileup. Someone recommended I include some pre-processing steps and I am worried that this may cause difficulties due to pairs of differing length or one member of a pair being removed. Another worry is that it could affect the removal of duplicates due to duplicate pairs being altered.

Can anyone share their thoughts/experience in the area and perhaps help in directing my considerations?

next-gen sequencing alignment paired • 6.2k views
4
Entering edit mode
11.7 years ago
brentp 24k

Usually, the disparate read lengths between a pair will not matter. However, most aligners expect the reads to be in the same order so if reads are removed from one file and not another, there will be a problem.

You can filter the ends independently and then discard/separate reads that do not appear in both. Here is a simple and naive implementation of a script that uses fastx_toolkit to trim paired end reads by quality and adaptors and keep only those that remain paired.

2
Entering edit mode

@Brent: I have paired-end exome data. I have done fastx_* clipping but getting single end reads, do you consider single-end reads after QC for further analysis or discard it ?

1
Entering edit mode

Thanks for the input. I am thinking that removal of duplicates will still be affected though? I currently perform this with Picard MarkDuplicates on my aligned BAM.

0
Entering edit mode

@Travis anything downstream will be affected. Not sure how Picard finds duplicates by genomic location so it may still mark as duplicate reads of differing lengths mapping to the same location. Any one know the whether that's the case?

0
Entering edit mode

@Khader: I'd also like to know what to do with those. I guess that depends on the application. It's certainly easier to not consider the single-end reads, but you might get more power....

0
Entering edit mode

I will be in a similar situation to Khader soon. Does anyone know if BWA disregards the single end reads at the sampe stage?

2
Entering edit mode
11.7 years ago
Mitch Bekritsky ★ 1.3k

I regularly map mate pairs of differing lengths using BWA and have no problem. When reads should be clipped entirely in my pre-alignment processing, I leave the header in, put an N in the sequence, a B in the quality, and move on.

It would look like this:

N
B

This keeps everything in files with mate pairs in the proper order and ensures that BWA won't try to map low quality reads.

w.r.t PCR duplicates, when I look for them on my own, both reads in the mate pair must have the same start site (implicitly also the same insert size). Requiring both mates to have the same stop site would not work if you're trimming reads, as two PCR duplicates can have different qualities at the 3' end and be trimmed to different lengths.

I rely on the fact that the chances of 2 reads in a mate pair having the same exact start positions is pretty low. There is certainly some percent of the time that my assumption is wrong, but with a frequency that doesn't seem to affect downstream analysis.

Here's a perl script that should do the N/B trick:

#!/usr/bin/perl

use strict;
use warnings;
use Getopt::Long;

my $fastq; my$threshold;

GetOptions
(
"fastq=s"   => \$fastq, "threshold:i" => \$threshold,
);

$threshold ||= 0; my$outfile = $1 if$fastq =~ m/(.*)\.(fastq|txt)/;
$outfile .= ".mod.txt"; open IN, '<',$fastq or die "Could not open $fastq:$!\n";
open OUT, '>', $outfile; my$hdr;
my $seq; my$qual;

while(my $line = <IN>) { chomp$line;
$hdr =$1 if $line =~ m/@(.*)/;$line = <IN>;
chomp $line;$seq = $line;$line = <IN>;
$line = <IN>; chomp$line;
$qual =$line;

##Do trimming, other pre-processing

$seq = ""; print OUT "\@$hdr\n";
if(length($seq) <=$threshold)
{
print OUT "N\n+$hdr\nB\n"; } else { print OUT "$seq\n+$hdr\n$qual\n";
}
}

close IN;
close OUT;

print STDERR "Processed reads can be found in $outfile\n";  ADD COMMENT 1 Entering edit mode Here you go brentp. My original code was in C++ and did a lot of other pre-processing, so I stripped it down to just the N/B trick and switched it to perl. You should be able to do any trimming you want in the while loop, then if the read is too short, it'll print a placeholder. The only requirement for this script to do that is that the sequence must have a length shorter then$threshold. Feel free to change that to whatever you prefer.

1
Entering edit mode

@Travis: depending on what pre-processing you want to do, you may want to just go straight to BWA. If all you're looking to do is trim low quality sequence, then BWA can do that with the -q flag. I do a lot of pre-processing on my files, but I'm tailoring mine to a particular project in my lab.

0
Entering edit mode

Good idea. Do you have a script that does this?

0
Entering edit mode

Upvoted for code. Though "##Do trimming, other pre-processing" is much of the rub.

0
Entering edit mode

It would seem that preprocessing introduces some need for downstream DIY then... I am wondering also then, how acceptable is it to skip the preprocessing and just go straight to the BWA stage? Any thoughts?

0
Entering edit mode

Aha...I misunderstood...I thought you already did the trimming and other pre-processing and were looking for how to maintain the order between two files. What kind of pre-processing are you interested in?

0
Entering edit mode

Thanks Mitch. Herein lies my problem. I don't have a real project yet and at the minute I'm trying to teach myself analysis and cover all bases. Not an easy task :) Your comments have been great.

0
Entering edit mode

It's my pleasure to help! I will say that when I pre-process my reads, most of what I do that's not project-specific is just trimming BWA-style. Which (I think) means that once the concentration of low-quality sequence calls gets too high, it trims the rest of the read. Of course, BWA does this for you, so if that's all you'd like to do, you're in luck--less work for you ;)