I regularly map mate pairs of differing lengths using BWA and have no problem. When reads should be clipped entirely in my pre-alignment processing, I leave the header in, put an N in the sequence, a B in the quality, and move on.
It would look like this:
@ReallyBadSeq
N
+ReallyBadSeq
B
This keeps everything in files with mate pairs in the proper order and ensures that BWA won't try to map low quality reads.
w.r.t PCR duplicates, when I look for them on my own, both reads in the mate pair must have the same start site (implicitly also the same insert size). Requiring both mates to have the same stop site would not work if you're trimming reads, as two PCR duplicates can have different qualities at the 3' end and be trimmed to different lengths.
I rely on the fact that the chances of 2 reads in a mate pair having the same exact start positions is pretty low. There is certainly some percent of the time that my assumption is wrong, but with a frequency that doesn't seem to affect downstream analysis.
Here's a perl script that should do the N/B trick:
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
my $fastq;
my $threshold;
GetOptions
(
"fastq=s" => \$fastq,
"threshold:i" => \$threshold,
);
$threshold ||= 0;
my $outfile = $1 if $fastq =~ m/(.*)\.(fastq|txt)/;
$outfile .= ".mod.txt";
open IN, '<', $fastq or die "Could not open $fastq: $!\n";
open OUT, '>', $outfile;
my $hdr;
my $seq;
my $qual;
while(my $line = <IN>)
{
chomp $line;
$hdr = $1 if $line =~ m/@(.*)/;
$line = <IN>;
chomp $line;
$seq = $line;
$line = <IN>;
$line = <IN>;
chomp $line;
$qual = $line;
##Do trimming, other pre-processing
$seq = "";
print OUT "\@$hdr\n";
if(length($seq) <= $threshold)
{
print OUT "N\n+$hdr\nB\n";
}
else
{
print OUT "$seq\n+$hdr\n$qual\n";
}
}
close IN;
close OUT;
print STDERR "Processed reads can be found in $outfile\n";
@Brent: I have paired-end exome data. I have done fastx_* clipping but getting single end reads, do you consider single-end reads after QC for further analysis or discard it ?
Thanks for the input. I am thinking that removal of duplicates will still be affected though? I currently perform this with Picard MarkDuplicates on my aligned BAM.
@Travis anything downstream will be affected. Not sure how Picard finds duplicates by genomic location so it may still mark as duplicate reads of differing lengths mapping to the same location. Any one know the whether that's the case?
@Khader: I'd also like to know what to do with those. I guess that depends on the application. It's certainly easier to not consider the single-end reads, but you might get more power....
I will be in a similar situation to Khader soon. Does anyone know if BWA disregards the single end reads at the sampe stage?