Whole-Exome Data Cleanup / Filtering
3
2
Entering edit mode
13.0 years ago

What tools do you you use to clean up your whole-exome paired end reads: I am running a FastQC check for various parameters followed by a set of fastx_* commands as follows:

cat fastq_maq_ill2sanger/P1_R1_sanger.fastq fastqc|\
fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT |\
fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT |\
fastx_clipper -Q 33 -l 20 -v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC |\
fastx_clipper -Q 33 -l 20 -v -a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC |\
fastq_quality_trimmer -Q 33 -t 20 -l 20 -v |\
fastx_artifacts_filter -Q 33 -v |\
fastq_quality_filter -Q 33 -q 20 -p 50 -v -o fastx_out_clean/P1_R1_fastx.fastq;

Then again run the FastQC to make sure the reads are fine.

Is this looks adequate for whole exome paired end reads ? Am I missing any other parameters or options ? This step takes around 2 hours in our server per sample, Is there any faster methods to do the clean up.

exome next-gen sequencing fastx • 5.1k views
ADD COMMENT
0
Entering edit mode

2 hours per read? :-) hope you don't have any deadlines in the next few hundred years.

ADD REPLY
0
Entering edit mode

Oops. I mean per "sample" :), now corrected in the question. !

ADD REPLY
0
Entering edit mode

2 CPU hours? I think that is a reasonable speed, at least way faster than mapping.

ADD REPLY
0
Entering edit mode

It only takes 2 hours for your samples? We do something very similar (fastqc, fastx_clipper, fastq_quality_trimmer, fastqc again), and I've found that for our samples these steps can take up to 3X longer than alignment with BWA. It's on the order of 12 hours per sample for our Hiseq read sets.

ADD REPLY
3
Entering edit mode
13.0 years ago

We do the same thing but add a pairFixer script. Basically it goes through the pair files twice to fix the pairs. The first time to see all the headers(just the common part) that are seen twice and a second time to create these files: pair1 pair2 singles

I also added a -M12 to the clipper part to tolerate non-full length adapters but maybe 12 is a bit too short.

[HTML][HTML]

!/usr/bin/perl

use strict;

my $outName = $ARGV[0]; my $pair1 = $ARGV[1]; my $pair2 = $ARGV[2]; my $pair1FH; my $pair2FH; my $pair1outFH; my $pair2outFH; my $pairSingleFH;

open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File"; open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File"; my @pairedFiles = ($pair1FH,$pair2FH);

my %readCount; my $totalRecords = 0; for my $pairFH (@pairedFiles) { while (my $head = <$pairFH>) { <$pairFH>; <$pairFH>; <$pairFH>; $totalRecords++; my $key = substr($head, 0, length($head)-3); my $value = $readCount{$key}; if(defined $value) { $value++; $readCount{$key} = $value; } else { $readCount{$key} = 1; } } close($pairFH); }

open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File"; open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File"; @pairedFiles = ($pair1FH,$pair2FH); open ($pair1outFH, ">$outName.pair1.fastq") || die "Cannot Open pair out 1 File"; open ($pair2outFH, ">$outName.pair2.fastq") || die "Cannot Open pair out 2 File"; open ($pairSingleFH, ">$outName.single.fastq") || die "Cannot Open single File"; my @outputFiles = ($pair1outFH,$pair2outFH);

my $idx=0; my $records=0; my $pairs=0; my $single=0; print "Copying reads\n"; for my $pairFH (@pairedFiles) { my $outFH = $outputFiles[$idx]; $idx++;

while(my $head = <$pairFH>) { my $seq = <$pairFH>; my $head2 = <$pairFH>; my $qual = <$pairFH>;

my $key = substr($head, 0, length($head)-3);
my $value = $readCount{$key};
my $fh;
$records++;
if($value == 1) {
  $fh = $pairSingleFH;
  $single++;
}
elsif($value > 1) {
  $fh = $outFH;
  $pairs++;
}
else {
  die "No value found for: $key\n";
}

print $fh $head;
print $fh $seq;
print $fh $head2;
print $fh $qual;

print "\rTotal: $totalRecords, Records: $records, Pairs: $pairs, Single: $single ";

} close($outFH); close($pairFH); } close($pairSingleFH); print "Total: $totalRecords, Records: $records, Pairs: $pairs, Single: $single\n"; [HTML] [HTML]

ADD COMMENT
1
Entering edit mode

I put it directly in the post. Be aware that it's not optimized in the least and uses ram to keep the readnames + seen count.

There was a similar discussion in the velvet mailing list in which many other tools came out.

ADD REPLY
0
Entering edit mode

Thanks Louis. Could you please post a link to pairFixer script.

ADD REPLY
2
Entering edit mode
13.0 years ago
brentp 24k

Haibao Tang has a program that will accept a FASTA file of adaptors and trim those all in one go. It does some sort of local alignment on the adaptors so it will be more thorough tan the fastx_clipper and it can also trim low quality reads.

ADD COMMENT
0
Entering edit mode

Thanks Brent !!

ADD REPLY
1
Entering edit mode
13.0 years ago

i think those fastx_clipper commands are not pair-friendly - i.e. if one of your sequences is "all adapter" you will orphan the mate and mess up the pairing.

ADD COMMENT
0
Entering edit mode

That's a valid point Jeremy, I am getting "weird pairing" warning while running bwa sampe using fastx_* processed fastq files.

ADD REPLY
0
Entering edit mode

Actually I do not think it's fastx's fault. What do you get if you just feed the raw reads to bwa?

ADD REPLY
0
Entering edit mode

@lh3: raw reads runs without problem. I am not able to run bwa sampe (v 0.5.9-r16) using the fastx_ generated fastq files. bwa is keep running for several hours without creating any output. I found this post and tried with an older version (v 0.5.5), then it was running but showing the weird pairing. I tried bwa sampe runs several time, once it ran for about 24 hrs, finally I quit the run. Because of this issue am now re-checking my previous steps. Any thoughts on why bwa sampe not running on fastx_ ?

ADD REPLY
0
Entering edit mode

I see. If without fastx* bwa runs fine, it should be fastx's problem. I have never used fastx, so do not know what was happening.

ADD REPLY
0
Entering edit mode

Jeremy: Do you have any suggestion on a program that can deal with the pairing issues here ?

ADD REPLY

Login before adding your answer.

Traffic: 2145 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6