Question: Whole-Exome Data Cleanup / Filtering
2
gravatar for Khader Shameer
8.5 years ago by
Manhattan, NY
Khader Shameer18k wrote:

What tools do you you use to clean up your whole-exome paired end reads: I am running a FastQC check for various parameters followed by a set of fastx_* commands as follows:

cat fastq_maq_ill2sanger/P1_R1_sanger.fastq fastqc|\
fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT |\
fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT |\
fastx_clipper -Q 33 -l 20 -v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC |\
fastx_clipper -Q 33 -l 20 -v -a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC |\
fastq_quality_trimmer -Q 33 -t 20 -l 20 -v |\
fastx_artifacts_filter -Q 33 -v |\
fastq_quality_filter -Q 33 -q 20 -p 50 -v -o fastx_out_clean/P1_R1_fastx.fastq;

Then again run the FastQC to make sure the reads are fine.

Is this looks adequate for whole exome paired end reads ? Am I missing any other parameters or options ? This step takes around 2 hours in our server per sample, Is there any faster methods to do the clean up.

exome next-gen fastx sequencing • 3.5k views
ADD COMMENTlink modified 3.4 years ago by Biostar ♦♦ 20 • written 8.5 years ago by Khader Shameer18k

2 hours per read? :-) hope you don't have any deadlines in the next few hundred years.

ADD REPLYlink written 8.5 years ago by brentp23k

Oops. I mean per "sample" :), now corrected in the question. !

ADD REPLYlink written 8.5 years ago by Khader Shameer18k

2 CPU hours? I think that is a reasonable speed, at least way faster than mapping.

ADD REPLYlink written 8.5 years ago by lh331k

It only takes 2 hours for your samples? We do something very similar (fastqc, fastx_clipper, fastq_quality_trimmer, fastqc again), and I've found that for our samples these steps can take up to 3X longer than alignment with BWA. It's on the order of 12 hours per sample for our Hiseq read sets.

ADD REPLYlink written 8.4 years ago by Jeremy Schwartzentruber20
3
gravatar for Louis Letourneau
8.4 years ago by
Montreal
Louis Letourneau800 wrote:

We do the same thing but add a pairFixer script. Basically it goes through the pair files twice to fix the pairs. The first time to see all the headers(just the common part) that are seen twice and a second time to create these files: pair1 pair2 singles

I also added a -M12 to the clipper part to tolerate non-full length adapters but maybe 12 is a bit too short.

[HTML][HTML]

!/usr/bin/perl

use strict;

my $outName = $ARGV[0]; my $pair1 = $ARGV[1]; my $pair2 = $ARGV[2]; my $pair1FH; my $pair2FH; my $pair1outFH; my $pair2outFH; my $pairSingleFH;

open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File"; open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File"; my @pairedFiles = ($pair1FH,$pair2FH);

my %readCount; my $totalRecords = 0; for my $pairFH (@pairedFiles) { while (my $head = <$pairFH>) { <$pairFH>; <$pairFH>; <$pairFH>; $totalRecords++; my $key = substr($head, 0, length($head)-3); my $value = $readCount{$key}; if(defined $value) { $value++; $readCount{$key} = $value; } else { $readCount{$key} = 1; } } close($pairFH); }

open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File"; open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File"; @pairedFiles = ($pair1FH,$pair2FH); open ($pair1outFH, ">$outName.pair1.fastq") || die "Cannot Open pair out 1 File"; open ($pair2outFH, ">$outName.pair2.fastq") || die "Cannot Open pair out 2 File"; open ($pairSingleFH, ">$outName.single.fastq") || die "Cannot Open single File"; my @outputFiles = ($pair1outFH,$pair2outFH);

my $idx=0; my $records=0; my $pairs=0; my $single=0; print "Copying reads\n"; for my $pairFH (@pairedFiles) { my $outFH = $outputFiles[$idx]; $idx++;

while(my $head = <$pairFH>) { my $seq = <$pairFH>; my $head2 = <$pairFH>; my $qual = <$pairFH>;

my $key = substr($head, 0, length($head)-3);
my $value = $readCount{$key};
my $fh;
$records++;
if($value == 1) {
  $fh = $pairSingleFH;
  $single++;
}
elsif($value > 1) {
  $fh = $outFH;
  $pairs++;
}
else {
  die "No value found for: $key\n";
}

print $fh $head;
print $fh $seq;
print $fh $head2;
print $fh $qual;

print "\rTotal: $totalRecords, Records: $records, Pairs: $pairs, Single: $single ";

} close($outFH); close($pairFH); } close($pairSingleFH); print "Total: $totalRecords, Records: $records, Pairs: $pairs, Single: $single\n"; [HTML] [HTML]

ADD COMMENTlink modified 8.4 years ago • written 8.4 years ago by Louis Letourneau800
1

I put it directly in the post. Be aware that it's not optimized in the least and uses ram to keep the readnames + seen count.

There was a similar discussion in the velvet mailing list in which many other tools came out.

ADD REPLYlink modified 21 days ago by RamRS24k • written 8.4 years ago by Louis Letourneau800

Thanks Louis. Could you please post a link to pairFixer script.

ADD REPLYlink written 8.4 years ago by Khader Shameer18k
2
gravatar for brentp
8.5 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

Haibao Tang has a program that will accept a FASTA file of adaptors and trim those all in one go. It does some sort of local alignment on the adaptors so it will be more thorough tan the fastx_clipper and it can also trim low quality reads.

ADD COMMENTlink written 8.5 years ago by brentp23k

Thanks Brent !!

ADD REPLYlink written 8.5 years ago by Khader Shameer18k
1
gravatar for Jeremy Leipzig
8.5 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

i think those fastx_clipper commands are not pair-friendly - i.e. if one of your sequences is "all adapter" you will orphan the mate and mess up the pairing.

ADD COMMENTlink written 8.5 years ago by Jeremy Leipzig18k

That's a valid point Jeremy, I am getting "weird pairing" warning while running bwa sampe using fastx_* processed fastq files.

ADD REPLYlink written 8.5 years ago by Khader Shameer18k

Actually I do not think it's fastx's fault. What do you get if you just feed the raw reads to bwa?

ADD REPLYlink written 8.5 years ago by lh331k

@lh3: raw reads runs without problem. I am not able to run bwa sampe (v 0.5.9-r16) using the fastx_ generated fastq files. bwa is keep running for several hours without creating any output. I found this post and tried with an older version (v 0.5.5), then it was running but showing the weird pairing. I tried bwa sampe runs several time, once it ran for about 24 hrs, finally I quit the run. Because of this issue am now re-checking my previous steps. Any thoughts on why bwa sampe not running on fastx_ ?

ADD REPLYlink written 8.5 years ago by Khader Shameer18k

I see. If without fastx* bwa runs fine, it should be fastx's problem. I have never used fastx, so do not know what was happening.

ADD REPLYlink written 8.5 years ago by lh331k

Jeremy: Do you have any suggestion on a program that can deal with the pairing issues here ?

ADD REPLYlink written 8.4 years ago by Khader Shameer18k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2255 users visited in the last hour