Whole-Exome Data Cleanup / Filtering
3
2
Entering edit mode
11.6 years ago

What tools do you you use to clean up your whole-exome paired end reads: I am running a FastQC check for various parameters followed by a set of fastx_* commands as follows:

cat fastq_maq_ill2sanger/P1_R1_sanger.fastq fastqc|\
fastx_clipper -Q 33 -l 20 -v -a ACACTCTTTCCCTACACGACGCTCTTCCGATCT |\
fastx_clipper -Q 33 -l 20 -v -a CGGTCTCGGCATTCCTACTGAACCGCTCTTCCGATCT |\
fastx_clipper -Q 33 -l 20 -v -a ATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATC |\
fastx_clipper -Q 33 -l 20 -v -a CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATC |\
fastq_quality_trimmer -Q 33 -t 20 -l 20 -v |\
fastx_artifacts_filter -Q 33 -v |\
fastq_quality_filter -Q 33 -q 20 -p 50 -v -o fastx_out_clean/P1_R1_fastx.fastq;


Then again run the FastQC to make sure the reads are fine.

Is this looks adequate for whole exome paired end reads ? Am I missing any other parameters or options ? This step takes around 2 hours in our server per sample, Is there any faster methods to do the clean up.

exome next-gen sequencing fastx • 4.3k views
0
Entering edit mode

2 hours per read? :-) hope you don't have any deadlines in the next few hundred years.

0
Entering edit mode

Oops. I mean per "sample" :), now corrected in the question. !

0
Entering edit mode

2 CPU hours? I think that is a reasonable speed, at least way faster than mapping.

0
Entering edit mode

It only takes 2 hours for your samples? We do something very similar (fastqc, fastx_clipper, fastq_quality_trimmer, fastqc again), and I've found that for our samples these steps can take up to 3X longer than alignment with BWA. It's on the order of 12 hours per sample for our Hiseq read sets.

3
Entering edit mode
11.6 years ago

We do the same thing but add a pairFixer script. Basically it goes through the pair files twice to fix the pairs. The first time to see all the headers(just the common part) that are seen twice and a second time to create these files: pair1 pair2 singles

I also added a -M12 to the clipper part to tolerate non-full length adapters but maybe 12 is a bit too short.

[HTML][HTML]

# !/usr/bin/perl

use strict;

my $outName =$ARGV[0]; my $pair1 =$ARGV[1]; my $pair2 =$ARGV[2]; my $pair1FH; my$pair2FH; my $pair1outFH; my$pair2outFH; my $pairSingleFH; open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File"; open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File"; my @pairedFiles = ($pair1FH,$pair2FH); my %readCount; my$totalRecords = 0; for my $pairFH (@pairedFiles) { while (my$head = <$pairFH>) { <$pairFH>; <$pairFH>; <$pairFH>; $totalRecords++; my$key = substr($head, 0, length($head)-3); my $value =$readCount{$key}; if(defined$value) { $value++;$readCount{$key} =$value; } else { $readCount{$key} = 1; } } close($pairFH); } open ($pair1FH, '-|', 'zcat', "$pair1") || die "Cannot Open pair 1 File"; open ($pair2FH, '-|', 'zcat', "$pair2") || die "Cannot Open pair 2 File"; @pairedFiles = ($pair1FH,$pair2FH); open ($pair1outFH, ">$outName.pair1.fastq") || die "Cannot Open pair out 1 File"; open ($pair2outFH, ">$outName.pair2.fastq") || die "Cannot Open pair out 2 File"; open ($pairSingleFH, ">$outName.single.fastq") || die "Cannot Open single File"; my @outputFiles = ($pair1outFH,$pair2outFH); my$idx=0; my $records=0; my$pairs=0; my $single=0; print "Copying reads\n"; for my$pairFH (@pairedFiles) { my $outFH =$outputFiles[$idx];$idx++;

while(my $head = <$pairFH>) { my $seq = <$pairFH>; my $head2 = <$pairFH>; my $qual = <$pairFH>;

my $key = substr($head, 0, length($head)-3); my$value = $readCount{$key};
my $fh;$records++;
if($value == 1) {$fh = $pairSingleFH;$single++;
}
elsif($value > 1) {$fh = $outFH;$pairs++;
}
else {
die "No value found for: $key\n"; } print$fh $head; print$fh $seq; print$fh $head2; print$fh $qual;  # print "\rTotal:$totalRecords, Records: $records, Pairs:$pairs, Single: $single "; } close($outFH); close($pairFH); } close($pairSingleFH); print "Total: $totalRecords, Records:$records, Pairs: $pairs, Single:$single\n"; [HTML] [HTML]

1
Entering edit mode

I put it directly in the post. Be aware that it's not optimized in the least and uses ram to keep the readnames + seen count.

There was a similar discussion in the velvet mailing list in which many other tools came out.

0
Entering edit mode

2
Entering edit mode
11.6 years ago
brentp 24k

Haibao Tang has a program that will accept a FASTA file of adaptors and trim those all in one go. It does some sort of local alignment on the adaptors so it will be more thorough tan the fastx_clipper and it can also trim low quality reads.

0
Entering edit mode

Thanks Brent !!

1
Entering edit mode
11.6 years ago

i think those fastx_clipper commands are not pair-friendly - i.e. if one of your sequences is "all adapter" you will orphan the mate and mess up the pairing.

0
Entering edit mode

That's a valid point Jeremy, I am getting "weird pairing" warning while running bwa sampe using fastx_* processed fastq files.

0
Entering edit mode

Actually I do not think it's fastx's fault. What do you get if you just feed the raw reads to bwa?

0
Entering edit mode

@lh3: raw reads runs without problem. I am not able to run bwa sampe (v 0.5.9-r16) using the fastx_ generated fastq files. bwa is keep running for several hours without creating any output. I found this post and tried with an older version (v 0.5.5), then it was running but showing the weird pairing. I tried bwa sampe runs several time, once it ran for about 24 hrs, finally I quit the run. Because of this issue am now re-checking my previous steps. Any thoughts on why bwa sampe not running on fastx_ ?

0
Entering edit mode

I see. If without fastx* bwa runs fine, it should be fastx's problem. I have never used fastx, so do not know what was happening.

0
Entering edit mode

Jeremy: Do you have any suggestion on a program that can deal with the pairing issues here ?