Originally I was using allprep, which is a simple Python script, but I found that I was still getting some adapter contamination afterwards.
I then found out about TagDust, which did a great job of removing the adapters but splits the output into two files: a 'clean' file and an 'artifact' file. That leads me to my first set of questions: do people usually work off the 'clean' file, or mask the adapter in the 'artifact' file and merge it back in? I am using 50 bp reads, so even with the adapters removed I should still have quite a bit of sequence left. If I mask the adapters and merge the files, should I set a length cutoff in case a read ends up too short?
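The mask-and-merge idea above can be sketched in a few lines. This is just an illustration, not TagDust's actual behavior: the adapter sequence and the minimum-length cutoff below are assumptions.

```python
# Sketch: trim a read at a known adapter and drop it if it becomes too
# short. ADAPTER and MIN_LEN are hypothetical values, not from TagDust.
ADAPTER = "AGATCGGAAGAGC"   # assumed Illumina adapter prefix
MIN_LEN = 20                # assumed cutoff for 50 bp reads

def mask_and_filter(seq, qual, adapter=ADAPTER, min_len=MIN_LEN):
    """Trim the read at the first adapter occurrence; return None if
    the remaining sequence is shorter than min_len."""
    i = seq.find(adapter)
    if i != -1:
        seq, qual = seq[:i], qual[:i]
    if len(seq) < min_len:
        return None
    return seq, qual
```

A length cutoff like this is the usual safeguard when merging masked reads back in, since a read that is mostly adapter carries almost no mappable sequence.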
Next I have to de-barcode the files. I've heard many good things about the FASTX-Toolkit since it can match barcodes with up to one mismatch, but for paired-end reads the read pairing is off afterwards (I actually tried to feed the output into BWA and it gave many "fail to infer insert size" messages). brentp has code posted to fix this, and I also found another program to re-pair, but it seems overly complicated, so now I am slowly reading the code to figure out the differences between the two methods.
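The core of any re-pairing step is the same: keep only reads whose mate also survived barcode splitting, matching on the read name with the /1 or /2 suffix stripped. A minimal sketch of that idea (not brentp's code; records are simplified to (name, seq, qual) tuples):

```python
# Sketch of re-pairing two de-multiplexed read sets by read ID.
# Read names are assumed to look like "machine:lane:tile:x:y/1".
def base_id(name):
    """Strip the /1 or /2 mate suffix and any trailing comment."""
    return name.split("/")[0].split()[0]

def repair(fwd_records, rev_records):
    """Return the forward/reverse records whose mate is present,
    preserving input order so the two files stay in sync."""
    common = {base_id(n) for n, _, _ in fwd_records} & \
             {base_id(n) for n, _, _ in rev_records}
    keep = lambda recs: [r for r in recs if base_id(r[0]) in common]
    return keep(fwd_records), keep(rev_records)
```

Keeping both outputs in the original order is what restores the one-to-one line correspondence that bwa sampe expects.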
While we are on the subject of preprocessing, I noticed that the FASTX-Toolkit has a remove-duplicates feature, which I thought might be nice to run before alignment since after aligning I would run MarkDuplicates anyway. I found some code (thanks again, brentp) to output FASTQ instead of FASTA, but ideally you would want to remove a read only if both the forward and reverse reads were identical, right? That way the whole fragment would be identical (a PCR duplicate).
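The "both mates identical" criterion can be sketched by keying on the pair of sequences rather than on a single read. This is an illustration of the idea only, not the FASTX collapser or brentp's code:

```python
# Sketch: pre-alignment PCR-duplicate removal that requires BOTH mates
# to match, keying on the (forward, reverse) sequence pair.
def dedup_pairs(pairs):
    """pairs: iterable of (fwd_seq, rev_seq) tuples.
    Keeps the first occurrence of each identical fragment."""
    seen = set()
    out = []
    for fwd, rev in pairs:
        key = (fwd, rev)
        if key not in seen:
            seen.add(key)
            out.append((fwd, rev))
    return out
```

Keying on a single mate would also discard distinct fragments that merely share one end, which is why the pairwise key matters.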
I was also wondering if it is appropriate to do the quality filtering at this step, using something like this or this, since I usually use the -q 20 option in bwa. I am currently using reads generated from the Illumina 1.5 pipeline and I run bwa with the -I flag. Does anyone know if the -q 20 or -I option would be sufficient to remove the poor-quality reads mentioned here (Biostar Q)?
Here are the final bwa lines:
bwa aln -t 4 -q 20 -I -B 6 ../human-assembly/human_g1k_v37.fasta ./run1_0h-1.fq > ./tmp1a.sai
bwa aln -t 4 -q 20 -I -B 6 ../human-assembly/human_g1k_v37.fasta ./run1_0h-2.fq > ./tmp1b.sai
bwa sampe ../human-assembly/human_g1k_v37.fasta tmp1a.sai tmp1b.sai ./run1_0h-1.fq ./run1_0h-2.fq | samtools view -Shu - | samtools sort - ./run1_0h_q20
So the overall preprocessing pipeline is:
remove adapter (tagdust)
de-multiplex (fastx barcode splitter)
remove duplicates? (fastx collapser)
change quality value/ filter on quality?
redo pairing (brentp's code)
Am I missing anything?
Since I posted this question a couple of years ago, I figure I should share some insight into what I am doing now. Sequences with adapters and barcodes included are assumed not to map well, so instead of removing or trimming them a priori I set a mapping-quality filter. Duplicates are removed after mapping using either MarkDuplicates from Picard or a quality filter for non-uniquely mapping reads from BWA (using something like samtools view -F 1548).
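For anyone wondering where 1548 comes from: it is the sum of four SAM flag bits, and samtools view -F 1548 drops any read with at least one of them set. A small sketch of the decoding:

```python
# Sketch: the SAM flag bits that make up the -F 1548 filter mask.
FLAG_BITS = {
    0x4:   "read unmapped",
    0x8:   "mate unmapped",
    0x200: "read fails vendor quality checks",
    0x400: "read is a PCR or optical duplicate",
}
assert sum(FLAG_BITS) == 1548  # 4 + 8 + 512 + 1024

def dropped_by_F1548(flag):
    """True if `samtools view -F 1548` would exclude this read."""
    return bool(flag & 1548)
```

So a single pass with -F 1548 removes unmapped reads, reads with unmapped mates, QC failures, and marked duplicates in one go.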