When sequencing samples, one should of course try to avoid any possible contaminant beforehand. No doubt. But occasionally it is simply impossible to start from a clean sample, and other sample material then contaminates the sequence data you are actually interested in.
In our specific case, de novo sequencing and resequencing of pathogenic bacteria or viruses, it simply cannot be avoided that the datasets contain a certain amount of contaminating host DNA (or RNA for RNA-seq) that we are not interested in: these pathogens are usually obtained after cultivation in or on cells, or the clinical sample contains (lysed) cells.
So we took a rather simple approach to filtering the large datasets we obtain regularly (Illumina, 40-60M sequences per sample): align the individual reads to reference genomes of the possible contaminants. This is similar to the approach of the fastq_screen tool from the Babraham Institute, which uses bowtie to align reads against reference genomes or contaminant databases such as adapter sequences and/or PhiX.
At first we thought it worked quite well: bowtie-map (default parameters) all our fastq data against the contaminant genomes and adapter sequences, determine the hits, and simply remove those reads from the resulting fastq files before subsequent processing. All scripts were in place and working great...
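The filtering step itself is straightforward; a minimal sketch (not our exact script) of the idea, assuming `hit_ids` has already been collected from the aligner's output (e.g. the mapped records in a SAM file), could look like this:

```python
def fastq_records(lines):
    """Yield (header, seq, plus, qual) tuples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        yield header, next(it), next(it), next(it)

def filter_contaminants(fastq_lines, hit_ids):
    """Keep only reads whose ID is NOT in the contaminant hit set."""
    kept = []
    for header, seq, plus, qual in fastq_records(fastq_lines):
        read_id = header[1:].split()[0]  # '@read1 extra' -> 'read1'
        if read_id not in hit_ids:
            kept.extend([header, seq, plus, qual])
    return kept

# toy example: read2 hit a contaminant reference, so it is removed
fastq = ["@read1", "ACGT", "+", "IIII",
         "@read2", "TTTT", "+", "IIII"]
print(filter_contaminants(fastq, {"read2"}))
# -> ['@read1', 'ACGT', '+', 'IIII']
```

Any aligner that reports which reads mapped can feed the same subtraction step.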
BUT further inspection of the resulting filtered datasets still showed a significant amount of aberrantly non-filtered data. Checking these sequences, it turned out that some mapped across the origin of normally circular phage (PhiX) or bacterial sequences, so the hit was missed because of the artificial break in the linearized reference, while others had over 86% identity over the analysed read lengths of 75-100 bp.
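The origin problem at least has a simple workaround: before building the index, extend the circular reference by appending its first (max read length - 1) bases to the end, so reads spanning the origin still align contiguously. A sketch of that idea:

```python
def rotate_circular_reference(ref_seq, max_read_len):
    """Append the first (max_read_len - 1) bases of a circular reference
    to its end, so origin-spanning reads can align without being split."""
    return ref_seq + ref_seq[:max_read_len - 1]

# toy circular "genome" of 12 bp, reads up to 5 bp long
genome = "TTTTCCCCGGGG"
extended = rotate_circular_reference(genome, 5)

# a read crossing the origin (last 2 bp + first 3 bp) misses the
# linear reference but is found in the extended one
origin_read = genome[10:] + genome[:3]      # "GGTTT"
assert origin_read not in genome
assert origin_read in extended
```

Reads then need their coordinates mapped back modulo the genome length, but for pure contaminant subtraction the hit itself is all that matters.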
We figured the bowtie mapping is probably too strict to use for this kind of filtering.
1) How do or would you deal with this kind of contaminant filtering? We do FIRST filter low-quality reads, adapters and such.
2) Which algorithm and why? (cutoffs?)
We thought of BLAST, but for this many sequences it is resource-expensive, and parsing the output is troublesome as well. In addition, such tools require conversion to fasta, which can be done but is yet another (unwanted) conversion step.
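To be fair, the fasta conversion itself is cheap if done as a streaming step (no quality-score grooming involved); a minimal sketch:

```python
def fastq_to_fasta(fastq_lines):
    """Stream FASTQ to FASTA, discarding the '+' and quality lines."""
    it = iter(fastq_lines)
    for header in it:
        seq = next(it)
        next(it)  # '+' separator line
        next(it)  # quality line
        yield ">" + header[1:]
        yield seq

print(list(fastq_to_fasta(["@read1 lane1", "ACGT", "+", "IIII"])))
# -> ['>read1 lane1', 'ACGT']
```

It is the extra pass over tens of millions of reads, plus the BLAST run itself, that makes this route unattractive, not the conversion logic.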
Would BLAT or MUMmer be the way to go, and what type of cut-offs? (I understand that generic cut-offs are difficult and probably too project/question-specific; just an educated guess would help.)
Ideally I would also like to avoid resource-expensive conversions (like fastq quality-score grooming).
I have found similar questions on Biostars, but they either don't have the answer or concern filtering sequences of unsequenced species. The best-matching question was here, but it doesn't answer our question either.
Thanks for any pointers!
Hi, assuming that you align your reads to the genome of the host (if a known genome is available), did you try BWA as the aligner? BWA allows gapped alignments.
@zhidkov, no, not yet. I thought I'd first poll Biostars before trying all the different aligners and hoping to get lucky. But BWA is a good candidate, since it would allow gapped alignments, in contrast to bowtie.
An additional approach that might help is generating your own "contamination" data during QC: contigs from de novo assembly that you or your team recognize as contamination can be used in future as contamination references for filtering.
Even though I accepted BaCH's answer, please let me know your thoughts. Thus far his approach seems the most novel and in line with our previous thinking. The HMM approach suggested by Peter seems to require quite some resources for building the database... but we'll check it out anyway. Thanks all.