Question

Contaminant Filtering Large Fastq Data Sets

11

Entering edit mode

12.5 years ago

ALchEmiXt ★ 1.9k

When sequencing samples one should of course try avoid any possible contaminant before hand. No doubt. But occasionally it is just impossible to have a clean starting sample; so other sample material is contaminating the sequence data you are really interested in.

In our specific case, where we do de novo and resequencing of pathogenic bacteria or viruses, it simply cannot be avoided that the datasets contain a certain amount of contaminating host DNA (or RNA for RNAseq) we are not interested in (usually these pathogens are sourced after cultivation in or on cells, or the clinical sample contains (lysed) cells).

So we took a rather simple approach trying to filter our large datasets we obtain regularly; (illumina over 40-60M sequences per sample) by aligning the individual reads to some reference genomes of the possible contamination. Similar to the approach of the fastq_screen tool from Babraham labs. There they use bowtie to align reads to reference genomes or contaminant databases like adapter sequences and/or PhiX.

In our specific case we thought it worked quite well: just bowtie mapping (default parameters) all our fastq data to contaminant genomes and adapters seqs. Determine the hits and just filter those from the resulting fastq files before subsequent processing. All scripts there and working great...

BUT further inspection of the resultant filtered datasets showed still a significant amount of abberantly non-filtered data. Checking out these sequences it turned out they either mapped over point zero of normally circular phage (PhiX) or bacterial sequences so the hit was not measured significant by the artificial split in the mapping, but others had over 86% identity over the analysed lengths of 100-75bp.

We figured the bowtie mapping is probably too strict to use for this kind of filtering..

QUESTION (finally):

1) How do or would you deal with this kind of contaminant filtering? We do FIRST filter loq quality and adapters and such.

2) Which algorithm and why? (cutoffs?)

We thought of BLAST, but for this many sequences it is resource expensive while parsing the outcomes is troublesome as well. In addition, those tools require conversions to fasta...which can be done but is just again an additional (unwanted) conversion step.

Would BLAT or MUMmer be the way to go and what type of cut-offs? (I understand that generic cut-offs are dificult and probably too project/question specific, just an estimated guess).

Ideally I would like to avoid resource expensive conversion (like fastq qulalityscore grooming) as well.

I have found similar questions on biostar but they either don't have the answer or they wanted to filter sequences of unsequenced species. The best matching question was here but doesn't answer our question either.

Thanks for any pointers!

next-gen sequencing filter fastq mapping • 12k views

ADD COMMENT • link updated 12.5 years ago by Bach ▴ 550 • written 12.5 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Hi, assuming that you align your reads to genome of a host (if there is a known genome), did you try to use BWA as aligner? BWA allows gaps in alignment.

Ilia

ADD REPLY • link 12.5 years ago by Zhidkov ▴ 600

0

Entering edit mode

@zhidkov, no not yet. I thought to first have a poll at biostar before trying all different aligners and feel lucky. But bwa is a good candidate since it would allow gapped alignments contrasting bowtie.

ADD REPLY • link 12.5 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Additional approach that might help is generation of your "contamination" data during your QC: contigs from de-novo assembly, recognized by you/your_team as contamination can be used in future as "contamination references for filtering".

ADD REPLY • link 12.5 years ago by Zhidkov ▴ 600

0

Entering edit mode

Even though I accepted BaCH's answer. Please let me know your thoughts. Thus far his approach seems most renewing and in line of our previous thoughts. The HMM approach as suggested by Peter seems to involve quite some resources for the established DB....but we'll check it out anyway. Thanks all.

ADD REPLY • link 12.5 years ago by ALchEmiXt ★ 1.9k

score 3 · Answer 1 · 2011-10-12

3

Entering edit mode

12.5 years ago

pmenzel ▴ 310

Maybe some taxonomic classification software works for you, like Phymm (http://www.cbcb.umd.edu/software/phymm/), which compares reads against ncbi bacterial reference genomes.

ADD COMMENT • link 12.5 years ago by pmenzel ▴ 310

score 3 · Answer 2 · 2011-10-12

3

Entering edit mode

12.5 years ago

Bach ▴ 550

Here's what I do: 1) define what the contaminants are and 2) k-mer filtering of reads with long k-mers (31 or 32) against those contaminants. The threshold can be chosen at will, but in the most stringent setting, any k-mer from a read that hits a k-mer from the contaminant leads to the read being filtered out.

On the one hand the k-mer size should be chosen long enough to be more or less unique for the combined set of contaminants & your target data, on the other hand it should not be too long because of possible sequencing errors. The lower 30's have worked quite well for me, contaminants that stay in the data set usually have lots of sequencing errors and are normally discarded by NGS assembly software.

I use 'mirabait' for that job: http://mira-assembler.sourceforge.net/docs/DefinitiveGuideToMIRA.html#sect_mutils_bait

Disclaimer: I wrote mirabait and find it useful, your mileage may vary.

ADD COMMENT • link 12.5 years ago by Bach ▴ 550

0

Entering edit mode

I like this other approach. Was thinking of k-mers before since we generally use it for QC issues to look at general overrepresentation/errors in seq datasets, but had no idea yet how to go forward with it yet. this might be a good oppertunity. Sligthly depends on how easy/accurate parameters can be tuned. How resource intensive is this approach?

ADD REPLY • link 12.5 years ago by ALchEmiXt ★ 1.9k

0

Entering edit mode

Hi Bach, I was hoping to remove human host sequence contamination from my large (8.84 GB) illumina sequencing file for a poxviridae virus. I am using mira for assembly, but I first require a reduction in sequence coverage and contaminants. I noticed that mirabait does not delete both pairs (R1 and R2) if only one matches, and I am concerned to then have mismatched pair-ended reads in my final file. Does mira have a script for removing unpaired reads? Thanks!

ADD REPLY • link 8.2 years ago by kathleen.mcleod14 ▴ 10