Just wondering whether you guys have come across or developed any process to check for contamination in PacBio reads. By a contamination check I mean aligning/BLASTing the PacBio reads against some reference (nt, 16S, etc.) to see whether a proportion of reads hits something unexpected, pointing to possible contamination.
Since PacBio data has inherent indel errors, standard BLAST penalizes the alignments heavily (because of the many gaps it has to introduce), which could result in reads not aligning at all.
It would be nice to know what you guys are doing for contamination checks on PacBio long reads.
Hi, every once in a while I run into contamination problems. I created a blasr index of almost all prokaryotes and just aligned to that (using blasr). The only caveat is that I never updated blasr to allow indexing beyond 32 bits, so the largest database you can use is about 4 GB. NCBI's set of prokaryotes is larger than 4 GB, so I pruned it down by removing redundant genomes from the same strain. If you are looking for human contamination, you can just align to human in one go.
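In case it helps, here is a rough Python sketch of the pruning step, not the script I actually used. It keeps one FASTA record per group and stops adding records once the total would exceed 2^32 bases, which is what I am assuming the 32-bit limit works out to. The `group_key` function is hypothetical: it guesses the organism from headers that look like `accession Genus species ...`, so swap in whatever "same strain" means for your headers. It also keeps only the first record per group, so multi-contig genomes and plasmids would need smarter grouping.

```python
import io

# Assumption: a 32-bit index can address at most 2**32 positions.
MAX_BASES = 2**32


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_key(header):
    """Hypothetical grouping key: the two words after the accession.

    Assumes headers like 'NC_000913.3 Escherichia coli str. K-12 ...'.
    """
    words = header.split()
    return " ".join(words[1:3]).lower()


def prune(records, key=group_key, max_bases=MAX_BASES):
    """Keep the first record per group while the total stays within max_bases."""
    seen, kept, total = set(), [], 0
    for header, seq in records:
        k = key(header)
        if k in seen:
            continue  # already have a representative for this group
        if total + len(seq) > max_bases:
            continue  # would push the database past the size limit
        seen.add(k)
        kept.append((header, seq))
        total += len(seq)
    return kept, total


# Tiny made-up example; the accessions and sequences are placeholders.
demo = io.StringIO(
    ">acc1 Escherichia coli str. K-12\nACGT\nAC\n"
    ">acc2 Escherichia coli O157:H7\nACGTAC\n"
    ">acc3 Bacillus subtilis 168\nGGCC\n"
)
kept, total = prune(read_fasta(demo))
for header, seq in kept:
    print(header, len(seq))
print("total bases:", total)
```

On a real database you would pass an open file to `read_fasta` and write the kept records back out as FASTA before building the index. Reading whole genomes into memory is fine for a sketch, but for the full NCBI set you would probably want to stream the records instead.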