I have a question that I have been unable to answer myself for several weeks now.
I would like to add a new QC analysis to my pipeline (ONT data) that would be able to both:
- detect whether our sequences contain any contaminants (bacteria? virus? human DNA?)
- detect whether the sequences belong to the species we expected to sequence (for example, if the sequenced DNA is from a European perch, the expected result would be that my reads map to a fish reference genome)
The approach I can think of is to compare my reads to a set of reference genomes. My idea was to pick, for each category, one genome that is "common" enough (I know that doesn't really mean anything) for my reads to map well to it. For example, I would use one virus genome to detect any kind of viral contamination, and likewise one human genome, one bacterial genome, one fish genome, etc. I don't want to use a gigantic database containing a bit of everything, because it would be too heavy to align against and because I don't need that level of precision.
I know this is far from the best approach, because there is no such thing as a reference genome for all viruses or for all bacteria. But I thought that for my simple purpose (I am not trying to identify the contaminant exactly, and I don't want to retrieve the contaminated reads either), it could be good enough. However, I struggle to know which reference genomes I should use for optimal results. Escherichia coli for bacteria? Drosophila melanogaster for insects? The other option I was thinking of is to create one hybrid FASTA file for each category I would like to detect, each containing 4 or 5 different species from that category (5 insects for insects, 5 bacteria for bacteria...).
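To make the hybrid FASTA idea more concrete, here is a rough Python sketch of what I have in mind. It is only an illustration under several assumptions that are not settled yet: that the reads are aligned with a long-read mapper such as minimap2 producing PAF output, that each reference header is prefixed with its category and a `|`, and that a MAPQ cutoff of 20 is a reasonable filter. The function names `tag_fasta_headers` and `tally_categories` are made up for this example.

```python
from collections import Counter


def tag_fasta_headers(lines, category):
    """Prefix every FASTA header with 'category|' so a hit can be traced
    back to its category (e.g. '>fish|chr1'). Sequence lines are unchanged."""
    for line in lines:
        if line.startswith(">"):
            yield ">" + category + "|" + line[1:]
        else:
            yield line


def tally_categories(paf_lines, min_mapq=20):
    """Count reads per category from PAF lines.

    PAF columns used: 1 = read name, 6 = target name, 12 = mapping quality.
    Each read is counted once, for its highest-MAPQ alignment.
    min_mapq=20 is an arbitrary assumption, not a recommended value.
    """
    best = {}  # read name -> (mapq, category)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        read, target, mapq = fields[0], fields[5], int(fields[11])
        if mapq < min_mapq:
            continue
        # Category is whatever precedes the first '|' in the target name.
        category = target.split("|", 1)[0]
        if read not in best or mapq > best[read][0]:
            best[read] = (mapq, category)
    return Counter(category for _, category in best.values())
```

The tagged FASTA files for all categories would be concatenated into one reference, and the per-category counts, compared with the total number of reads, would give the rough "expected species vs. contaminant" proportions I am after. Reads with no alignment at all would simply not appear in the tally.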
What do you think of these ideas? Do you see any major drawbacks that would make this approach unsuitable for the analysis I'm trying to do? Do you have any other suggestions I haven't thought of? Thanks a lot for your advice!