I aligned a set of reads to the C. elegans genome. The alignment rates were around 80%, except for two samples, which came in at about 40%. I BLASTed the unaligned reads and they seem to come from Drosophila (we have no idea why). I aligned the samples again, this time to the Drosophila genome, and those two samples got an alignment rate of around 40% there as well. Because the sample size is small, I have been considering discarding the unmapped reads instead of discarding the whole samples. I assume a normalization like TMM could reduce the noise caused by the reduced counts, and if the PCA clusters make sense, I would use the data in downstream analysis. Any opinions on this? Should I just discard those samples?
I would be very skeptical of the reads unless you figure out why there was so much contamination from an exogenous organism. Was your sequencing run shared with anyone else? Perhaps there could have been a mixup with barcodes or something.
We do have Drosophila samples that we sent to the same place for sequencing, so mislabeling was the initial suspicion. So I tried aligning the samples to the Drosophila genome: the clean samples had less than 1% alignment and the dirty samples had ~40%. I also aligned the Drosophila samples to the C. elegans genome and that was less than 1% too, with ~94% alignment to Drosophila. Very impressive from HISAT2, I guess. So I think it is more likely that the samples actually got mixed somehow.
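For what it's worth, here is a rough Python sketch of how those cross-alignment rates could be turned into a per-sample label. The function name and the `high`/`low` cutoffs are made up for illustration, and the rates are just the approximate figures quoted above (with "<1%" entered as 0.01), so treat it as a sanity check rather than a validated classifier.

```python
def classify_sample(rate_celegans, rate_drosophila, high=0.7, low=0.05):
    """Label a sample from its alignment rate to each genome (fractions 0-1).

    The `high` and `low` cutoffs are hypothetical and only chosen to
    separate the approximate rates reported in this thread.
    """
    if rate_celegans >= high and rate_drosophila <= low:
        return "C. elegans"
    if rate_drosophila >= high and rate_celegans <= low:
        return "Drosophila"
    if rate_celegans > low and rate_drosophila > low:
        return "mixed"
    return "unclear"


# Approximate rates from the thread: (to C. elegans, to Drosophila).
samples = {
    "clean C. elegans": (0.80, 0.01),
    "suspect": (0.40, 0.40),
    "Drosophila": (0.01, 0.94),
}

labels = {name: classify_sample(*rates) for name, rates in samples.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

A "mixed" label only says that both genomes are well represented in the library; it does not say where the mixing happened.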
Sounds like there was definitely some sample mixup or problem somewhere. I wouldn't be confident in the reads you did recover from the samples, because there is no guarantee the labels are correct for those.
Your best bet would be to talk to the sequencing provider and also go back and see if there was any problem during sample collection.
If you are sure that your original samples were actually from C. elegans then you can ask your sequencing provider to re-make and resequence the libraries. Or at least check to make sure nothing amiss happened on their end.
I guess one possibility, if you are sure you have both C. elegans and Drosophila, would be to combine the two reference genomes, align all the reads to this 'hybrid', and see whether they partition cleanly between the two genomes. The alignment rate you have already looked at should improve, which would serve as an indicator. I would consider that the safest way of being able to use the reads. Then it depends on what analysis is planned downstream.
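To make the partitioning step concrete, here is a minimal Python sketch that tallies primary alignments per species from SAM text. It assumes the contigs were renamed with a species prefix when the two FASTA files were concatenated (the `cel_` and `dme_` prefixes are hypothetical; some renaming is likely needed anyway, since both genomes have an X chromosome). It reads plain SAM lines so that it needs no extra libraries, and the example records are invented just to exercise the function.

```python
from collections import Counter

# Hypothetical prefixes added to contig names when building the hybrid
# reference, e.g. "cel_I" for C. elegans and "dme_2L" for Drosophila.
SPECIES_PREFIXES = {"cel_": "C. elegans", "dme_": "Drosophila"}


def species_fractions(sam_lines, prefixes=SPECIES_PREFIXES):
    """Return the fraction of reads assigned to each species.

    Counts one record per read (mates of a pair are counted separately),
    skipping header lines and secondary/supplementary alignments.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # blank line or SAM header
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 0x100 or flag & 0x800:
            continue  # secondary or supplementary alignment
        if flag & 0x4 or rname == "*":
            counts["unmapped"] += 1
            continue
        for prefix, species in prefixes.items():
            if rname.startswith(prefix):
                counts[species] += 1
                break
        else:
            counts["other"] += 1  # contig without a known prefix
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()} if total else {}


# Invented records, only to show the expected input shape.
example_sam = [
    "@SQ\tSN:cel_I\tLN:1000",
    "r1\t0\tcel_I\t10\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tdme_2L\t20\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r2\t256\tcel_I\t30\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "r4\t0\tcel_II\t40\t60\t4M\t*\t0\t0\tACGT\tIIII",
]

fractions = species_fractions(example_sam)
print(fractions)
```

In practice you would stream the aligner's SAM output into the function instead of a list. If a suspect sample splits roughly evenly between the two prefixes while the clean samples sit almost entirely on one, that supports the mixing explanation.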