I have a de novo transcriptome for an ant species, roughly 100k transcripts assembled by Trinity. I quantified expression with Sailfish and ran some analyses to identify genes with significant expression patterns.
Then I became aware of DeconSeq, a program for removing contaminant sequences. Running it removed ~5k of my transcripts, and spot-checking showed they all BLAST to bacteria or human. All good so far. I then re-ran Sailfish on the "cleaned" transcriptome of about 99k transcripts. My naive expectation was that this would barely affect the results: the reads that previously mapped to the "contaminants" simply shouldn't map at all. Instead, I find substantial changes, with about twice as many genes showing significant expression patterns. I checked the unmapped-read fraction, and it is essentially the same for mappings against the two transcriptome files.
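My current suspicion is that some reads map ambiguously to both a contaminant and a real transcript: remove the contaminant, and those reads get handed entirely to the real transcript, changing its estimate without changing the unmapped fraction. Here is a toy sketch of that reallocation (a uniform split among compatible targets, not Sailfish's actual EM model; all transcript names are made up):

```python
# Toy illustration only (NOT Sailfish's algorithm): each read's unit of
# weight is split evenly across the transcripts it maps to. Removing a
# contaminant target shifts the shared weight onto the remaining target.

def allocate(read_maps, transcripts):
    """Split each read's weight uniformly across its targets that are
    still present in `transcripts`."""
    counts = {t: 0.0 for t in transcripts}
    for hits in read_maps:
        present = [t for t in hits if t in transcripts]
        if not present:
            continue  # a read only goes unmapped if ALL its targets are gone
        for t in present:
            counts[t] += 1.0 / len(present)
    return counts

# 10 reads hit only the real transcript; 10 hit both it and a contaminant.
reads = [["geneA"]] * 10 + [["geneA", "contam1"]] * 10

before = allocate(reads, {"geneA", "contam1"})  # geneA gets 15.0
after = allocate(reads, {"geneA"})              # geneA gets 20.0
print(before, after)
```

In this toy case the estimated count for geneA jumps from 15 to 20 after removing the contaminant, yet every read still maps, which matches the unchanged unmapped ratio I observed.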
So my conundrum is: how should I deal with reads that map to known contaminants?
a) Map reads to the complete transcriptome, contaminants included, then discard the known contaminants afterwards. Here I risk incorrectly assigning reads to the contaminants and losing real signal (false negatives), analogous to a Type I (producer's) error in acceptance-sampling terms.
b) Map reads to the "cleaned" transcriptome. This seems analogous to a Type II (consumer's) error: I risk calling expression changes significant when they are actually driven by 'contaminant' reads incorrectly mapping to true transcripts (false positives).
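For what it's worth, option (a) would just be a post-hoc filter on the quantification table before the differential-expression step. A minimal sketch (the transcript IDs, TPM values, and contaminant set below are invented for illustration):

```python
# Hypothetical post-quantification filter for option (a): quantify against
# the full transcriptome, then drop known-contaminant IDs before DE analysis.

def filter_contaminants(quant, contaminant_ids):
    """Return only the quant entries whose transcript ID is not a
    known contaminant."""
    return {tid: tpm for tid, tpm in quant.items()
            if tid not in contaminant_ids}

# Toy quant table: transcript ID -> TPM (made-up values)
quant = {"TRINITY_DN1_c0_g1_i1": 12.3,
         "TRINITY_DN2_c0_g1_i1": 0.8,
         "contam_001": 55.0}

kept = filter_contaminants(quant, {"contam_001"})
print(sorted(kept))
```

The reads assigned to the dropped contaminants are simply thrown away, which is exactly the information-loss risk described in (a).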
Any thoughts appreciated!
cross-posted on Sailfish user group here