I have exome sequencing data for oral rinse samples tested for hereditary cancers. I am looking for potential non-human (cow or pig) contaminants. What would be the best methodology to detect the contaminants?

What I have done so far:

1. Align the FASTQs to the human reference (bwa-mem).
2. Use samtools to extract the unmapped reads from the aligned BAM files.
3. Build an index of the unmapped reads, which are assumed to be potential contaminants. [samtools view -u -f 12 -F 256] (both mates unmapped)
4. Map the unmapped reads to the cow and pig references (bwa-mem).
5. Extract the mapped reads from this alignment.
6. Confirm those that map exclusively to one reference. [Quality checks and coverages are calculated, reads with MQ 8 are considered for analysis]

Any suggestions are appreciated.
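
For reference, the flag filter in step 3 can be spelled out bit by bit. The sketch below is only an illustration of what `-f 12 -F 256` selects, written in Python; the function name `passes_filter` is made up for this example, and the bit values are the standard SAM FLAG bits.

```python
# SAM FLAG bits relevant to `samtools view -f 12 -F 256`
READ_UNMAPPED = 0x4    # 4: this read is unmapped
MATE_UNMAPPED = 0x8    # 8: its mate is unmapped
SECONDARY = 0x100      # 256: secondary alignment


def passes_filter(flag, require=READ_UNMAPPED | MATE_UNMAPPED, exclude=SECONDARY):
    """Return True if a record's FLAG would be kept by -f <require> -F <exclude>.

    -f keeps records with ALL of the required bits set;
    -F drops records with ANY of the excluded bits set.
    """
    return (flag & require) == require and (flag & exclude) == 0


if __name__ == "__main__":
    for flag in (77, 141, 73, 69, 99):
        print(flag, passes_filter(flag))
```

Flags 77 and 141 (both mates unmapped) pass, while 73 and 69 (only one mate unmapped) do not, so pairs in which a single mate aligned to human are left out by this filter.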
An alternative could be to bin the reads that map to pig/cow (or to human) using BBSplit from the BBMap package. This tool is designed for this specific application.
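
To make the idea of binning concrete, here is a minimal Python sketch of the "maps exclusively to one reference" check from step 6 of the question. It is not how BBSplit works internally; it only assumes you have already collected, per reference, the best mapping quality of each read. The function name `bin_reads`, the input layout, and the interpretation of the threshold as MAPQ >= 8 are all assumptions made for this example.

```python
def bin_reads(mapq_by_ref, min_mapq=8):
    """Assign each read to a single reference, or mark it ambiguous.

    mapq_by_ref: {reference_name: {read_name: best MAPQ on that reference}}
    min_mapq:    assumed cutoff; alignments below it are ignored.

    Returns {read_name: reference_name or "ambiguous"}.
    Reads with no alignment passing the cutoff are omitted.
    """
    hits = {}
    for ref, mapqs in mapq_by_ref.items():
        for read, mq in mapqs.items():
            if mq >= min_mapq:
                hits.setdefault(read, set()).add(ref)

    bins = {}
    for read, refs in hits.items():
        # Exactly one passing reference: exclusive hit; otherwise ambiguous.
        bins[read] = next(iter(refs)) if len(refs) == 1 else "ambiguous"
    return bins


if __name__ == "__main__":
    # Toy values, not real data.
    example = {
        "cow": {"r1": 60, "r2": 30, "r3": 3},
        "pig": {"r2": 40, "r3": 20, "r4": 0},
    }
    print(bin_reads(example))
```

Cow and pig share a lot of conserved sequence, so the ambiguous bin can be large; it is worth reporting its size alongside the exclusive counts rather than discarding it silently.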