For ATAC-seq data analysis, a key step is to remove reads mapping to the mitochondrial genome (and the chloroplast genome for plants). These reads can make up a considerable fraction of the library because organellar DNA is relatively naked, i.e. not packaged into chromatin.
It seems the majority of pipelines include mitochondrial (+/- chloroplast) sequence in the reference genome, build an index, then map all the reads to this reference. Subsequently, reads mapping to 'chrM' and 'chrC' (or however they are named in the index) are removed, e.g. by piping samtools view into grep -v or some purpose-built tool.
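To make the post-mapping route concrete, here is a minimal Python sketch of the filtering step. It assumes SAM text as input (e.g. from samtools view -h) and that the organelle contigs are called 'chrM' and 'chrC'; the function name and the demo records are made up for illustration. It checks only the reference-name column, whereas a plain grep -v on the whole line would also drop any record that mentions the contig name in another field, such as the mate reference.

```python
# Assumed contig names; change these to whatever your index uses.
ORGANELLE_CONTIGS = {"chrM", "chrC"}


def drop_organelle_records(sam_lines, organelle_contigs=ORGANELLE_CONTIGS):
    """Yield header lines and alignments whose RNAME is not an organelle contig."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 of a SAM record (index 2) is RNAME, the reference name.
        if len(fields) > 2 and fields[2] in organelle_contigs:
            continue
        yield line


# Tiny in-memory demo with invented records.
demo = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t0\tchrM\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
]
kept = list(drop_organelle_records(demo))
```

In this sketch each record is judged on its own, so for paired-end data a pair with only one mate on an organelle contig would be split; a real pipeline would need to decide how to handle that case.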
However, a bioinformatician at my institute has recommended I try a k-mer based approach to removing mitochondrial or plastidial reads. This would involve using KMC to build a k-mer database (e.g. 31-mer) of the mt and cp genomes, then using kmc_tools filter to identify and remove raw reads (before mapping) that contain mt or cp-derived k-mers.
This seems like it would work, and would likely reduce the computation required for the later mapping stage. However, it might eliminate reads that map optimally to mt or cp sequences that have become integrated into the nuclear genome. It will also require some optimisation of k-mer length and a decision on how many k-mers should be present in a read before it is discarded.
Given the wealth of data that will come out of a genome-wide chromatin accessibility analysis, discarding reads from nuclear-integrated mt or cp sequences may not be a huge loss. Nonetheless, I wanted to get some additional opinions on whether this approach sounds feasible and whether it is preferable to filtering after mapping.
(I'm fairly new to bioinformatics and don't really have a gut feeling for which approach will offer the best balance of sensitivity and specificity, i.e. the fewest mt/cp reads erroneously kept and the fewest nuclear reads erroneously removed.)