Question

ATAC-seq - remove mt reads using KMC

3.1 years ago

maxrwjones ▴ 60

Hi all,

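For ATAC-seq data analysis, a key step is to remove reads mapping to the mitochondrial genome (and the chloroplast genome for plants). These can make up a considerable fraction of the library because organellar DNA is relatively naked, i.e. not wrapped in nucleosomes, and so is readily cut by the transposase.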

It seems the majority of pipelines include mitochondrial (+/- chloroplast) sequence in the reference genome, build an index, then map all the reads to this reference. Subsequently, reads mapping to 'chrM' and 'chrC' (or however they are named in the index) are removed, e.g. by piping samtools view into grep -v or some purpose-built tool.
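As a rough illustration of the post-alignment filter described above, here is a minimal Python sketch that drops SAM records by reference name. The function name `drop_organelle_reads` is made up for this example, and the contig names `chrM` and `chrC` are assumptions that must match whatever names appear in your index.

```python
# Hypothetical contig names; replace with the names used in your reference index.
ORGANELLE_CONTIGS = {"chrM", "chrC"}


def drop_organelle_reads(sam_lines, organelle=ORGANELLE_CONTIGS):
    """Yield SAM lines, skipping alignments whose RNAME is an organelle contig."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.split("\t")
        # Column 3 (index 2) of a SAM record is RNAME, the reference name.
        if len(fields) > 2 and fields[2] in organelle:
            continue
        yield line


# Tiny in-memory example: one header line, one nuclear read, one chrM read.
example = [
    "@SQ\tSN:chr1\tLN:100\n",
    "r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t0\tchrM\t5\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
]
kept = list(drop_organelle_reads(example))
```

Unlike a plain `grep -v chrM`, this checks only the RNAME column, so a nuclear read whose mate maps to chrM is kept; whether that is desirable for paired-end data is a separate decision.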

However, a bioinformatician at my institute has recommended I try a k-mer based approach to removing mitochondrial or plastidial reads. This would involve using KMC to build a k-mer database (e.g. 31-mer) of the mt and cp genomes, then using kmc_tools filter to identify and remove raw reads (before mapping) that contain mt or cp-derived k-mers.

This seems like it would work, and would likely reduce the computation required for the later mapping stage. However, it might eliminate reads that map optimally to mt or cp sequences that have become integrated into the nuclear genome. It will also require some optimisation of k-mer length and a decision on how many k-mers should be present in a read before it is discarded.
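To make the two tuning knobs concrete (k-mer length and the number of matching k-mers needed to discard a read), here is a toy Python sketch of the idea. It is not KMC and does not reproduce how KMC or kmc_tools work internally; all function names, the toy sequence, and the values `k=5` and `min_hits=3` are assumptions chosen for illustration. It also ignores k-mers that span the origin of a circular genome.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_organelle_index(sequences, k):
    """Collect k-mers from both strands of each organelle sequence."""
    index = set()
    for seq in sequences:
        seq = seq.upper()
        index |= kmers(seq, k)
        index |= kmers(reverse_complement(seq), k)
    return index


def count_hits(read, index, k):
    """Count positions in the read whose k-mer occurs in the index."""
    read = read.upper()
    return sum(1 for i in range(len(read) - k + 1) if read[i:i + k] in index)


def is_organelle_read(read, index, k, min_hits):
    """Flag a read for removal if it has at least min_hits indexed k-mers."""
    return count_hits(read, index, k) >= min_hits


# Toy data: a 20 bp stand-in for an organelle genome and k = 5.
toy_mt = "ACGTTGCATGCCGATAGCTA"
index = build_organelle_index([toy_mt], k=5)
mt_read = toy_mt[3:15]          # read drawn from the toy organelle sequence
nuclear_read = "TTTTTTTTTTTT"   # shares no 5-mer with the toy sequence
flag_mt = is_organelle_read(mt_read, index, k=5, min_hits=3)
flag_nuclear = is_organelle_read(nuclear_read, index, k=5, min_hits=3)
```

A larger `k` or a higher `min_hits` removes fewer nuclear reads by accident but lets more divergent or error-containing organelle reads through, which is the trade-off mentioned above.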

Given the wealth of data that will come out of a genome-wide chromatin accessibility analysis, losing reads from organelle-derived insertions in the nuclear genome may not be a huge loss. Nonetheless, I wanted to get some additional opinions on whether the k-mer approach sounds feasible and whether it is preferable to filtering after alignment.

(I'm fairly new to bioinformatics and don't really have a gut feeling for which approach will offer the best balance of sensitivity and specificity, i.e. the fewest mt/cp reads erroneously kept and the fewest nuclear reads erroneously removed.)

samtools chrM ATAC-seq mitochondria KMC • 1.5k views


I would focus my effort on the downstream analysis and the science behind it rather than reinventing the wheel for a task that is trivial to solve with existing tools such as samtools, which you already mentioned.

ADD REPLY • link 3.1 years ago by ATpoint 82k

Thanks for your answer. That was in some ways what I hoped to hear! Do you reckon that the differences between the two methods would be minor?

ADD REPLY • link 3.1 years ago by maxrwjones ▴ 60

I do not actually see any reason to try the method. Alignment will assign reads to the organelle genomes if they originate from them; you then simply remove those reads. That is both established and simple, so why bother with anything else? I doubt the other method improves things in a way that you would notice in the bigger picture.