Mitochondrial contamination over 90% in published atac-seq dataset.
0
3
Entering edit mode
4.5 years ago

Hi all,

I know this topic has been touched in the past but I couldn't find an answer to my question. Background: in my lab, we are thinking to move from chip-seq of histone marks to atac-seq. We have data on CD4+ T cells and we would like to compare what we found with an already published dataset in the same cell type (Buenrostro et al. 2013: GEO: GSE47753). In addition, I would use this as a test to see if we have the capabilities of doing an atac-seq analysis.

So I downloaded the 6 sra files regarding the CD4+ T cells, produced the fastq files with relative quality control and aligned to hg19 using bowtie2 ( -X2000 ). I know in the paper they use bowtie instead of bowtie2 but I didn't think that could make a big difference at this step. Looking at the SAM fils after the alignment, I realized that more than 90% of the reads are mapped to mitochondrial DNA leaving me with 50 to 200 thousand reads to do peak calling. I tried to align using bwa mem but no much difference was found. In the paper, they declare to have an average around 46% of reads mapping to mtDNA.

So the questions: what am I doing wrong? Did anyone else find the same thing? Am I misreading the methods of the paper?

Thanks in advance and I'm sorry for such a beginner question.

Daniele.

atac-seq mtDNA bowtie2 mitochondrial contamination • 3.9k views
1
Entering edit mode

How did you calculate the percentage of mtDNA? As I already said in the ATAC-seq community, I quickly checked the first of these T cell samples and it is about 50%.

That is the script I typically use to get the percentage, using SAMtools idxstats (requires indexed BAM):

#!/bin/bash

#### Calculate percentage of reads mapped to mitochondrial genome (mtDNA) using SAMtools idxstats
#### Can be useful for ATAC-seq data. Requires an indexed BAM file:

## Check if index is present. If not, create it:
if [[ ! -e ${1}.bai ]]; then echo '[INFO]: File does not seem to be indexed. Indexing now:' samtools index$i
fi

## Calculate %mtDNA:
mtReads=$(samtools idxstats$1 | grep 'chrM' | cut -f 3)
totalReads=$(samtools idxstats$1 | awk '{SUM += $3} END {print SUM}') echo '==> mtDNA Content:'$(bc <<< "scale=2;100*$mtReads/$totalReads")'%'


Usage: ./check_mtDNA.sh in.bam

0
Entering edit mode

Thanks ATPoint for sharing with me your script, I'll definitely use that in the future. I did no calculation, to be honest, I just filtered the bam file for chrM tags and I was left with less than 10% of the mapped reads. Still don't know what I'm doing wrong with those files, don't know if is a mapping or a filtering issue but I wasn't able to analyse those data. I've spoken with Dr. Corces and he suggested to use different datasets. Thanks again for your time.

0
Entering edit mode

Not trying to defend the published work but if you were trying to reproduce published analysis then first thing you should do is stick with programs/parameters used in that study. Since you have a major variable (bowtie2 can do gapped alignments as opposed to bowtie v.1) it would be improper to question published results at this time.

0
Entering edit mode

Questioning published results wasn't the purpose of my query at all, in fact, I'm pretty sure that the problem is at my end. I used two different aligners obtaining similar results. But you pointed out something solid and I'll use bowtie v.1 to perform the alignments. Thank for your reply.

0
Entering edit mode

Apologies for mistaking your intent. It is not the first time one finds that the published results may not be reproducible or have a significant error in them. If your result is correct then having 90% reads mapping to MT would indeed be worrisome. Since you want to compare your results to theirs you may have to redo the analysis of your data with bowtie v.1, if that looks more like what is published.

0
Entering edit mode

Thank again, I will post the results of the new analysis.

0
Entering edit mode

The new analysis using bowtie instead of bowtie2 gave a bit better results with an average MT mapping around 80%. Still high, but is the best I could get and I don't really know where I find this discrepancy in the data. I will call the peaks anyway and see what happens. Do you have any further suggestion? Thanks for helping!