Question: Just need to learn to do one thing: counting the reads that map to the human mitochondrial genome
0
gravatar for viottimanuel
4.8 years ago by
United States
viottimanuel0 wrote:

This is my scenario: I am doing whole genome amplification from human cells, and then doing NGS with an Illumina platform. I get bam files from the machine. I am trying to find out how many times the mitochondrial DNA is represented in each sample. I know that mitochondrial DNA gets amplified during my whole genome amplification (I confirmed with the manufacturer of the kit). I've gotten to the point where I downloaded IGV and can map the reads to the human mitochondrial genome. I can see the mapped reads. But I cannot figure out how to get a specific value, i.e. a count of the specific number of mapped reads in each sample. The idea is to be able to tell if one sample had more mitochondrial DNA than another sample. 

I am working on a Windows 7 64-bit computer. (But could have access to a mac as well)

Thanks so much any help is greatly appreciated!

ADD COMMENTlink modified 4.8 years ago by Alex Reynolds30k • written 4.8 years ago by viottimanuel0
1

"""can map the reads to the human mitochondrial genome"""

Hope you are mapping reads to other parts of the genome as well?  Meaning: whole (human?) genome. Because if all what you got are BAM files with reads mapped to just 16kb MT, the number of reads mapped will not tell you much about number of MT per cell. For that you will need a number of reads mapped to all chromosomes and to MT.  

Also: Which mapper?

ADD REPLYlink written 4.8 years ago by Darked894.2k
Agreed. You want to do copy number analyses and it helps a lot when you can relate the mitochodrial counts to the other chromosomes. For example you could calculate the log-ratio of chrM VS rest of the genome and use that as your test statistic
ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Irsan7.2k

Yes, the Illumina-sponsored software I am using gives me an absolute number of reads mapped to autosomes. So I was going to divide the number of reads to mitochondrial DNA by the number I get for autosomal mapped reads. The Mapper is called Bluefuse. I can map all my reads to the human genome, but I don't have the option to map to the mitochondrial genome. That would have been easiest.  

ADD REPLYlink written 4.8 years ago by viottimanuel0

"I don't have the option to map to the mitochondrial genome"

Because that would be pointless (unless your question is: did I got mitochondrial DNA in my sample?). The input FASTQ files differ in the total number of reads and in their quality. The two numbers: MT_mapped_in_A vs MT_mapped_in_B (assuming that taking 16kb out of 3bilons bases will not screw up the mappings). So you can have A tissue rich with mitochondria but with low DNA yield and really crappy sequencing vs B tissue having lower number of MT per cell, but on which the sun was shining the day(s) it was prepared and sequenced. So you got MT_mapped_in_B > MT_mapped_in_A. Not so great, no?

ADD REPLYlink written 4.8 years ago by Darked894.2k

I am trying to understand the point you are making. Would it be ok to do the comparison if I use BAM files only? For example SampleA_Reads in MT/SampleA_Reads in Human Genome VERSUS SampleB_Reads in MT/SampleB_Reads in Human Genome. If the first is a larger number than the second, I can assume there is more mtDNA in the first sample, correct? 

ADD REPLYlink written 4.8 years ago by viottimanuel0

Yes, you are right, this will give you the idea about the numbers of MT/cell in samples. If you want to get exact (well..) numbers, you will have to take into account the size of nuclear genome (2x3000Mb or rather 2x what is the non-MT contigs length in your BAM header SQ lines) vs MT (16kb). It is a bit more complicated (look for Genomic alignment and mappability), but better lets solve one issue at the time. 

MT sequence may have different names depending which genome version you are using. Are you able to load any of your BAM files (preferably the smallest) in IGV on your Windows, and check how chromosomes/MT sequences are named? Meaning i.e: chr1  vs 1, and chrMT vs MT.

ADD REPLYlink written 4.8 years ago by Darked894.2k

Probably booting your PC using LiveCD with Ubuntu/Lubuntu Linux, installing samtools using Synaptic (or apt-get) and processing your BAM files i.e. located on some USB drive will be hopefully the fastest and easiest route. Just check if BAMs you got out of Bluefuse  have been sorted and indexed (files with extension .bai). If not, give us the total number of BAMs and their average size. Because using some social skills and getting access to some Linux server/workstation makes more sense than fighting with terabyte of results on i.e 5 years old laptop for days on end.  

In any case, google/check once on Linux with samtools installed:  

samtools view -H your_A.bam | less

samtools idxstats  your_A.bam | less

(it will give you the text output for all contigs/chromosomes).

You will need a bit of command-line scripting for processing say more than 10 BAMs and few installed on Linux utilities (grep, awk) to grab the numbers from the samtools output. 

ADD REPLYlink written 4.8 years ago by Darked894.2k

I checked, and from Bluefuse I three files for each sample: .fastq, .bam and .bam.bai

Form your comment above, I gather that using the .bam.bai would be best

ADD REPLYlink written 4.8 years ago by viottimanuel0

You need both. Usage: samtools idxstats <in.bam>

Samtools will look for in.bam.bai See:

http://www.htslib.org/doc/samtools.html

ADD REPLYlink written 4.8 years ago by Darked894.2k
0
gravatar for Alex Reynolds
4.8 years ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

Given a sorted BED file called extents.bed containing the genomic bounds of chromosomes for a genomic build (here's an example for hg19), you can use BEDOPS bedmap --count with --chrom chrM to get the number of reads over the mitochrondrial DNA:

$ bedmap --count --chrom chrM extents.bed <(bam2bed < readsA.bam) > answerA.txt

You could repeat this for each sample BAM file and compare answers:

$ bedmap --count --chrom chrM extents.bed <(bam2bed < readsB.bam) > answerB.txt
$ bedmap --count --chrom chrM extents.bed <(bam2bed < readsC.bam) > answerC.txt
...

Once you have all the answer-files, you can sort them by counts into a final table:

$ awk '{ print FILENAME"\t"$1; }' answer*.txt | sort -n -k2 -r - > finalAnswer.txt

Doing work on the command line is pretty powerful and if you are a beginner, I recommend taking an hour or two to learn some of the basics here. There are very few turnkey applications that solve problems, but a practically infinite number of ways to solve things with the command line.

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Alex Reynolds30k

Thanks Alex, but being a newbie this sounds a little technical. What program do I need to download to do what you are suggesting? 

ADD REPLYlink written 4.8 years ago by viottimanuel0
1

The jump you are trying to make, from a newbie to counting reads is a very big one. If you really want this fast, and insist on sticking to Windows, a short term solution is to get a Geneious software trial, import the reads and map them to human mtDNA. The trial would give you 31 days to do it.

ADD REPLYlink written 4.8 years ago by Adrian Pelin2.4k

The toolkit is called BEDOPS. It contains tools called bedops and bedmap that do efficient set and map operations on BED-formatted files. Among other features, it also offers convert2bed, which converts files in other genomic formats (like BAM) to the BED format, so that set and map operations can be done on those files. You would need to run it on the command line, under a Linux or Mac OS X installation. It does not run on Windows. The site I linked to has an overview of the toolkit, along with documentation for each tool.

ADD REPLYlink written 4.8 years ago by Alex Reynolds30k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1343 users visited in the last hour