Total Number Of Mapped Bases In Exome Ngs Data
3
0
Entering edit mode
10.4 years ago
User6891 ▴ 330

Hi,

I would like to calculate the total number of mapped bases for whole exome data from Illumina.

I'm already calculating the coverage per base with samtools mpileup or samtools depth. I was thinking of just taking the sum of the coverage of each base, but when doing this in Linux with 'awk' I get an 'out of memory alloc' error. That's why I was wondering if there are some specific tools that calculate the total number of mapped bases?

ngs linux coverage • 4.7k views
ADD COMMENT
0
Entering edit mode

What's your awk command? That should be doable in awk.

ADD REPLY
0
Entering edit mode

cat myfile.txt | awk '{SUM += $NF} END {print SUM}'

ADD REPLY
0
Entering edit mode

That seems like an odd way to do things. NF is the number of fields in the current line. Also, you never initialized SUM (although perhaps things are auto-initialized to 0 in awk). I would assume that something like cat pileup.txt | awk 'BEGIN{$SUM=0}{$SUM += $3}END{print $SUM}' would work.

ADD REPLY
0
Entering edit mode

I want to take the sum of all the values in the last column of my file, except for the first line (which is a header)

ADD REPLY
0
Entering edit mode

Ah, it would be really helpful if you posted a couple lines from that file. It's rather difficult to read your mind. I'm going to reply below in a comment about what I suspect the root of the "out of memory alloc" error is.

ADD REPLY
0
Entering edit mode

It's just a tab-delimited file with 'n' columns and 'm' lines. First line is a header line. I found a command to skip the header line with awk: awk '{if(NR>1)print}'. If I combine this with awk '{print $NF}', it only prints the values from the last column ( without header line). Then I found this command to sum the values of a column: paste -sd+ | bc So when I combine those 4 commands, I thought it would calculate the sum of the values in the last columns. But when doing that I get an error of memory alloc

ADD REPLY
1
Entering edit mode

Oh, I see what you're doing wrong. You're linearizing the last column, which is probably pretty long, and then trying to parse it with awk. As I mentioned below, this sort of approach is pretty memory intensive. I'll just post an answer with a single line solution. In the future, please post all of the relevant information when you post the question.

ADD REPLY
0
Entering edit mode

The root cause of the "out of memory alloc" error is likely that awk is trying to read your whole file as a single line. awk has a good bit of overhead when dealing with arrays, so this is going to eat up all of the memory on your system. You might try either changing the line endings on the original file or, better yet, just tell awk what the correct line ending is. Alternatively, just parse the pileup as in the example from my comment above.

ADD REPLY
1
Entering edit mode
10.4 years ago
Dan D 7.4k

You could use Picard's CalculateHSMetrics tool. It's not as elegant as a command-line approach, but it gives a lot of other useful information.

ADD COMMENT
1
Entering edit mode
10.4 years ago

Here is the one line awk solution, given your NxM text file with a header line:

cat some_file.txt | awk 'BEGIN{$otal=0;}{if(NR>1) { total += $NF; }} END{print total;}'

or something like that.

ADD COMMENT
0
Entering edit mode

It's working except that you forgot a 't' in 'BEGIN$total=0;'

ADD REPLY
0
Entering edit mode

No, but that's why I wrote "or something like that", since I don't have otherwise know what your file looks like. I've just created a small test file and updated the answer so it processes that correctly (I've been writing too much in other languages recently and added some extraneous $'s and had an NR where there should have been an NF).

ADD REPLY
0
Entering edit mode
10.4 years ago

use samtools to count all mapped reads from a bam

samtools view -c -F 4 in.bam

ADD COMMENT
1
Entering edit mode

That's just the number of mapped reads. The op wants the number of mapped bases.

ADD REPLY
0
Entering edit mode

oups sorry I read the question to fast

ADD REPLY
0
Entering edit mode

Been there done that :)

ADD REPLY
0
Entering edit mode

I have some questions here. I want to do the same thing but in some different way. I am having the input bam file for a sample which contains reads that got mapped to reference genome(hg19.fa). So it is like my mapped reads are 80 million for this sample. Now I want to calculate out of this 80 million mapped reads how many got mapped into the exome region. For this I need to supply the exome baits bed file (probe/covered.bed) provided by the company. We use the Agilent SureSelectV4 here. So is there any one line command with which using these three informations (input.bam, hg19.fa and exome_baits.bed) I can calculate the total number of mapped reads on the exonic regions? Any suggestions?

ADD REPLY

Login before adding your answer.

Traffic: 2008 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6