To Calculate The Exact Total Number Of Mapped Reads In Exome Regions
4
0
Entering edit mode
10.4 years ago
ivivek_ngs ★ 5.2k

Dear All,

I have some questions here. I want to do some quality control analysis on my exome data that are mapped on the reference genome. I am having the input bam file for a sample which contains reads that got mapped to reference genome(hg19.fa). So it is like my mapped reads are 80 million for this sample. Now I want to calculate out of this 80 million mapped reads how many got mapped into the exome region. For this I need to supply the exome baits bed file (probe/covered.bed) provided by the company. We used the Agilent SureSelectV4 here. So is there any one line command with which using these three informations (input.bam, hg19.fa and exome_baits.bed) I can calculate the total number of mapped reads on the exonic regions? Any one line command. In different posts I see a lot of tools being mentioned. I tried to used CalculateHSmetrics of Picard but it needs the bed file with header so of now use now. Then I used the walker of GATK which is the DepthofCoverage but there we usually get the mean of number of time a bases is read(for me its 73.9) and the %_of_bases_reads above 15 times is about 70% which is also a good qaulity, we also get how many loci has been read more than once which gives a histogram of cumulative reads coverage at each loci but if I want to just calculate the number of mapped reads that got mapped in the exome region using the input bam file, reference genome and exome_baits.bed file how shall I do it? Any single line command for that? This might be recurrent but am not getting any specific answer in the forums so I had to post. Any suggestions?

ngs exome-sequencing coverage bedtools gatk • 6.4k views
ADD COMMENT
1
Entering edit mode
10.4 years ago

Look into BEDTools intersectBed.

ADD COMMENT
0
Entering edit mode
ADD COMMENT
0
Entering edit mode
10.4 years ago
Irsan ★ 7.8k

Do not use a file with the bait regions of your exome-seq sample prep kit. You are not interested in how many genes align in regions that your sample prep is aiming at. You are trying to get the amount of reads that align in coding sequences.

So download a gtf file with all genes, for example Homo_sapiens.GRCh37.73.gtf.gz and then use bedops to get the thing done

# unzip the gtf
gunzip -dc Homo_sapiens.GRCh37.73.gtf.gz | \
# only keep CDS features
awk -F "\t" '{if($3==CDS)print $0}' | \
# convert to bed with a bedops script
gtf2bed - | \
# collapse overlapping features into one | \
bedops --merge - > Hs_GRCh37.73.bed
# then use bedops to do the counting
# convert bam to bed
bam2bed < yourAlignments.bam | \
# and use bedops to do the counting | \
bedops --count Hs_GRCh37.73.bed - | \
# and use awk to get the total amount of reads
# that were counted by bedops
awk 'BEGIN{count=0}{count=count+$1}END{print count}'
ADD COMMENT
0
Entering edit mode
10.4 years ago
Wei Shi • 0

The featureCounts program should be helpful for this - http://bioinf.wehi.edu.au/featureCounts

ADD COMMENT
0
Entering edit mode

Thank for the inputs, I have already done the read counts with the CountRead walker in GATK and it was a bug in the output format(as I am using the GATK version 2.3.4) which was not letting me understand the output properly but now its resolved. Thanks everyone for the suggestions.

ADD REPLY

Login before adding your answer.

Traffic: 2065 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6