Question: To Calculate The Exact Total Number Of Mapped Reads In Exome Regions
gravatar for ivivek_ngs
7.1 years ago by
Seattle,WA, USA
ivivek_ngs5.0k wrote:

Dear All,

I have some questions here. I want to do some quality control analysis on my exome data that are mapped on the reference genome. I am having the input bam file for a sample which contains reads that got mapped to reference genome(hg19.fa). So it is like my mapped reads are 80 million for this sample. Now I want to calculate out of this 80 million mapped reads how many got mapped into the exome region. For this I need to supply the exome baits bed file (probe/covered.bed) provided by the company. We used the Agilent SureSelectV4 here. So is there any one line command with which using these three informations (input.bam, hg19.fa and exome_baits.bed) I can calculate the total number of mapped reads on the exonic regions? Any one line command. In different posts I see a lot of tools being mentioned. I tried to used CalculateHSmetrics of Picard but it needs the bed file with header so of now use now. Then I used the walker of GATK which is the DepthofCoverage but there we usually get the mean of number of time a bases is read(for me its 73.9) and the %_of_bases_reads above 15 times is about 70% which is also a good qaulity, we also get how many loci has been read more than once which gives a histogram of cumulative reads coverage at each loci but if I want to just calculate the number of mapped reads that got mapped in the exome region using the input bam file, reference genome and exome_baits.bed file how shall I do it? Any single line command for that? This might be recurrent but am not getting any specific answer in the forums so I had to post. Any suggestions?

ADD COMMENTlink modified 7.1 years ago by Wei Shi0 • written 7.1 years ago by ivivek_ngs5.0k
gravatar for swbarnes2
7.1 years ago by
United States
swbarnes29.3k wrote:

Look into BEDTools intersectBed.

ADD COMMENTlink written 7.1 years ago by swbarnes29.3k
gravatar for Rahul Sharma
7.1 years ago by
Rahul Sharma610
Rahul Sharma610 wrote:

Please check this What Is The Best Pipeline For Human Whole Exome Sequencing?,

ADD COMMENTlink written 7.1 years ago by Rahul Sharma610
gravatar for Irsan
7.1 years ago by
Irsan7.2k wrote:

Do not use a file with the bait regions of your exome-seq sample prep kit. You are not interested in how many genes align in regions that your sample prep is aiming at. You are trying to get the amount of reads that align in coding sequences.

So download a gtf file with all genes, for example Homo_sapiens.GRCh37.73.gtf.gz and then use bedops to get the thing done

# unzip the gtf
gunzip -dc Homo_sapiens.GRCh37.73.gtf.gz | \
# only keep CDS features
awk -F "\t" '{if($3==CDS)print $0}' | \
# convert to bed with a bedops script
gtf2bed - | \
# collapse overlapping features into one | \
bedops --merge - > Hs_GRCh37.73.bed
# then use bedops to do the counting
# convert bam to bed
bam2bed < yourAlignments.bam | \
# and use bedops to do the counting | \
bedops --count Hs_GRCh37.73.bed - | \
# and use awk to get the total amount of reads
# that were counted by bedops
awk 'BEGIN{count=0}{count=count+$1}END{print count}'
ADD COMMENTlink modified 7.1 years ago • written 7.1 years ago by Irsan7.2k
gravatar for Wei Shi
7.1 years ago by
Wei Shi0
Wei Shi0 wrote:

The featureCounts program should be helpful for this -

ADD COMMENTlink written 7.1 years ago by Wei Shi0

Thank for the inputs, I have already done the read counts with the CountRead walker in GATK and it was a bug in the output format(as I am using the GATK version 2.3.4) which was not letting me understand the output properly but now its resolved. Thanks everyone for the suggestions.

ADD REPLYlink written 7.1 years ago by ivivek_ngs5.0k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2146 users visited in the last hour