Compute mean depth coverage for exome data with paired end, overlapping, features
1
1
Entering edit mode
4.0 years ago
fbrundu ▴ 300

I am trying to get the mean depth coverage for bam files with bedtools.

So far I've been using the command:

bedtools coverage -hist -abam bamfile.bam -b targets_exome.bed > bamfile.bam.hist.all.txt

But I find myself with a result similar to:

1   9998    10030   HWI-H217:64:C48Y8ACXX:4:1110:12530:87823/1  1   -   9998    10030   0,0,0   1   32, 0,  0   32  32  1.0000000

1   9998    10030   HWI-H217:64:C48Y8ACXX:4:1110:12530:87823/2  1   -   9998    10030   0,0,0   1   32, 0,  0   32  32  1.0000000

1   9999    10096   HWI-H217:64:C48Y8ACXX:4:1303:11532:70370/1  0   -   9999    10096   0,0,0   1   97, 0,  0   97  97  1.0000000

1   9999    10096   HWI-H217:64:C48Y8ACXX:4:1303:11532:70370/2  0   -   9999    10096   0,0,0   1   97, 0,  0   97  97  1.0000000

where the different features are overlapping (see #1 and #3).

I want to compute average coverage, defined as the number of reads covering the captured coding sequence of a haploid reference. How to do it in this case?

coverage bam depth exome • 6.3k views
ADD COMMENT
13
Entering edit mode
4.0 years ago

To get the mean depth of coverage for each interval specified in your BED file, use:

bedtools coverage -a BED -b [BAM] -mean > MeanCoverageBED.bedgraph

If you wanted to get the overall mean, then just pipe this into awk and get the average of the average, something like:

bedtools coverage -a BED -b [BAM] -mean | awk '{total+=$4} END {print total/NR}'

There are other commands with bedtools that I use as standard in my pipelines:

output depth of coverage for all regions in the BAM file

NB - sequential positions at the same read depth are merged into a single region

bedtools genomecov -ibam [BAM] -bga -split > CoverageTotal.bedgraph

output the per base read depth for each region in the BED file

bedtools coverage -a [BED] -b [BAM] -d > PerBaseDepthBED.bedgraph

get percent genome covered

zero=$(bedtools genomecov -ibam [BAM] -g [ref.fasta] -bga | awk '$4==0 {bpCountZero+=($3-$2)} {print bpCountZero}' | tail -1)
nonzero=$(bedtools genomecov -ibam [BAM] -g [ref.fasta] -bga | awk '$4>0 {bpCountNonZero+=($3-$2)} {print bpCountNonZero}' | tail -1)
percent=$(bc <<< "scale=6; ($nonzero / ($zero + $nonzero))*100")
ADD COMMENT
0
Entering edit mode

Thanks Kevin. My doubt is, to get the average coverage, since I have overlapping regions, it is necessary to get the per-base read depth, right?

ADD REPLY
0
Entering edit mode

Hey, it really depends on what exactly you want.

If you want BEDTools to treat each BED region independently (and calculate the mean read depth in each), then use bedtools coverage -a BED -b BAM -mean > MeanCoverageBED.bedgraph. This will ignore the fact that regions may be overlapping, I believe.

Another option is to first edit your BED regions so that overlapping regions are merged into a single entry, using BEDTools merge. After you do that and re-run the above command for determining mean read depth, you'll get a more informative mean for your regions.

You can also output the per-base read-depth, but you'll then have to calculate the mean yourself (possibly using awk again).

BEDTools is very powerful but one has to exactly sure what they want to get and which specific commands to use. There are many other tools for computing coverage/read-depth metrics too.

ADD REPLY
1
Entering edit mode

Thanks.. bedtools merge is exactly what I was looking for!

ADD REPLY
0
Entering edit mode

Great! Glad to have helped out

ADD REPLY
1
Entering edit mode

What version of bedtools is that? Version: v2.17.0 doesn't have the "-mean:" option. only -hist

nevermind, found it in v2.27.1

ADD REPLY
0
Entering edit mode

What is the resource requirement for coverage calculation? I am trying on some large windows (1Mbp) segments, it crashed bedtools version 2.24.0 several times when I tried.

This suprises me - given bedtools should be handle genome coverage, it shouldn't crash when calculating some 1Mb windows' coverage? Or it's a bug in 2.24.0

Thanks for your insight in advance!

ADD REPLY
0
Entering edit mode

Should not be a bug. Can you let us know the command that you are running? Also, the size of your BAM file, and number of regions in your BED file?

ADD REPLY
0
Entering edit mode

Hi, I used the command bedtools coverage -a BED -b [BAM] -mean > MeanCoverageBED.bedgraph to generate mean depth of coverage

cat taregts.bed
#CHR    START   END Gene
chr1    11106936    11107192    MTOR
chr1    11107388    11107655    MTOR
chr1    11108018    11108369    MTOR
chr1    11109097    11109459    MTOR
chr1    11109476    11109750    MTOR
chr1    11111018    11111177    MTOR
chr1    11112840    11113006    MTOR
chr1    11114308    11114538    MTOR
chr1    11114749    11114989    MTOR
chr1    11115312    11115590    MTOR

After running a while it showed a message killed

ADD REPLY
0
Entering edit mode

You ran out of compute resources.

ADD REPLY
0
Entering edit mode
lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              40
On-line CPU(s) list: 0-39
Thread(s) per core:  2
Core(s) per socket:  10
Socket(s):           2
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               79
Model name:          Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Stepping:            1
CPU MHz:             1197.201
CPU max MHz:         3100.0000
CPU min MHz:         1200.0000
BogoMIPS:            4389.46
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            25600K
NUMA node0 CPU(s):   0-39
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d

My system has 400GB free space and 128 GB of RAM

ADD REPLY
0
Entering edit mode

On which system is it? - a managed cluster? It may have reached the max compute time specified by your admins.

ADD REPLY
0
Entering edit mode

My bam file

       samtools view -H test.bam | grep @HD
        @HD VN:1.6  SO:coordinate
        @SQ SN:chr1 LN:248956422
        @SQ SN:chr2 LN:242193529
        @SQ SN:chr3 LN:198295559
        @SQ SN:chr4 LN:190214555
        @SQ SN:chr5 LN:181538259
        @SQ SN:chr6 LN:170805979
        @SQ SN:chr7 LN:159345973
        @SQ SN:chr8 LN:145138636
        @SQ SN:chr9 LN:138394717
        @SQ SN:chr10    LN:133797422
        @SQ SN:chr11    LN:135086622
        @SQ SN:chr12    LN:133275309
        @SQ SN:chr13    LN:114364328
        @SQ SN:chr14    LN:107043718
        @SQ SN:chr15    LN:101991189
        @SQ SN:chr16    LN:90338345
        @SQ SN:chr17    LN:83257441
        @SQ SN:chr18    LN:80373285
        @SQ SN:chr19    LN:58617616
        @SQ SN:chr20    LN:64444167
        @SQ SN:chr21    LN:46709983
        @SQ SN:chr22    LN:50818468
        @SQ SN:chrX LN:156040895
        @SQ SN:chrY LN:57227415
        @SQ SN:chrM LN:16569
        @SQ SN:chr1_KI270706v1_random   LN:175055
        @SQ SN:chr1_KI270707v1_random   LN:32032
        @SQ SN:chr1_KI270708v1_random   LN:127682
        @SQ SN:chr1_KI270709v1_random   LN:66860
        @SQ SN:chr1_KI270710v1_random   LN:40176
        @SQ SN:chr1_KI270711v1_random   LN:42210
        @SQ SN:chr1_KI270712v1_random   LN:176043
        @SQ SN:chr1_KI270713v1_random   LN:40745
        @SQ SN:chr1_KI270714v1_random   LN:41717
        @SQ SN:chr2_KI270715v1_random   LN:161471
        @SQ SN:chr2_KI270716v1_random   LN:153799
        @SQ SN:chr3_GL000221v1_random   LN:155397
        @SQ SN:HLA-DRB1*09:21   LN:16039
    ........................................................................
        @SQ SN:HLA-DRB1*10:01:01    LN:13501
        @SQ SN:HLA-DRB1*11:01:01    LN:13921
        @SQ SN:HLA-DRB1*11:01:02    LN:13931
        @SQ SN:HLA-DRB1*11:04:01    LN:13919
        @SQ SN:HLA-DRB1*15:01:01:01 LN:11080
        @SQ SN:HLA-DRB1*15:01:01:02 LN:11571
        @SQ SN:HLA-DRB1*15:01:01:03 LN:11056
        @SQ SN:HLA-DRB1*15:03:01:01 LN:11567
        @SQ SN:HLA-DRB1*15:03:01:02 LN:11569
        @SQ SN:HLA-DRB1*16:02:01    LN:11005
        @RG ID:DT1226_H2NC7CCX2_L5_1.fastq  SM:DT1226_H2NC7CCX2_L5_1.fastq  PL:Illumina
        @PG ID:bwa  PN:bwa  VN:0.7.12-r1039 CL:bwa mem -M -t 44 -R @RG\tID:DT1226_H2NC7CCX2_L5_1.fastq\tSM:DT1226_H2NC7CCX2_L5_1.fastq\tPL:Illumina Homo_sapiens_assembly38.fasta DT1226_H2NC7CCX2_L5_1.fastq.gz DT1226_H2NC7CCX2_L5_2.fastq.gz
@PG ID:MarkDuplicates   VN:Version:4.1.4.0  CL:MarkDuplicates  --INPUT DT1226_H2NC7CCX2_L5_1.fastq-sorted.bam --OUTPUT DT1226_H2NC7CCX2_L5_1.fastq.bam --METRICS_FILE DT1226_H2NC7CCX2_L5_1.fastq.bam.metrics --CREATE_INDEX true  --MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP 50000 --MAX_FILE_HANDLES_FOR_READ_ENDS_MAP 8000 --SORTING_COLLECTION_SIZE_RATIO 0.25 --TAG_DUPLICATE_SET_MEMBERS false --REMOVE_SEQUENCING_DUPLICATES false --TAGGING_POLICY DontTag --CLEAR_DT true --DUPLEX_UMI false --ADD_PG_TAG_TO_READS true --REMOVE_DUPLICATES false --ASSUME_SORTED false --DUPLICATE_SCORING_STRATEGY SUM_OF_BASE_QUALITIES --PROGRAM_RECORD_ID MarkDuplicates --PROGRAM_GROUP_NAME MarkDuplicates --READ_NAME_REGEX <optimized capture of last three ':' separated fields as numeric values> --OPTICAL_DUPLICATE_PIXEL_DISTANCE 100 --MAX_OPTICAL_DUPLICATE_SET_SIZE 300000 --VERBOSITY INFO --QUIET false --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 2 --MAX_RECORDS_IN_RAM 500000 --CREATE_MD5_FILE false --GA4GH_CLIENT_SECRETS client_secrets.json --help false --version false --showHidden false --USE_JDK_DEFLATER false --USE_JDK_INFLATER false   PN:MarkDuplicates
@PG ID:GATK ApplyBQSR   VN:4.1.4.0  CL:ApplyBQSR  --output DT1226_H2NC7CCX2_L5_1.fastq_bqsr.bam --bqsr-recal-file DT1226_H2NC7CCX2_L5_1.fastq_recal_data.table --input DT1226_H2NC7CCX2_L5_1.fastq.bam --create-output-bam-index true  --preserve-qscores-less-than 6 --use-original-qualities false --quantize-quals 0 --round-down-quantized false --emit-original-quals false --global-qscore-prior -1.0 --interval-set-rule UNION --interval-padding 0 --interval-exclusion-padding 0 --interval-merging-rule ALL --read-validation-stringency SILENT --seconds-between-progress-updates 10.0 --disable-sequence-dictionary-validation false --create-output-bam-md5 false --create-output-variant-index true --create-output-variant-md5 false --lenient false

Unique bed

#chr
chr1
chr2
chr3
chr4
chr5
chr6
chr7
chr8
chr9
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr20
chr21
chr22
chrX
ADD REPLY

Login before adding your answer.

Traffic: 2032 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6