Question: Calculate Per Exon/Per Gene Coverage
thecuriousbiologist wrote:

Hi,

I have a list of exons (in BED format) and I wish to find the coverage per exon from a BAM file. I was looking at bedtools coverageBed, but I don't think that gives me per-exon coverage.

Are there any suggestions on how I should go about calculating the coverage of each exon?

Also, I eventually have to calculate per-gene coverage as well. How do I calculate per-gene coverage from the per-exon coverage obtained above? Do I simply take the average of the coverage across all exons of the gene?


What type of data are these? RNA-seq? Exome capture? Genomic DNA? And what are you going to do with the coverage numbers?

-- Sean Davis
Irsan wrote:
bedtools coverage -abam sample.bam -b exons.bed -count

should give you the number of reads in your alignment that overlap each region in your BED file. So if each region in your BED is an exon, then you have a read count for each exon. If you want it per gene, then download a BED file with RefSeq genes from the UCSC Table Browser and run

bedtools coverage -abam sample.bam -b genes.bed -count

You can get the genes.bed file by going to the UCSC Table Browser and setting:

clade = Mammal
genome = Human
assembly = hg19
group = Genes and Gene Prediction Tracks
track = RefSeq Genes
table = kgXref
output format = selected fields from primary and related tables
output file = genes.bed

press "get output"

select

geneSymbol knownGene

press "allow selection from checked tables"

select

geneSymbol chrom txStart txEnd

and finally press "get output"
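
If you would rather derive per-gene numbers from the per-exon counts instead of re-counting against gene intervals, note that a plain average over exons weighs short and long exons equally; a length-weighted aggregate is usually more meaningful. A minimal awk sketch, assuming the five-column -count layout (chrom, start, end, gene, read_count), reads of roughly 100 bp, and a hypothetical input file exon_counts.bed:

awk -v readlen=100 '{
    reads[$4] += $5          # reads overlapping this gene's exons
    bases[$4] += $3 - $2     # exonic bases of the gene
}
END {
    for (g in reads)
        # approximate mean depth = sequenced bases / exonic length
        printf "%s\t%.2f\n", g, reads[g] * readlen / bases[g]
}' exon_counts.bed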

-- Irsan

Hello Irsan, I followed your answer and got the counts for each feature, but I have multiple rows for each feature and I don't know whether I should take the average of all of them or just the one with the highest count. Can you guide me on what I should do? The head of my count file looks like:

1       67092164        67231852        C1orf141        12
1       67092175        67127261        C1orf141        2
1       67092175        67127261        C1orf141        2
1       67092396        67127261        C1orf141        2
1       201283451       201332993       PKP1    141
1       201283451       201332993       PKP1    141
1       201283511       201330288       PKP1    72
1       201283702       201328836       PKP1    69
-- Bioinformatist Newbie

@Irsan, can you help me here?

-- Bioinformatist Newbie

If you need the average coverage across all regions, just pipe your output to a simple awk script like:

awk '{sum+=$5} END {print "Average coverage is:", sum/NR}'
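
Since the listing above contains rows repeated verbatim (one per transcript), you may want to drop those first. For example, a sketch reusing the commands from the accepted answer:

bedtools coverage -abam sample.bam -b genes.bed -count | sort -u | awk '{sum+=$5} END {print "Average coverage is:", sum/NR}'

Note that sort -u only removes identical rows; overlapping transcripts of the same gene with different coordinates will still appear separately.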

-- Paul
Pierre Lindenbaum wrote:

I wrote a tool for that: https://code.google.com/p/variationtoolkit/wiki/BedDepth (later ported to https://github.com/lindenb/jvarkit/wiki/BamStats04 ). It takes a BED file and a set of BAMs and outputs a few statistics about the regions of interest:

-f (bam-file) add this bam file. Can be called multiple times
-m (min-qual uint32) (optional) min SAM record Quality.
-M (max-qual uint32) (optional) max SAM record Quality.
-D (min-depth) (optional) min depth. Can be called multiple times
-d do NOT discard duplicates


$ echo -e "seq2\t100\t200\nseq1\t0\t10\nseq2\t1500\t1550" |\
beddepth -D 5 -D 100 -D 200 -f  sorted.bam -f ex1.bam |\
verticalize 


>>>     2
$1      #chrom                                  seq2
$2      start                                   100
$3      end                                     200
$4      size(sorted.bam)                        100
$5      covered[depth:0](sorted.bam)            100
$6      percent_covered[depth:0](sorted.bam)    1
$7      covered[depth:5](sorted.bam)            100
$8      percent_covered[depth:5](sorted.bam)    1
$9      covered[depth:100](sorted.bam)          0
$10     percent_covered[depth:100](sorted.bam)  0
$11     covered[depth:200](sorted.bam)          0
$12     percent_covered[depth:200](sorted.bam)  0
$13     min(sorted.bam)                         18
$14     max(sorted.bam)                         93
$15     median(sorted.bam)                      57
$16     mean(sorted.bam)                        57.27
$17     size(ex1.bam)                           100
$18     covered[depth:0](ex1.bam)               100
$19     percent_covered[depth:0](ex1.bam)       1
$20     covered[depth:5](ex1.bam)               100
$21     percent_covered[depth:5](ex1.bam)       1
$22     covered[depth:100](ex1.bam)             0
$23     percent_covered[depth:100](ex1.bam)     0
$24     covered[depth:200](ex1.bam)             0
$25     percent_covered[depth:200](ex1.bam)     0
$26     min(ex1.bam)                            6
$27     max(ex1.bam)                            31
$28     median(ex1.bam)                         19
$29     mean(ex1.bam)                           19.09
<<<     2

>>>     3
$1      #chrom                                  seq1
$2      start                                   0
$3      end                                     10
$4      size(sorted.bam)                        10
$5      covered[depth:0](sorted.bam)            10
$6      percent_covered[depth:0](sorted.bam)    1
$7      covered[depth:5](sorted.bam)            8
$8      percent_covered[depth:5](sorted.bam)    0.8
$9      covered[depth:100](sorted.bam)          0
$10     percent_covered[depth:100](sorted.bam)  0
$11     covered[depth:200](sorted.bam)          0
$12     percent_covered[depth:200](sorted.bam)  0
$13     min(sorted.bam)                         3
$14     max(sorted.bam)                         15
$15     median(sorted.bam)                      12
$16     mean(sorted.bam)                        9.3
$17     size(ex1.bam)                           10
$18     covered[depth:0](ex1.bam)               10
$19     percent_covered[depth:0](ex1.bam)       1
$20     covered[depth:5](ex1.bam)               2
$21     percent_covered[depth:5](ex1.bam)       0.2
$22     covered[depth:100](ex1.bam)             0
$23     percent_covered[depth:100](ex1.bam)     0
$24     covered[depth:200](ex1.bam)             0
$25     percent_covered[depth:200](ex1.bam)     0
$26     min(ex1.bam)                            1
$27     max(ex1.bam)                            5
$28     median(ex1.bam)                         4
$29     mean(ex1.bam)                           3.1
<<<     3

>>>     4
$1      #chrom                                  seq2
$2      start                                   1500
$3      end                                     1550
$4      size(sorted.bam)                        50
$5      covered[depth:0](sorted.bam)            50
$6      percent_covered[depth:0](sorted.bam)    1
$7      covered[depth:5](sorted.bam)            50
$8      percent_covered[depth:5](sorted.bam)    1
$9      covered[depth:100](sorted.bam)          33
$10     percent_covered[depth:100](sorted.bam)  0.66
$11     covered[depth:200](sorted.bam)          0
$12     percent_covered[depth:200](sorted.bam)  0
$13     min(sorted.bam)                         45
$14     max(sorted.bam)                         129
$15     median(sorted.bam)                      114
$16     mean(sorted.bam)                        97.02
$17     size(ex1.bam)                           50
$18     covered[depth:0](ex1.bam)               50
$19     percent_covered[depth:0](ex1.bam)       1
$20     covered[depth:5](ex1.bam)               50
$21     percent_covered[depth:5](ex1.bam)       1
$22     covered[depth:100](ex1.bam)             0
$23     percent_covered[depth:100](ex1.bam)     0
$24     covered[depth:200](ex1.bam)             0
$25     percent_covered[depth:200](ex1.bam)     0
$26     min(ex1.bam)                            15
$27     max(ex1.bam)                            43
$28     median(ex1.bam)                         38
$29     mean(ex1.bam)                           32.34
<<<     4
-- Pierre Lindenbaum

Hello,

I compiled your package and tried it:

# java -jar ~/programs/jvarkit/dist/bamstats04.jar BED=gdr18_k75-85_NHC_ORFs-all.bed I=GDR-18.sort.grp.md.bam > GDR18.coverage.out
[Sat Jun 07 21:27:02 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 IN=GDR-18.sort.grp.md.bam BEDILE=gdr18_k75-85_NHC_ORFs-all.bed    NO_DUP=true NO_ORPHAN=true NO_VENDOR=true MMQ=0 MIN_COVERAGE=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sat Jun 07 21:27:02 EDT 2014] Executing as adrian@biopower2 on Linux 2.6.32-358.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_09-icedtea-mockbuild_2013_01_16_18_52-b00; Picard version: null JdkDeflater
INFO    2014-06-07 21:27:02     BamStats04      Scanning GDR-18.sort.grp.md.bam
ERROR   2014-06-07 21:27:02     BamStats04
java.lang.IllegalArgumentException: Invalid reference index -1
        at htsjdk.samtools.QueryInterval.<init>(QueryInterval.java:24)
        at htsjdk.samtools.SAMFileReader.query(SAMFileReader.java:397)
        at htsjdk.samtools.SAMFileReader.queryOverlapping(SAMFileReader.java:418)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.doWork(BamStats04.java:93)
        at com.github.lindenb.jvarkit.util.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:178)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.main(BamStats04.java:189)
[Sat Jun 07 21:27:02 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058027008

Any ideas on how to fix this?

-- Adrian Pelin

The chromosome names defined in your BED and in GDR-18.sort.grp.md.bam are not the same.
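
For example, a quick way to compare the two sets of names (assuming samtools is available):

samtools view -H GDR-18.sort.grp.md.bam | grep '^@SQ'    # names in the BAM header
cut -f1 gdr18_k75-85_NHC_ORFs-all.bed | sort -u          # names used in the BED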

-- Pierre Lindenbaum

Alright, makes sense. I fixed that; now there is a new problem:

java -jar ~/programs/jvarkit/dist/bamstats04.jar BED=gdr18_k75-85_NHC_conc-allORFs.bed I=GDR-18.sort.grp.md.bam > GDR18.coverage.out
[Mon Jun 09 17:09:06 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 IN=GDR-18.sort.grp.md.bam BEDILE=gdr18_k75-85_NHC_conc-allORFs.bed    NO_DUP=true NO_ORPHAN=true NO_VENDOR=true MMQ=0 MIN_COVERAGE=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Mon Jun 09 17:09:06 EDT 2014] Executing as adrian@biopower2 on Linux 2.6.32-358.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_09-icedtea-mockbuild_2013_01_16_18_52-b00; Picard version: null JdkDeflater
INFO    2014-06-09 17:09:06     BamStats04      Scanning GDR-18.sort.grp.md.bam
ERROR   2014-06-09 17:09:08     BamStats04
java.lang.IllegalStateException: Inappropriate call if not paired read
        at htsjdk.samtools.SAMRecord.requireReadPaired(SAMRecord.java:654)
        at htsjdk.samtools.SAMRecord.getProperPairFlag(SAMRecord.java:662)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.doWork(BamStats04.java:100)
        at com.github.lindenb.jvarkit.util.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:178)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.main(BamStats04.java:189)
[Mon Jun 09 17:09:08 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2058027008

Now I am not sure how to fix this. What happened is that some of my PE reads were overlapping, so I merged them. Then I mapped (bwa) the SE and PE reads and merged the two BAM files into one. Is that why there is an error?

Thanks.

-- Adrian Pelin

You're right. I've fixed the code to handle non-paired-end reads, thanks: https://github.com/lindenb/jvarkit/commit/72c4f302902e44df15f20a1d181cb551ef9dba37#diff-8d505bb25f8d3b768022df93d1527778

-- Pierre Lindenbaum
Ashutosh Pandey wrote:

I use GATK DepthOfCoverage for calculating coverage. You can provide multiple interval files (one for exons, one for genes) and it will calculate the coverage for you. Gene coverage can differ from the average across all exons: reads may align to introns, and you will miss them if you only include exons. GATK link: http://gatkforums.broadinstitute.org/discussion/40/depthofcoverage-v3-0-how-much-data-do-i-have
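
For example, a typical GATK 3 invocation might look like the following sketch (reference.fasta, sample.bam and exons.bed are placeholder names):

java -jar GenomeAnalysisTK.jar \
    -T DepthOfCoverage \
    -R reference.fasta \
    -I sample.bam \
    -L exons.bed \
    -o sample.exon_coverage

If I recall correctly, passing a RefSeq gene list via -geneList additionally produces per-gene summaries.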

-- Ashutosh Pandey

Be careful when using GATK DepthOfCoverage. It has undocumented filters that remove some of your reads without any explained mechanism. Those filtered-out reads may sometimes be important, but GATK has no option to deactivate the filters.

-- earthworm
biorepine wrote:

It is best to calculate normalized expression for each exon. This script would do the job. Good luck!
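
The linked script is not reproduced here, but as an illustration, an RPKM-style normalization of the bedtools coverage -count output could look like this sketch (column 5 is assumed to hold the read count; exon_counts.bed is a placeholder name):

total=$(samtools view -c -F 260 sample.bam)    # mapped, primary reads only
awk -v total="$total" '{
    len = $3 - $2                               # exon length in bp
    rpkm = ($5 * 1e9) / (len * total)           # reads per kb per million mapped reads
    print $0 "\t" rpkm
}' exon_counts.bed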

-- biorepine