Calculate Per Exon/Per Gene Coverage
4
4
Entering edit mode
8.7 years ago

Hi,

I have a list of exons (in BED format) and I wish to find the coverage per exon from a BAM file. I was looking at Bedtools coverageBED but I don't think that gives me the per exon coverage.

Are there any suggestions on how I should go about calculating the coverage of each exon ?

Also, eventually, I have to also calculate per gene coverage. How do I go about calculating the per gene coverage from the per exon coverage obtained above ? Do I simply take the average of the coverage across all exons of the gene ?

exon gene coverage • 22k views
ADD COMMENT
0
Entering edit mode

What type of data are these? RNA-seq? Exome capture? Genomic DNA? And what are you going to do with the coverage numbers?

ADD REPLY
9
Entering edit mode
8.7 years ago
Irsan ★ 7.4k
bedtools coverage -abam sample.bam -b exons.bed -count

should give you the amount of reads that in you alignment for each region in your bed file. So if each region in your bed is an exon than you have coverage numbers for each exon. If you want to have it per gene than download a bed-file with RefSeq genes from UCSC table browser and do

bedtools coverage -abam sample.bam -b genes.bed -count

You can get the genes.bed file by going to UCSC table browser, then

  • clade = mammal
  • genome = human
  • assembly = hg19
  • group = gene and gene prediction tracks
  • track = RefSeq genes
  • table = kgXref
  • output format = select fields from primary and selected tables
  • output file = genes.bed

press "get output"

select

  • geneSymbol
  • knownGene

press "allow selection from checked tables"

select

  • geneSymbol
  • chrom
  • txStart
  • txEnd

and finally press "get output"

ADD COMMENT
0
Entering edit mode

Hello Irsan, I followed you answer and got the counts for each feature but I have multiple row for each feature and I don't know if I have to take average of all or just the one with the highest count. Can you guide me what I should do?? my header of count file is like:

1       67092164        67231852        C1orf141        12
1       67092175        67127261        C1orf141        2
1       67092175        67127261        C1orf141        2
1       67092396        67127261        C1orf141        2
1       201283451       201332993       PKP1    141
1       201283451       201332993       PKP1    141
1       201283511       201330288       PKP1    72
1       201283702       201328836       PKP1    69
ADD REPLY
0
Entering edit mode

@Irsan Can you help me here..?

ADD REPLY
0
Entering edit mode

If you need average coverage to all regions just pipe your output to simply awk script like -

awk  '{sum+=$5}END{Print "Average coverage is:", sum/NR}'
ADD REPLY
4
Entering edit mode
8.7 years ago

I wrote that tool https://code.google.com/p/variationtoolkit/wiki/BedDepth https://github.com/lindenb/jvarkit/wiki/BamStats04 it takes a bed and a set of BAMs and output a few statistics about the regions of interest:

-f (bam-file) add this bam file. Can be called multiple times
-m (min-qual uint32) (optional) min SAM record Quality.
-M (max-qual uint32) (optional) max SAM record Quality.
-D (min-depth) (optional) min depth. Can be called multiple times
-d do NOT discard duplicates

 

$ echo -e "seq2\t100\t200\nseq1\t0\t10\nseq2\t1500\t1550" |\
    beddepth -D 5 -D 100 -D 200 -f  sorted.bam -f ex1.bam |\
    verticalize

ADD COMMENT
0
Entering edit mode

Hello,

I compiled your package and tried it:

java -jar ~/programs/jvarkit/dist/bamstats04.jar BED=gdr18_k75-85_NHC_ORFs-all.bed I=GDR-18.sort.grp.md.bam > GDR18.coverage.out
[Sat Jun 07 21:27:02 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 IN=GDR-18.sort.grp.md.bam BEDILE=gdr18_k75-85_NHC_ORFs-all.bed    NO_DUP=true NO_ORPHAN=true NO_VENDOR=true MMQ=0 MIN_COVERAGE=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Sat Jun 07 21:27:02 EDT 2014] Executing as adrian@biopower2 on Linux 2.6.32-358.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_09-icedtea-mockbuild_2013_01_16_18_52-b00; Picard version: null JdkDeflater
INFO    2014-06-07 21:27:02     BamStats04      Scanning GDR-18.sort.grp.md.bam
ERROR   2014-06-07 21:27:02     BamStats04
java.lang.IllegalArgumentException: Invalid reference index -1
        at htsjdk.samtools.QueryInterval.<init>(QueryInterval.java:24)
        at htsjdk.samtools.SAMFileReader.query(SAMFileReader.java:397)
        at htsjdk.samtools.SAMFileReader.queryOverlapping(SAMFileReader.java:418)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.doWork(BamStats04.java:93)
        at com.github.lindenb.jvarkit.util.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:178)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.main(BamStats04.java:189)
[Sat Jun 07 21:27:02 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=2058027008

Any ideas on how to fix this?

ADD REPLY
0
Entering edit mode

the chromosomes defined in your BED and in GDR-18.sort.grp.md.bam are not the same.

ADD REPLY
0
Entering edit mode

Alright, makes sense. I fixed that, new problem now:

java -jar ~/programs/jvarkit/dist/bamstats04.jar BED=gdr18_k75-85_NHC_conc-allORFs.bed I=GDR-18.sort.grp.md.bam > GDR18.coverage.out
[Mon Jun 09 17:09:06 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 IN=GDR-18.sort.grp.md.bam BEDILE=gdr18_k75-85_NHC_conc-allORFs.bed    NO_DUP=true NO_ORPHAN=true NO_VENDOR=true MMQ=0 MIN_COVERAGE=0 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
[Mon Jun 09 17:09:06 EDT 2014] Executing as adrian@biopower2 on Linux 2.6.32-358.el6.x86_64 amd64; OpenJDK 64-Bit Server VM 1.7.0_09-icedtea-mockbuild_2013_01_16_18_52-b00; Picard version: null JdkDeflater
INFO    2014-06-09 17:09:06     BamStats04      Scanning GDR-18.sort.grp.md.bam
ERROR   2014-06-09 17:09:08     BamStats04
java.lang.IllegalStateException: Inappropriate call if not paired read
        at htsjdk.samtools.SAMRecord.requireReadPaired(SAMRecord.java:654)
        at htsjdk.samtools.SAMRecord.getProperPairFlag(SAMRecord.java:662)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.doWork(BamStats04.java:100)
        at com.github.lindenb.jvarkit.util.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:178)
        at com.github.lindenb.jvarkit.tools.bamstats04.BamStats04.main(BamStats04.java:189)
[Mon Jun 09 17:09:08 EDT 2014] com.github.lindenb.jvarkit.tools.bamstats04.BamStats04 done. Elapsed time: 0.04 minutes.
Runtime.totalMemory()=2058027008

Now I am not sure how to fix this, but what happend is that since some of my PE reads were overlapping, I merged them. Then I mapped (bwa) the SE and PE reads and merged the 2 bam files into one. Is that why there is an error?

Thanks.

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode
8.7 years ago

I use GATK DepthofCoverage for calculating coverage. You can provide multiple files (one for exons, one for genes) and it will calculate the coverage for you. Gene coverage could be different from average across all the exons. Reads may be aligned to Introns and you may miss them if you only include exons. Link GATK: http://gatkforums.broadinstitute.org/discussion/40/depthofcoverage-v3-0-how-much-data-do-i-have

ADD COMMENT
0
Entering edit mode

Be careful when using GATK DepthofCoverage. It has filters with no documentation, which filter out some of your reads without any explained mechanism. Sometimes those filtered-out reads may be important, but the GATK has no option to deactivate the filters.

ADD REPLY
0
Entering edit mode
8.7 years ago
biorepine ★ 1.5k

It is best to calculate normalized expression for each exon. This script would do the job. Good luck!

ADD COMMENT

Login before adding your answer.

Traffic: 1397 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6