rna seq genome coverage
1
0
Entering edit mode
6.7 years ago
ea1402 ▴ 10

Hello All,

I need some statistics for my analysis before writing my own code I would like to know whether an existent tool is capable of doing this. I have aligned several samples to the human genome and I would like to know some statistics:

1) I would like to know what percent of bases in each exon is covered, and for the definition of coverage i would like to give a user defined number say n, then a base in an exon is covered if there are at least n reads supporting. So I need a vector of number exons and percent covered.

2) Next I would like to know how many/percent exons of a gene is 100% covered, so this needs to be a vector of number of genes

3) For the remaining reads not aligning to exons, I would like to find continous regions of coverage i.e. a 200 bp region supported with at least n number of reads.

I guess with some programming this is doable but requires some effort, I would like to know whether I can get these statistics using existent tools such bedtools,gatk, picard etc. Thank you all for your suggestions

RNA-Seq coverage picard gatk • 2.8k views
ADD COMMENT
0
Entering edit mode

1) per exon coverage, refer to : Calculate Per Exon/Per Gene Coverage

ADD REPLY
0
Entering edit mode

Thanks the coverageBed function from the bedtools somehow does what i need but is there an option there to set a minimum number of reads aligned to a base for it to be included in percentage coverage?

ADD REPLY
1
Entering edit mode
6.7 years ago

I would like to know what percent of bases in each exon is covered, and for the definition of coverage i would like to give a user defined number say n, then a base in an exon is covered if there are at least n reads supporting

You can write a simple python script for this. BUT you can also do it with bedtools alone.

a. First, get the bed file of exons using the dexseq_prepare_annotation.py script.

b. Then get the coverage for each base in a bedgraph format using -d and -split from your BAM( As its RNA-Seq )

c. Now intersect the BedGraph with exon BED file and use awk to filter for depth of coverage. So you will end up with the bases that overlaps with exons with x coverage.

d. To calculate what percent of each exon is covered, you can simply use groupBy of bedtools, groupBy exon coordinates and count number of bases covered. With this you will get to know what percent of each exon is covered by x number of reads and later on you can even groupBy gene ID.

For the remaining reads not aligning to exons, I would like to find continous regions of coverage i.e. a 200 bp

For this, simply substract the bedGraph with the exons ( using -v and without -d ) which gives all the continuous regions not overlapping the exons and you can use a length filter using awk.

Its a bit tricky but it should work perfectly.

ADD COMMENT
0
Entering edit mode

Thanks a lot I have already came up with a similar solution so I guess your post is really helpful. Thanks for the tip of subtraction I did not know that before.

ADD REPLY

Login before adding your answer.

Traffic: 1967 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6