How to calculate average coverage for all genes
3
1
Entering edit mode
5.4 years ago
Whoknows ▴ 880

Hi friends

I want to calculate average of read count / coverage for all genes, I mean creating a list which shows number of aligned reads for each location.

For better understanding, making a proportion for all genes location and an average list of read count for the locations like :

Location = Upsrtream       0-200         200-400       400-600   ...   Downstream

Average read counts=       20reads      15reads       32reads ....

How can I make this list from bam or wig file?

wig bam • 5.1k views
ADD COMMENT
0
Entering edit mode

you can get the read count for each coordinate from bam using samtools view option.

samtools view bamfile chr10:3100260-3110000 | wc -l

this u can loop in command line for all the genes u have

ADD REPLY
0
Entering edit mode

or just samtools view -c bamfile chr10:3100260-3110000 but your solution just count the reads, it doesn't consider the deletions, the clipping, etc...

ADD REPLY
3
Entering edit mode
5.4 years ago
Carlos Borroto ★ 2.0k

I recently found myself looking for the exact info. I too decided to go with bedtools. This is my final one liner assuming a pre-sorted bam file.

bedtools coverage -sorted -d -g human_g1k_v37.genome -a genes.bed -b my_sorted.bam \
  | sort -k4,4 \
  | bedtools groupby -g 4 -c 8 -o mean \
  | sort -k1,1

Genome file:

$ cat human_g1k_v37.genome
1   249250621
2   243199373
3   198022430
4   191154276
5   180915260
6   171115067
7   159138663
8   146364022
9   141213431
10  135534747
11  135006516
12  133851895
13  115169878
14  107349540
15  102531392
16  90354753
17  81195210
18  78077248
19  59128983
20  63025520
21  48129895
22  51304566
X   155270560
Y   59373566
MT  16569

Interval file:

$ head genes.bed
#chrom  start   end name    score   strand
1   2337184 2337293 PEX10   .   -
1   2337902 2338078 PEX10   .   -
1   2338138 2338414 PEX10   .   -
1   2339870 2340317 PEX10   .   -
1   2341789 2341910 PEX10   .   -
1   2342177 2342327 PEX10   .   -

Ouput:

ABCA12  193.5
ABCA3   195.35
ABCB1   12.67
ABCC8   176.47
ACADL   226.97
ACOX1   178.44
ADA 149.33
ADAMTS13    187.34
ADGRV1  196.86
AFF2    163.39
ADD COMMENT
0
Entering edit mode

Thank you for your comment. I tried to do same thing with you but I get out of order record error.

Do you know how can I fix it?

ADD REPLY
0
Entering edit mode

Did you start with a sorted BAM file?

ADD REPLY
0
Entering edit mode

Hi again,\ I use different way to find gene base coverages

  • bedtools bamtobed -i input_bam | bedtools coverage -mean -b - -a input_bed > output.file \ it gave me mean depth for each region that included in bed file.

Then I calculated length for each region in the bedtools output. After that I grouped file according to gene and obtained each genes total length and I divided depth by related genes length then I summed each value according to related genes that I obtain. While it sounds complicated, it's actually easy.

I hope this help guys who want to calculate gene-based coverage.

Thanks.

ADD REPLY
2
Entering edit mode
5.4 years ago

I wrote https://github.com/lindenb/jvarkit/wiki/BamStats05 ; It needs a BED and a BAM; The output looks like

#chrom  start   end gene length  mincov  maxcov  mean    nocoveragepb    percentcovered
ADD COMMENT
0
Entering edit mode

Is there any tool in galaxy for making such list??

ADD REPLY
1
Entering edit mode

Try NGS: RNA Analysis > htseq-count. The SAMtools and BEDtool items mentioned in this thread area also wrapped for Galaxy.

ADD REPLY
0
Entering edit mode

Thanks @pierre. Would you please share run command for this list?

ADD REPLY
1
Entering edit mode

The link is in @Pierre's answer above. Web page it leads to describes how to use that program.

If you need to use a galaxy based solution then the bedtools answer below would be appropriate.

ADD REPLY
1
Entering edit mode
5.4 years ago

How about using bedtools?

First, sort the bamfile:

samtools sort bamfile Output_sorted

Then run bedtools:

bedtools genomecov -ibam Output_sorted.bam

This will print something like (taken from the manual):

chr1  554304  554309  5
chr1  554309  554313  6
chr1  554313  554314  1
chr1  554315  554316  6
chr1  554316  554317  5
chr1  554317  554318  1
chr1  554318  554319  2
chr1  554319  554321  6
chr1  554321  554323  1
chr1  554323  554334  7

So from 554304 to 554309 there are 5 reads aligning etc.

If you have a second bed file of gene coordinates you can then report the intersection between both files with bedtools intersect

ADD COMMENT

Login before adding your answer.

Traffic: 3167 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6