Calculation of gene coverage from BAM file
4
0
Entering edit mode
2.5 years ago

Hello, I am looking for a calculation of gene coverage from the BAM file or any other file used in whole-exome data analysis. Please tell me if anyone knows about this. Thanks in advance.

WES • 2.8k views
ADD COMMENT
1
Entering edit mode
2.5 years ago

One way, via bam2bed and bedmap:

$ bam2bed --reduced < reads.bam | bedmap --echo --count --delim '\t' genes.bed - > answer.bed

The count of reads overlapping a gene by one or more bases will be in the last column of answer.bed.

To generate genes.bed, this will depend on your assembly and how you name chromosomes. Gencode is one source for human and mouse gene annotations in GFF format, which can be converted to BED via gff2bed, e.g. for hg38:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | gff2bed - \
    > genes.bed
ADD COMMENT
0
Entering edit mode

I got this answer:

chr1    11868   14409   ENSG00000223972.5       .       +       HAVANA  gene    .       ID=ENSG00000223972.5;gene_id=ENSG00000223972.5_1;gene_type=transcribed_unprocessed_pseudogene;gene_name=DDX11L1;level=2;hgnc_id=HGNC:37102;havana_gene=OTTHUMG00000000961.2_1;remap_status=full_contig;remap_num_mappings=1;remap_target_status=overlap 412

Please tell me if 412 is the gene coverage.

ADD REPLY
0
Entering edit mode

412 reads overlap that element.

ADD REPLY
0
Entering edit mode

So this is variant supporting reads, not gene coverage, right?

ADD REPLY
0
Entering edit mode

what is your definition of "gene coverage"?

coverage typically means multiple features (say reads) covering a gene, the count is how many of the features covered the interval

ADD REPLY
0
Entering edit mode

In case you do not only want to count reads overlapping genes, I added a second answer that shows how to calculate additional statistics.

If you do not want coverage of reads over genes, please edit your question so that others may be able to help.

ADD REPLY
0
Entering edit mode
2.5 years ago

use a tool like featureCounts:

featureCounts: an efficient general purpose program for assigning sequence reads to genomic features

ADD COMMENT
0
Entering edit mode
2.5 years ago

If you want more statistics than just counts of reads that overlap gene annotations, perhaps the following answer will help.

Starting with a generic BAM file, it can be converted to BED via bam2bed.

The 13th column (QUAL) can be converted from its printable characters back to an average Phred quality score, and then swapped with the fifth (score) column, to following the UCSC specification for BED files.

This numerical value can subsequently be used for mapping, applying score statistical functions with bedmap (shown further below):

$ bam2bed < reads.bam | awk -v FS="\t" -v OFS="\t" 'BEGIN{ for (n=0; n<256; n++) ord[sprintf("%c",n)]=n }{ old=$5; l=split($13,q,""); s=0; for(n=1; n<=l; n++) { s+=ord[q[n]]-33; } $5=s/l; $13=old; print $0; }' > reads.bed

You can start with genes as described in another answer in this thread, which are in a file called genes.bed.

Now you can map your BED file containing genes-of-interest to the reads from the BAM file:

$ bedmap --skip-unmapped --delim '\t' --echo --count --bases-uniq --echo-ref-size --bases-uniq-f --mean genes.bed reads.bed > answer.bed

The file answer.bed will contain summary statistics for reads mapped to genes and look something like this, unique to the makeup of your genes and reads:

$ head answer.bed
chr1    199064  200374  ... 404 1310    1310    1.000000    25
chr1    6853080 6870040 ... 204 9044    16960   0.533255    20
chr1    8004008 8027135 ... 507 22308   23127   0.964587    20
chr1    8061658 8098833 ... 366 37175   37175   1.000000    25
chr1    8174215 8211962 ... 369 34253   37747   0.907436    35
chr1    8873384 8890703 ... 434 17319   17319   1.000000    50
chr1    8874708 8890922 ... 420 16214   16214   1.000000    45
chr1    15833039    15850691    ... 271 17652   17652   1.000000    50
chr1    16643023    16645543    ... 266 2520    2520    1.000000    45
chr1    16665212    16667822    ... 281 2610    2610    1.000000    30

The first three columns are the intervals defined by genes. These and metadata columns come from using --echo.

Columns in between will be metadata about the gene (symbol, strand, etc.) and depend on the source of gene annotations.

The last five columns come from the remaining options (--count, --bases-uniq, etc.) and are as follows:

  • The number of reads in the BAM file that overlap an interval in the BED file by one or more bases
  • The number of bases in the interval in the BED file that have coverage by reads from the BAM file
  • The length of the interval in the BED file
  • The fraction of bases in the interval in the BED file that have coverage by reads from the BAM file
  • The (arithmetic) mean of Phred quality scores of reads over the region

The order of the specified bedmap operands --count, --bases-uniq, --echo-ref-size, --bases-uniq-f, --mean write the aforementioned five column values in that same order.

Other operations are available and can be added to the bedmap command, if useful. Additionally, if one base of read coverage is too lenient, additional operands are available to further constrain the overlap criterion.

Run bedmap --help and take a look at the "Overlap Options" and other sections, or take a look at the online documentation for a more complete walkthrough.

ADD COMMENT
0
Entering edit mode
2.5 years ago
igor 13k

Have you checked some of the previous related discussions?

ADD COMMENT

Login before adding your answer.

Traffic: 2058 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6