Question: Count matrix from bigwig files - Application
gravatar for pyKey
6 weeks ago by
pyKey50 wrote:

Hi All,

Does anyone know a software that can calculate gene count matrix from a bigwig file? Preferred Python.


ADD COMMENTlink modified 29 days ago • written 6 weeks ago by pyKey50

ATpoint raised important issues here and I tend to agree with him. That being said, deeptools multiBigwigSummary used with the options --outRawCounts and --BED (with a GTF file) can summarize read counts on exons from bigwig files.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by Carlo Yague4.8k

By best knowledge you cannot reliable do this. The bigwig format is a per-base or per-window format that summarizes coverage over these bases or windows but it is not suited to give exact counts over e.g. exons etc. That information you seek is lost.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by ATpoint26k

Not sure if it is a good idea either, nevertheless, some people seem to do that Here . Just that, there is no code published.

ADD REPLYlink modified 6 weeks ago • written 6 weeks ago by pyKey50

I would never do this. Maybe if it was my own data, I was 100% sure how the bigwig was created and I had absolutely no other choice because the raw data were gone and I urgently needed that information, but else, no never. If you are going to do it anyway, treat whatever results you get with care and try to validate them with independent approaches.

ADD REPLYlink written 6 weeks ago by ATpoint26k

Thank you a lot for this hint. I would do as you suggested.

ADD REPLYlink written 6 weeks ago by pyKey50

Hi All,

So I ran multiBigwigSummary on the bigwig file. As follows:

  1. Prepared a BED file which contains just "gene" entries from the original gff file.
  2. Ran the above given this BED file.
  3. Got an outout BED with "gene" coordinates and a count column.
  4. Mapped the coordinates of each row to a single gene IDs from the gff file.

Then I checked the correlation between these counts and original expression data and it was around 0.7.

As a sanity check I also ran the same pipeline on a bigwig for a known file with already calculated gene expression levels. The correlation is around the same!

I wonder if there is a more reliable way to get gene expression from a BED file. Or is this information loss inherent?

Thank you,

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by pyKey50

Hi, Here are some thoughts about your problem:

With only the gene lines of a gtf, you'll also get also the introns. These are usually ignored while computing gene expression.

I don't know how deeptools summary works, but most gene counting tools rely only on uniquely mapping reads and discard ambiguous assignments. I guess it's hard to assess the ambiguity based on coverage rather than assigning a read by the portion of overlap to a certain feature.

BED files are generally not complex enough to model the multi-exon multi-transcript genes. You can convert a GTF/GFF file into BED12 format but loose the transcript-gene relationship.

ADD REPLYlink written 5 weeks ago by michael.ante3.5k
gravatar for pyKey
29 days ago by
pyKey50 wrote:

Update. Hope this helps. After looking more closely into multiBigwigSummary: read counts in each region are averaged over the number of bins. Meaning that the end score is dependent on the gene length. After normalizing gene counts from the original bam by length, now the correlation score is 0.95.

I agree that in most cases focusing on gene boundaries for getting BED scores is not the best approach. In Spombe yeast though, many genes are signle-exon which makes an acceptable "approximation" of expression. Of course, finding the exact expression values needs some workaround.


ADD COMMENTlink written 29 days ago by pyKey50
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1945 users visited in the last hour