Question: Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count
gravatar for johannes.helmuth
8.3 years ago by
Berlin, Germany
johannes.helmuth110 wrote:


in the process of estimating expression for a 16 human tissue dataset ("Human Body Map 2.0 GSE30611") I used different methods to estimate the expression of the genes. After mapping against hg19 genome version, I used the UCSC provided refseq annotation for hg19 to count mapped reads for ~40,000 human genes in two ways:

  1. Counting with cufflinks outputs a Fragments Per Kilobase Per Million mapped fragments value (FPKM) for each transcript. The FPKM value basically accounts for library size and also the length of the transcript comprising all the annotated exons + some additional likelihood estimator to assign reads (see here).

  2. Counting mapped reads with bedtools and divide a transcript's mapped count by the sum of all the exon lengths. This gained a length normalized expression estimate to compare between genes.

However, the correlation of (1.) and (2.) is always around ~0.65 between same tissues (technically the same experiment). I would expect this correlation to be > 0.9.

Below, I plotted (2.) against (1.) for all ~40,000 transcripts. It seems like normal length normalization is simply overestimating some expression.

Can someone shed light on this? Is this due to the library normalization? If I calculate correlation for the outputs of the two different methods on the same experiement, I think this should outrule this normalization, or not?

length normalized read count ~ FPKM plotted in R

    smoothScatter( log10(fpkm + 1), log10(bedtools + 1), colramp=tim.colors, xlab="FPKM", ylab="Bedtools with length norm")
bedtools cufflinks next-gen • 6.5k views
ADD COMMENTlink modified 2.2 years ago by Biostar ♦♦ 20 • written 8.3 years ago by johannes.helmuth110

when you do the bedtools analysis, what is the input alignment file?

ADD REPLYlink written 8.3 years ago by Istvan Albert ♦♦ 85k

@Istvan Albert: I use the -abam flag to provide the accepted_hits.bam file from the tophat output

ADD REPLYlink written 8.3 years ago by johannes.helmuth110
gravatar for matted
8.3 years ago by
Boston, United States
matted7.3k wrote:

Two possibilities occur to me:

1. Cufflinks attempts to correct for sequence bias and other effects, while the simple BED region read counting will not.

A very rough example is something like:

Gene read count = Gene length * (FPKM + f(GC content) + g(sequence) + ...)

The gene read count (from BEDtools) includes many observed effects, including the "true" expression value (estimated by Cufflinks). When you compare Gene read count / Gene length to FPKM, the correlation is made worse due to the noise from the extra terms.

2. I'd also be concerned about your procedure to divide "by the sum of all the exon lengths." Unless you are doing a lot more than you say, this will never be exactly correct. Cufflinks is estimating the exon usage pattern and will report FPKMs based on its solution to that optimization problem.

Consider this example: three exons of 1000 bp each, but one exon is rarely or never used. Cufflinks verifies this from your data and normalizes by 2000 bp (two exons used). Your BEDtools script sees all the exons, doesn't check if they're used or not, and normalizes by 3000 bp. This makes the BEDtools number underestimate the true value (and Cufflinks's estimate), which your plot does seem to suggest.

ADD COMMENTlink written 8.3 years ago by matted7.3k

Thanks for this thorough answer.

  1. I thought this has an effect on the expression estimate in cufflinks. Nevertheless I did not expect this disrupting the correlation so much.

  2. When using RefSeq, all transcripts of a gene a annotated and you are right that dividing by all exon lengths is not correct in some cases. If, like you said, one exon of a transcript is rarely used, this transcript may not be present. Instead another transcript comprising these exons may be the one of interest and contributing to the observed read counts.

ADD REPLYlink written 8.3 years ago by johannes.helmuth110
gravatar for JC
8.3 years ago by
JC12k wrote:

Nice plot, but your read counting is not really normalized, you need to reduce the bias in mapping, GC content and proportions. Cufflinks already implement all of them. Maybe the read counting can be improved using R:bioconductor edgeR or DESeq packages.

ADD COMMENTlink written 8.3 years ago by JC12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1621 users visited in the last hour