Question

Correlation Of Fpkm And Length Normalized Transcript Mapped Read Count

3

Entering edit mode

11.7 years ago

johannes.helmuth ▴ 110

Hello,

in the process of estimating expression for a 16 human tissue dataset ("Human Body Map 2.0 GSE30611") I used different methods to estimate the expression of the genes. After mapping against hg19 genome version, I used the UCSC provided refseq annotation for hg19 to count mapped reads for ~40,000 human genes in two ways:

Counting with cufflinks outputs a Fragments Per Kilobase Per Million mapped fragments value (FPKM) for each transcript. The FPKM value basically accounts for library size and also the length of the transcript comprising all the annotated exons + some additional likelihood estimator to assign reads (see here).
Counting mapped reads with bedtools and divide a transcript's mapped count by the sum of all the exon lengths. This gained a length normalized expression estimate to compare between genes.

However, the correlation of (1.) and (2.) is always around ~0.65 between same tissues (technically the same experiment). I would expect this correlation to be > 0.9.

Below, I plotted (2.) against (1.) for all ~40,000 transcripts. It seems like normal length normalization is simply overestimating some expression.

Can someone shed light on this? Is this due to the library normalization? If I calculate correlation for the outputs of the two different methods on the same experiement, I think this should outrule this normalization, or not?

length normalized read count ~ FPKM plotted in R

    smoothScatter( log10(fpkm + 1), log10(bedtools + 1), colramp=tim.colors, xlab="FPKM", ylab="Bedtools with length norm")

cufflinks next-gen bedtools • 7.6k views

ADD COMMENT • link updated 5.6 years ago by Biostar 20 • written 11.7 years ago by johannes.helmuth ▴ 110

0

Entering edit mode

when you do the bedtools analysis, what is the input alignment file?

ADD REPLY • link 11.7 years ago by Istvan Albert 100k

0

Entering edit mode

@Istvan Albert: I use the -abam flag to provide the accepted_hits.bam file from the tophat output

ADD REPLY • link 11.7 years ago by johannes.helmuth ▴ 110

score 2 · Answer 1 · 2012-08-20

Two possibilities occur to me:

1. Cufflinks attempts to correct for sequence bias and other effects, while the simple BED region read counting will not.

A very rough example is something like:

Gene read count = Gene length * (FPKM + f(GC content) + g(sequence) + ...)

The gene read count (from BEDtools) includes many observed effects, including the "true" expression value (estimated by Cufflinks). When you compare Gene read count / Gene length to FPKM, the correlation is made worse due to the noise from the extra terms.

2. I'd also be concerned about your procedure to divide "by the sum of all the exon lengths." Unless you are doing a lot more than you say, this will never be exactly correct. Cufflinks is estimating the exon usage pattern and will report FPKMs based on its solution to that optimization problem.

Consider this example: three exons of 1000 bp each, but one exon is rarely or never used. Cufflinks verifies this from your data and normalizes by 2000 bp (two exons used). Your BEDtools script sees all the exons, doesn't check if they're used or not, and normalizes by 3000 bp. This makes the BEDtools number underestimate the true value (and Cufflinks's estimate), which your plot does seem to suggest.

score 0 · Answer 2 · 2012-08-20

0

Entering edit mode

11.7 years ago

JC 13k

Nice plot, but your read counting is not really normalized, you need to reduce the bias in mapping, GC content and proportions. Cufflinks already implement all of them. Maybe the read counting can be improved using R:bioconductor edgeR or DESeq packages.

ADD COMMENT • link 11.7 years ago by JC 13k