Question

TPM normalization result

0

20 months ago

JACKY ▴ 140

I have counts data. I need to run software in R that accepts only normalized data. I normalized to TPM with this code:

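# RPKM: scale counts by gene length (bp) and by each sample's total count
rpkm <- apply(X = raw1,
              MARGIN = 2,
              FUN = function(x) {
                10^9 * x / geneLengths / sum(as.numeric(x))
              })

# TPM: rescale each sample's RPKM values so they sum to one million
TPM1 <- as.data.frame(apply(rpkm, 2, function(x) x / sum(as.numeric(x)) * 10^6))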

However, check out the results. This is the expression of CD274 in both counts data (raw1) and the normalized TPM data:

> raw1['CD274',]
       Pt1 Pt10 Pt101 Pt103 Pt106 Pt11 Pt17 Pt18  Pt2 Pt24 Pt26 Pt27 Pt28 Pt29 Pt3 Pt30 Pt31 Pt34 Pt36 Pt37 Pt38 Pt39 Pt4 Pt44 Pt46 Pt47
CD274 1484  290  1421   251   203  888  608 1203 1340 1021  182  170  291  401 140  117  582 1177  191  152  111   24 187  705 1122  694
      Pt48 Pt49 Pt5 Pt52 Pt59 Pt62 Pt65 Pt66 Pt67 Pt72 Pt77 Pt78 Pt79 Pt8 Pt82 Pt84 Pt85 Pt89 Pt9 Pt90 Pt92 Pt94 Pt98
CD274  224 1122 501  268 1277  270  705  276   88  157 2564   25  251 255  484   96   37  180 169  949 1477  128  321
> TPM1['CD274',]
          Pt1     Pt10    Pt101    Pt103   Pt106     Pt11     Pt17    Pt18      Pt2     Pt24     Pt26     Pt27     Pt28    Pt29      Pt3
CD274 35.0266 4.280535 28.67831 3.449004 4.33621 19.67596 13.56328 25.0671 34.08708 17.27277 4.702501 4.485883 7.244041 8.91973 3.374477
          Pt30     Pt31     Pt34     Pt36     Pt37     Pt38      Pt39     Pt4     Pt44     Pt46     Pt47     Pt48     Pt49      Pt5
CD274 3.103927 13.60881 20.59666 4.317299 3.056179 2.427168 0.5931633 3.93912 15.99866 15.20747 17.17709 4.543694 21.67822 13.52313
          Pt52     Pt59     Pt62     Pt65     Pt66     Pt67     Pt72     Pt77      Pt78     Pt79      Pt8     Pt82    Pt84     Pt85
CD274 6.388511 27.02948 6.314665 13.77411 7.229454 1.796893 4.571327 56.26996 0.5742323 6.085954 5.998064 12.66232 2.63342 0.785415
          Pt89      Pt9     Pt90     Pt92     Pt94     Pt98
CD274 3.325707 4.490217 14.52295 40.60652 3.159773 7.993074

Something doesn't make sense. Look at Pt103 and Pt106 in both of them. Pt103 has the higher count in raw1, but Pt106 has the higher value in TPM1. How could this be? Is my normalization wrong, or could this be due to gene length?

normalization r TPM • 593 views

ADD COMMENT • link updated 20 months ago by LChart 3.9k • written 20 months ago by JACKY ▴ 140

score 0 · Answer 1 · 2022-08-31

0

20 months ago

LChart 3.9k

sum(as.numeric(x)) adds up the total number of fragments in each sample. Pt103 almost certainly has more fragments overall than Pt106, so the proportion of fragments originating from CD274 is higher in Pt106.

This is actually why you're performing normalization.
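To make this concrete, here is a small sketch with made-up numbers (the gene names, sample names A and B, and all counts are hypothetical, and all genes are assumed to have the same length). Sample B has the higher raw count for geneX, but because its library is five times larger, its TPM for geneX comes out lower:

```r
# Toy data (hypothetical): 3 genes x 2 samples; B has a much larger library than A
counts <- matrix(c(200,  800, 1000,    # sample A
                   250, 3750, 6000),   # sample B
                 nrow = 3,
                 dimnames = list(c("geneX", "geneY", "geneZ"), c("A", "B")))
gene_lengths <- c(geneX = 1000, geneY = 1000, geneZ = 1000)  # bp, assumed equal

# Length-normalise each gene, then rescale each sample to sum to one million
rate <- counts / gene_lengths
tpm  <- t(t(rate) / colSums(rate)) * 1e6

colSums(counts)     # library sizes: A = 2000, B = 10000
counts["geneX", ]   # B has the higher raw count
tpm["geneX", ]      # A has the higher TPM
```

On the real data, and assuming raw1 and geneLengths are as in the question, comparing sum(raw1[, "Pt103"] / geneLengths) with sum(raw1[, "Pt106"] / geneLengths) should show which sample has the larger denominator.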

ADD COMMENT • link 20 months ago by LChart 3.9k

0

LChart I see, now I get it. I thought the whole process was faulty because of this and had to make sure.

Thank you!

ADD REPLY • link 20 months ago by JACKY ▴ 140

1

The way to verify this is to look at the lengths of the two transcripts.

Length is the only normalization factor when comparing transcripts within the same sample. The raw_count/length ratios ought to show the same behavior.
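A rough illustration of the within-sample case, using invented counts and lengths for two hypothetical transcripts (tx1, tx2) in one sample. The per-sample scaling factor is shared by both, so the count/length ratios and the TPM-style values rank the transcripts the same way:

```r
# Hypothetical counts and lengths for two transcripts in a single sample
counts     <- c(tx1 = 900,  tx2 = 600)
lengths_bp <- c(tx1 = 3000, tx2 = 1000)

rate     <- counts / lengths_bp      # reads per base
tpm_like <- rate / sum(rate) * 1e6   # TPM if these were the only transcripts

rate
tpm_like
# tx1 has more reads, but tx2 ranks higher in both 'rate' and 'tpm_like'
```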

ADD REPLY • link 20 months ago by Istvan Albert 100k