Question

TPM-like normalization of time series RNAseq data to minimize dispersion of counts within replicates

0

4.0 years ago

AnonymousEngineer • 0

Hi everyone,

I am working with RNAseq reads from a plant/pathogen culture. The plants are infected with a very small inoculum of pathogen. The amount of inoculum successfully delivered to each plant differs, and each plant has a different cell mass. At different time points post infection, RNA is extracted and the reads are mapped to the sequences of both organisms.

At all time points, the number of RNA fragments mapped to the pathogen was very low compared with the number mapped to the plant. Prior to inoculation the pathogen is dormant, and most genes have zero reads in the early time points. As the plant infection progresses over time, the number of pathogen cells within the plants increases and the read counts start to increase. Here is the image of the mapped/assigned fragments of both organisms. As the table in the lower part of the image shows, the number of fragments (and reads) mapped to the pathogen is very low relative to the plant, especially in the early time points. Additionally, the number of assigned pathogen fragments varies between replicates.

Here, I am trying to (1) minimize the dispersion of each gene's reads within the replicates of each time point; and (2) capture the temporal progress of infection by the pathogen. I am doing a custom TPM-like normalization in which, instead of scaling by 10^6, I scale by the maximum number of assigned fragments among the 3 replicates of each time point. TPM (and FPKM) inflates the nonzero counts in the early time points because most genes have zero reads. With the custom normalization, the dispersion of reads between replicates does decrease.

RPK = raw counts / gene length in kb

TPM = RPK * 10^6 / (sum of all RPK values)

custom = RPK * (maximum assigned fragments of 3 replicates for each time point) / (sum of all RPK values)
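To make the three formulas above concrete, here is a minimal Python sketch. The gene lengths, counts, and replicate names are made-up toy values, and the "assigned fragments" of a replicate is assumed here to be simply the sum of its gene counts; substitute the real assigned-fragment totals from the mapping summary.

```python
def rpk(counts, lengths_bp):
    """Reads per kilobase: raw count divided by gene length in kb."""
    return [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]


def scale_rpk(rpk_values, scale):
    """Rescale RPK values so that they sum to `scale` within one sample."""
    total = sum(rpk_values)
    if total == 0:
        # A sample with no reads at all cannot be rescaled.
        return [0.0] * len(rpk_values)
    return [v * scale / total for v in rpk_values]


def tpm(counts, lengths_bp):
    """Standard TPM: RPK rescaled to sum to 10^6."""
    return scale_rpk(rpk(counts, lengths_bp), 1e6)


def custom_norm(counts, lengths_bp, max_assigned):
    """Custom variant: RPK rescaled to sum to max_assigned instead of 10^6."""
    return scale_rpk(rpk(counts, lengths_bp), max_assigned)


# Hypothetical toy data: 3 genes, 3 replicates of one time point.
lengths_bp = [1000, 2000, 500]
replicates = {
    "rep1": [10, 0, 5],
    "rep2": [40, 4, 10],
    "rep3": [20, 0, 10],
}

# Assumption: assigned fragments per replicate = sum of its gene counts.
assigned = {name: sum(counts) for name, counts in replicates.items()}
max_assigned = max(assigned.values())

tpm_values = {n: tpm(c, lengths_bp) for n, c in replicates.items()}
custom_values = {n: custom_norm(c, lengths_bp, max_assigned)
                 for n, c in replicates.items()}
```

Note that within one sample the custom value is just TPM multiplied by the constant `max_assigned / 10^6`, so the relative abundances of genes inside a sample are the same as with TPM; only the overall scale changes from one time point to the next.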

This Read Counts table shows the raw/TPM/custom counts of the genes.

Am I doing this right? Is there a better way of doing it?

Thank you in advance!

Anby

RNA-Seq TPM normalization time-series • 1.1k views

1

Why do you want to minimize the dispersion? Shouldn't you be using that dispersion to differentiate between biological variability and the effects of the pathogen? Use your raw counts in DESeq2 and it will use those dispersion estimates.

ADD REPLY • link 4.0 years ago by ashish ▴ 680

0

I am not sure how DESeq2 fits in. The reads shown here are from the pathogen genes, and I cannot do a differential expression analysis of the pathogen because the pathogen reads in uninfected control plants are all zero.

Once the pathogen starts to replicate, the pathogen reads start to increase. However, the extent of infection in each plant depends on the initial inoculum that gets delivered, and we have no control over it. The biological replicates are expected to show similar trends of gene expression because they are at the same stage of the life cycle. Hence I want to normalize the reads so that the dispersion of a gene's reads within the 3 replicates is minimized.
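One way to put a number on "dispersion within 3 replicates" is the coefficient of variation (standard deviation divided by mean) of each gene's normalized values. This is only a sketch: the choice of CV as the metric is an assumption, and the gene names and values below are made up.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean; None if the mean is 0."""
    m = mean(values)
    if m == 0:
        # Genes with zero reads in every replicate have no defined CV.
        return None
    return stdev(values) / m


# Hypothetical normalized values for two genes across 3 replicates.
gene_values = {
    "geneA": [27.0, 30.0, 24.0],
    "geneB": [0.0, 0.0, 0.0],
}

cv = {gene: coefficient_of_variation(v) for gene, v in gene_values.items()}
```

Comparing the per-gene CV under raw, TPM, and custom values would show whether a given scaling actually tightens the replicates, keeping in mind that genes that are zero everywhere have to be left out.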

Anby

ADD REPLY • link 4.0 years ago by AnonymousEngineer • 0

0

DESeq2 and edgeR control for the fact that changes in biomass occur. If it were me, I would run edgeR or DESeq2 separately on each of the two organisms to obtain normalized counts (taking into account changes in total library size and normalization factors) and to determine which genes were changing in expression. This should account for the fact that the total number of organisms increased. However, both tools assume that most genes in your organism are NOT differentially expressed. If you think that assumption is violated, then perhaps this manual solution is your best option.
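For intuition about what "normalization factors" means here, below is a simplified Python sketch of the median-of-ratios idea behind DESeq2's size factors. It is not the DESeq2 API, just an illustration on a made-up count matrix; the function name `size_factors` is hypothetical.

```python
import math
from statistics import median


def size_factors(count_matrix):
    """Median-of-ratios size factors.

    count_matrix: list of genes, each a list of raw counts per sample.
    Genes with a zero in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(count_matrix[0])
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in count_matrix:
        if any(c <= 0 for c in gene_counts):
            continue
        # Log of the geometric mean of this gene across samples.
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    if not log_ratios[0]:
        raise ValueError("no gene has nonzero counts in every sample")
    return [math.exp(median(r)) for r in log_ratios]


# Hypothetical toy counts: 4 genes x 3 samples; sample 2 and 3 are
# sequenced 2x and 4x as deeply as sample 1.
counts = [
    [10, 20, 40],
    [5, 10, 20],
    [0, 3, 6],      # skipped: contains a zero
    [100, 200, 400],
]

factors = size_factors(counts)
normalized = [[c / f for c, f in zip(gene, factors)] for gene in counts]
```

Because genes with any zero are skipped, a pathogen data set where most genes are zero at the early time points leaves very few genes to estimate the factors from, which is worth checking before relying on this kind of normalization.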

ADD REPLY • link 4.0 years ago by N15 ▴ 160