Question

Limma - normalizing gene expression data from multiple SRA experiments

8 months ago

peter.polacek • 0

I have selected a number of SRA RNA-seq runs, each from, say, the root, leaf or flower tissue of a certain species. They come from different studies that used different platforms and library sources and targeted different cultivars under different experimental conditions. I would like to quantify the expression of all genes in each run and normalize it, so that I can look at the overall expression of a given gene in the root, in the leaf, etc.

I concluded that limma, with a correction for batch effects, would be the best option. Is this a reasonable approach at all, or are there too many variables for it ever to be usable? If it is possible, are there any considerations for specifying this design in the model.matrix? Thanks for any advice.

edgeR Limma RNAseq SRA • 569 views

ADD COMMENT • link 8 months ago by peter.polacek • 0

Is this a reasonable approach at all or are there too many variables to ever be usable?

From what you have described, it sounds like the potential sources of batch effects are considerably confounded with the different samples, but it's difficult to answer your questions without more details. Can you provide a sample metadata table with all this information?
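A quick way to see how bad the confounding is would be to tabulate which tissues each study actually sampled. The following is only a minimal sketch in plain Python; the run IDs, study names and tissues are made up for illustration and the function names are hypothetical. A study that sampled a single tissue offers no within-study contrast, so for those runs the study effect and the tissue effect cannot be told apart.

```python
from collections import defaultdict

# Hypothetical run metadata: (run_id, study, tissue). All values are invented.
runs = [
    ("run1", "studyA", "root"),
    ("run2", "studyA", "leaf"),
    ("run3", "studyB", "leaf"),
    ("run4", "studyB", "leaf"),
    ("run5", "studyC", "flower"),
]


def tissues_per_study(records):
    """Map each study to the set of tissues it sampled."""
    table = defaultdict(set)
    for _run_id, study, tissue in records:
        table[study].add(tissue)
    return dict(table)


def confounded_studies(records):
    """Return studies with a single tissue, where batch and tissue coincide."""
    table = tissues_per_study(records)
    return sorted(study for study, tissues in table.items() if len(tissues) == 1)


print(confounded_studies(runs))
```

If most studies end up in that list, a batch term in the model has very little to work with.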

ADD REPLY • link 8 months ago by jv ★ 1.8k

Hi jv, I can't provide the entire metadata table, because the idea is to use the entire collection of RNA-seq experiments for a given organism (Capsicum annuum - pepper). At the moment I'm working with a set of around 1,800 individual runs. The set will be used to construct a gene co-expression network, but I would also like to summarize it into per-tissue and per-developmental-stage gene expression. There are a great many combinations of sample variables.

I am thinking of reducing the number of tissues included, let's say to five, and perhaps 10 developmental stages (four for vegetative tissues and six for fruit tissues). For the cultivar information, I would prefer to disregard it, as almost every experiment uses a different cultivar. Is this reasonable?

For the platform variable, I have data on the method (Illumina, 454, PromethION etc.) but also on the specific instrument version (Illumina NovaSeq 6000, Illumina HiSeq 4000, etc.). I would prefer to include this information in the processing, since it can cause confounding. Would it be reasonable to simplify it to the method and disregard the instrument version? This would leave some 4 or 5 different categories. I would also include library selection information, with some 7 categories.

Alternatively, the simplest approach would be to calculate within-experiment differential expression between different tissues and developmental stages, and work with that kind of data.
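To make that concrete, here is a rough sketch of the idea for a single gene, in plain Python. The studies, tissues and values are invented, the function name is hypothetical, and it assumes strictly positive, already normalized expression values. It only computes a log2 ratio of group means within each study and skips studies that lack one of the two tissues; it is not a substitute for a proper differential expression test.

```python
import math
from collections import defaultdict
from statistics import mean

# Hypothetical normalized expression of one gene: (study, tissue, value).
observations = [
    ("studyA", "root", 8.0),
    ("studyA", "root", 8.0),
    ("studyA", "leaf", 2.0),
    ("studyB", "root", 40.0),
    ("studyB", "leaf", 10.0),
    ("studyC", "leaf", 5.0),  # no root runs, so no contrast is possible here
]


def within_study_log2fc(records, numerator, denominator):
    """Per-study log2 ratio of mean expression between two tissues."""
    by_study = defaultdict(lambda: defaultdict(list))
    for study, tissue, value in records:
        by_study[study][tissue].append(value)

    result = {}
    for study, tissues in by_study.items():
        # Only studies that sampled both tissues give a within-study contrast.
        if numerator in tissues and denominator in tissues:
            ratio = mean(tissues[numerator]) / mean(tissues[denominator])
            result[study] = math.log2(ratio)
    return result


print(within_study_log2fc(observations, "root", "leaf"))
```

Because each ratio is formed inside one study, study-level differences in platform or library preparation largely cancel out, at the cost of discarding studies with only one tissue.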

ADD REPLY • link 8 months ago by peter.polacek • 0

What you are proposing is not trivial. There are numerous sources of batch effects when trying to combine such disparate sources of data; one good example is TCGA data. I recommend searching for "batch effect TCGA" on PubMed to get a sense of what others have observed and tried. One method I've looked into, but was not able to apply in my own work, comes from the team that developed the RUV method: Removing unwanted variation from large-scale RNA sequencing data with PRPS.

ADD REPLY • link 8 months ago by jv ★ 1.8k

Thanks for the links, that was useful to read. I won't be attempting this then; it's not worth it for my purposes. Instead, I'm thinking of considering each sample separately and looking at the percentile of expression for a given gene. I have seen studies that set thresholds by percentile: say, genes above the 80th percentile of expression are considered high-expression and genes below the 40th percentile are considered low-expression. Based on this, I am thinking of calculating percentiles for the genes in each sample, and then looking at the average percentile of a gene in different groups (leaf tissue, fruit tissue etc.). I have tested this on several genes I'd expect to be expressed differently and I'm seeing exactly the results I expected.
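Roughly, the calculation looks like the sketch below, written in plain Python. The sample names, gene names and values are made up, the function names are hypothetical, and the percentile is defined here as the share of genes in the same sample with expression at or below that of the gene in question, which is one possible definition among several.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean

# Hypothetical data: sample -> (tissue, {gene: expression value}).
samples = {
    "leaf1": ("leaf", {"geneA": 50.0, "geneB": 5.0, "geneC": 1.0, "geneD": 0.0}),
    "leaf2": ("leaf", {"geneA": 900.0, "geneB": 30.0, "geneC": 100.0, "geneD": 2.0}),
    "root1": ("root", {"geneA": 1.0, "geneB": 20.0, "geneC": 300.0, "geneD": 7.0}),
}


def percentile_ranks(expression):
    """Percentile (0-100) of each gene within one sample.

    Defined as the percentage of genes whose value is <= the gene's value.
    """
    ordered = sorted(expression.values())
    total = len(ordered)
    return {
        gene: 100.0 * bisect_right(ordered, value) / total
        for gene, value in expression.items()
    }


def mean_percentile_by_tissue(sample_table, gene):
    """Average a gene's within-sample percentile over the samples of each tissue."""
    per_tissue = defaultdict(list)
    for tissue, expression in sample_table.values():
        per_tissue[tissue].append(percentile_ranks(expression)[gene])
    return {tissue: mean(values) for tissue, values in per_tissue.items()}


print(mean_percentile_by_tissue(samples, "geneA"))
```

Since the ranks are computed inside each sample, they do not depend on the scale of that sample's counts, which is why this avoids cross-study normalization. It does not remove batch effects that change the ordering of genes within a sample.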

ADD REPLY • link 8 months ago by peter.polacek • 0