Find most upregulated genes in one library?
Entering edit mode
4.8 years ago

Good Evening,

I have a very trivial question for you all. I am running a bioinformatics pipeline (with Rsubread and EdgeR) on some RNA-seq data, and am trying to find the most upregulated genes In one library. To be more specific; I have sequencing data from three cell types: Cardiomyocyte, Endothelial, and Fibroblasts and I am well on my way into finding genes that are differentially expressed across the three groups, but how do I find the most highly expressed genes in ONLY the cardiomyocyte group? Here is the sample code that I've used so far to do the differential expression analysis across multiple groups:

featureCounts_8 <- featureCounts(files= all_bam_8, annot.inbuilt = "hg38", allowMultiOverlap = TRUE, nthreads = 16)

grouping <- factor(c("A","A","A","A","B","B","B","B","B","C","C","C","C","C","C","D","D","D","D","E","E","E","F","F","F","F","G","G","G","G")) // GROUP A are the cardiomyocte libraries, group D are the endothelial cell libraries, and group F are the fibroblast samples. This line of code is because I have 30 libraries, and each correspond to a different group.

DGEList_8 <- DGEList(counts=featureCounts_8$counts, genes=featureCounts_8$annotation, group = grouping)
DGEList_8 <- calcNormFactors(DGEList_8)
design <- model.matrix(~DGEList_8$samples$group) // THERE ARE 7 TOTAL GROUPS IN THIS DGELIST. Remember that the only ones I'm interested in are A, D, and F (aka groups 0, 3, and 5) The rest of the groups are control groups
DGEList_8 <- estimateDisp(DGEList_8, design)
fit <- glmQLFit(DGEList_8, design) 
// BELOW IS WHERE I RUN THE QLFTESTS for each comparison, assuming that the CARDIOMYOCYTE group is my intercept
qlf <- glmQLFTest(fit, coef = 3) //to compare group A (group 1: cardiomyocyte) vs group D (group 3: endothelial)
qlf <- glmQLFTest(fit, coef = 6) //to compare group A (group 1: cardiomyocyte) vs group F (group 6: Fibroblast)

I posted my code so that hopefully someone may point out the step in which I should do my intra-library expression test. I will re-iterate the question: how do I find the highest up-regulated genes in ONLY group A? Meaning, how do I find the genes with the highest read counts relative to all of the other genes found in library A?

Thanks in advance for your help,

EdgeR RNA-Seq Rsubread • 1.2k views
Entering edit mode

I'm not clear on your meaning? To me 'up-regulated' suggests a change from one state to another. Do you mean the genes that are higher in groupA compared to the control groups, or do you really mean " genes with the highest read counts relative to all of the other genes found in library A?"

for the former, I recommend putting all your control libraries together so that your group factor has the levels "Cardiomyocyte", "Control", "endothelial", "Fibroblast". You could then do the gplmQLFTest with the setting contrast=c(1,-1,0,0)

If the answer is the former, and you really meant what you said above then....

The trivial answer is:

groupA_cpm <- cpm(DEList_8)[,1:]
groupA_means <- rowMeans(groupA_cpm)
groupA_sorted <- sort(groupA_means, descreasing=TRUE)
top_genes <- head(groupA_sorted)

But I'm pretty sure this isn't what you want, nor is will it mean what you think it means if it is what you want.

1) The mostly highly expressed genes will be the same in all cell types. I guess they'll be the genes of basic metabolism, the basic gene expression machinery and the mitochondrial genes. In all cell types. In general very highly expressed genes are not very interesting.

2) The counts of reads in a gene is not a good way of estimating its expression relative to other genes in the same sample, only of estimating its expression relative to the same gene in other samples. Consider two genes, A and B. A is 1000nt long and B is 200nt long. Lets say a library has one copy of A and 5 of B. When A is broken up into 200nt fragments as part of library prep, there will be 5 reads from A. Because B is already 200nt it won't be broken up, but there are 5 copies of it, so there will also be 5 reads from B. So you'll get the same number of reads from A as from B, even though the cell has 5 times as many copies of B. This is only one bias, there are others, for example GC content. I you want to compare expression of 2 genes within one sample, you need to use a method that produces TPM (transcripts per million) values, like Salmon, Kallisto or RSEM.

Entering edit mode

Thank you for the reply, you answered my question. I will generate a TPM chart and see if i find anything interesting.

Also, that thing that you thought wasn't what i wanted, and also thought it wouldn't mean what i thought it would mean if it was what i really wanted, is actually something i didn't know i wanted, but now do want :P


Login before adding your answer.

Traffic: 2306 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6