Question

deseq2 normalization for Gtex datasets

1

Entering edit mode

20 months ago

rheab1230 ▴ 140

Hello, I have gene count and sample attributes file from gtex v8 portal. I want to first extract brain tissue gene count data only by using the samID coming from sample attributes.txt file. Once, I have gene count coming from brain cortex tissue, I want to normallize it using deseq2 package. This is what I have done:

GTEx_Analysis_v8_Annotations_SampleAttributesDS <- read.delim("GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt")
sample_attributes <- select(GTEx_Analysis_v8_Annotations_SampleAttributesDS,SAMPID,SMTS,SMTSD,SMAFRZE)

sample_attributes_braindata <- sample_attributes %>% filter(sample_attributes$SMTS == "Brain" & sample_attributes$SMAFRZE == "RNASEQ")
sample_attributes_braindata_cortex = sample_attributes_braindata[sample_attributes_braindata$SMTSD == "Brain - Cortex",]
dim(sample_attributes_braindata_cortex)
library(tidyverse)


GTEx_Analysis_gene_reads <- read_table2("GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct")
GTEx_Analysis_gene_reads <- data.frame(GTEx_Analysis_gene_reads)
(GTEx_Analysis_gene_reads)[1:5,1:5]
dim(GTEx_Analysis_gene_reads)

gtex_gene_name = GTEx_Analysis_gene_reads$Name
gtex_gene_description = GTEx_Analysis_gene_reads$Description

sample_attributes_braindata_cortex$SAMPID = gsub(pattern = "-", replacement = ".", x = sample_attributes_braindata_cortex$SAMPID, fixed = TRUE)

index= colnames(GTEx_Analysis_gene_reads) %in% sample_attributes_braindata_cortex$SAMPID
sum(index)
GTEx_Analysis_gene_reads_cortex = GTEx_Analysis_gene_reads[,index]
rownames(GTEx_Analysis_gene_reads_cortex) = gtex_gene_name

#write.table(GTEx_Analysis_gene_reads_cortex, file = "GTEx_Analysis_gene_reads_cortex_unnorm.txt", sep="\t", quote=FALSE, col.names=NA)

#GTEx_Analysis_gene_reads_cortex = read.delim("GTEx_Analysis_gene_reads_cortex_unnorm.txt",row.names = 1)

sample_info<-data.frame(condition=colnames(GTEx_Analysis_gene_reads_cortex), row.names=colnames(GTEx_Analysis_gene_reads_cortex))
library(DESeq2)
ds<-DESeqDataSetFromMatrix(countData=GTEx_Analysis_gene_reads_cortex, colData=sample_info, design= ~condition)
keep_genes<-rowSums(counts(ds))>0
ds<-ds[keep_genes,]
ds<-estimateSizeFactors(ds)
normalized_counts<-counts(ds, normalized=TRUE)


write.table(normalized_counts,file="GTEx_Analysis_gene_reads_cortex_normalized.txt", sep= "\t", quote = F, row.names = F)

But the normalized gene count that I get from this is not normalized. There is some error in this. I don't have any condition in this case. I have used condtion here as colnames which represent the sampleID or GTEX-1117F. Should I use condition as gene names to normalize it? I am not familiar with GTEx data that much and need some guidance. Thank you.

gene GTEX expression deseq2 tissue normalization brain • 941 views

ADD COMMENT • link updated 20 months ago by LChart 3.9k • written 20 months ago by rheab1230 ▴ 140

score 1 · Answer 1 · 2022-08-04

1

Entering edit mode

20 months ago

LChart 3.9k

I have used condtion here as colnames which represent the sampleID or GTEX-1117F.

This is probably the core problem problem; the condition should have fewer levels than the number of samples. In a pinch you can use ~ 1 as your design; but for GTEx I would strongly recommend including all of the covariates that are potential confounders (age, sex, ischemic time, RIN; see https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1323-z ; figure 5)

ADD COMMENT • link 20 months ago by LChart 3.9k

0

Entering edit mode

Hello, I do include covariates later. At first I just want to normalize the data and then I generate covariates file using peer tool and then regress it out from normalized gene count to generate residual expression

ADD REPLY • link 20 months ago by rheab1230 ▴ 140

0

Entering edit mode

So i should remove sample_info line and then just use this?

ds<-DESeqDataSetFromMatrix(countData=GTEx_Analysis_gene_reads_cortex, colData=sample_info, design= ~)