deseq2 normalization for Gtex datasets
1
0
Entering edit mode
6 days ago
rheab1230 ▴ 60

Hello, I have gene count and sample attributes file from gtex v8 portal. I want to first extract brain tissue gene count data only by using the samID coming from sample attributes.txt file. Once, I have gene count coming from brain cortex tissue, I want to normallize it using deseq2 package. This is what I have done:

GTEx_Analysis_v8_Annotations_SampleAttributesDS <- read.delim("GTEx_Analysis_v8_Annotations_SampleAttributesDS.txt")
sample_attributes <- select(GTEx_Analysis_v8_Annotations_SampleAttributesDS,SAMPID,SMTS,SMTSD,SMAFRZE)

sample_attributes_braindata <- sample_attributes %>% filter(sample_attributes$SMTS == "Brain" & sample_attributes$SMAFRZE == "RNASEQ")
sample_attributes_braindata_cortex = sample_attributes_braindata[sample_attributes_braindata$SMTSD == "Brain - Cortex",] dim(sample_attributes_braindata_cortex) library(tidyverse) GTEx_Analysis_gene_reads <- read_table2("GTEx_Analysis_2017-06-05_v8_RNASeQCv1.1.9_gene_reads.gct") GTEx_Analysis_gene_reads <- data.frame(GTEx_Analysis_gene_reads) (GTEx_Analysis_gene_reads)[1:5,1:5] dim(GTEx_Analysis_gene_reads) gtex_gene_name = GTEx_Analysis_gene_reads$Name
gtex_gene_description = GTEx_Analysis_gene_reads$Description sample_attributes_braindata_cortex$SAMPID = gsub(pattern = "-", replacement = ".", x = sample_attributes_braindata_cortex$SAMPID, fixed = TRUE) index= colnames(GTEx_Analysis_gene_reads) %in% sample_attributes_braindata_cortex$SAMPID
sum(index)

library(DESeq2)
keep_genes<-rowSums(counts(ds))>0
ds<-ds[keep_genes,]
ds<-estimateSizeFactors(ds)
normalized_counts<-counts(ds, normalized=TRUE)

write.table(normalized_counts,file="GTEx_Analysis_gene_reads_cortex_normalized.txt", sep= "\t", quote = F, row.names = F)


But the normalized gene count that I get from this is not normalized. There is some error in this. I don't have any condition in this case. I have used condtion here as colnames which represent the sampleID or GTEX-1117F. Should I use condition as gene names to normalize it? I am not familiar with GTEx data that much and need some guidance. Thank you.

1
Entering edit mode
6 days ago
LChart ▴ 420

I have used condtion here as colnames which represent the sampleID or GTEX-1117F.

This is probably the core problem problem; the condition should have fewer levels than the number of samples. In a pinch you can use ~ 1 as your design; but for GTEx I would strongly recommend including all of the covariates that are potential confounders (age, sex, ischemic time, RIN; see https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1323-z ; figure 5)

0
Entering edit mode

Hello, I do include covariates later. At first I just want to normalize the data and then I generate covariates file using peer tool and then regress it out from normalized gene count to generate residual expression

0
Entering edit mode

So i should remove sample_info line and then just use this?

ds<-DESeqDataSetFromMatrix(countData=GTEx_Analysis_gene_reads_cortex, colData=sample_info, design= ~)

0
Entering edit mode

That would give you library-normalized counts; yes.