Hello all, I am currently working on RNA seq analysis of TCGA and SRA cancer data.I have processed the SRA data similar to TCGA by following their pipeline.The problem appears in differential expression analysis.I have used edgeR, voom and DESeq to carry out the DEG analysis using both expected and normalized count.I am not confident about the results since the pooled normal samples that come from both SRA and TCGA cluster differently which leads me to believe that there might be a bias.I have 2 questions to ask regarding this: 1.Is the broad firehose data of TCGA genes.normalized.results samples already log 2 normalized after upper quartile normalization or should we normalize them before finding the differentially expressed genes? 2.Is there any other method to identify the DEGs which takes into account the database differences?
Regards, Swati krishna
Did you try doing batch normalization (using Combat)? TCGA and SRA might be sequenced using different library kits (poly A vs rib zero or vice versa )which also have to be corrected.
Hello Ron, Sorry for the late reply.yes, i tried Combat,but its giving negative values in the result matrix.How can i give it as an input to edgeR?
Check out these links: edgeR: Correct pipeline for DE analysis with multiple conditions and batches
http://www.bioconductor.org/help/workflows/rnaseqGene/#batch
Remove Batch Effect From RNAseq with SVAseq and Combat
I used the below code to use Combat to remove batch effects from normal and cancer expected counts seperately. I replaced the negative values with 0 and continued with edgeR. Is this a right way to proceed?
sample=c("N1","N2","N3","N4","N5","N6", "N7","N8","N9","N10", "N11","N12","N13","N14","N15","N16", "N17","N18","N19","N20", "N21","N22","N23","N24","N25","N26", "N27","N28","N29","N30")
db=c(rep("TCGA",12),rep("SRA",18))
type=c(4,1,4,1,5,4,4,1,5,1,1,4,2,3,3,3,2,3,3,2,2,2,2,2,2,3,2,2,3,3)
batch=data.frame(sample,db,type)
my_DGEList <- DGEList(counts=normal_ec)
normal_norm <- calcNormFactors(my_DGEList, method="upperquartile")
my_data = cpm(normal_norm, log=TRUE, prior.count=2)
combat <- ComBat(dat=my_data, batch=my_batch,par.prior = FALSE, prior.plots = TRUE) # batch file included 5 batches categorized according to the cluster dendrogram plotted for the normal data.