Question

Comparison of TCGA and SRA data

1

Entering edit mode

7.5 years ago

swatikrishna113 ▴ 10

Hello all, I am currently working on RNA seq analysis of TCGA and SRA cancer data.I have processed the SRA data similar to TCGA by following their pipeline.The problem appears in differential expression analysis.I have used edgeR, voom and DESeq to carry out the DEG analysis using both expected and normalized count.I am not confident about the results since the pooled normal samples that come from both SRA and TCGA cluster differently which leads me to believe that there might be a bias.I have 2 questions to ask regarding this: 1.Is the broad firehose data of TCGA genes.normalized.results samples already log 2 normalized after upper quartile normalization or should we normalize them before finding the differentially expressed genes? 2.Is there any other method to identify the DEGs which takes into account the database differences?

Regards, Swati krishna

RNA-Seq TCGA SRA Diffrential Expression • 2.5k views

ADD COMMENT • link updated 7.3 years ago by Biostar 20 • written 7.5 years ago by swatikrishna113 ▴ 10

0

Entering edit mode

Did you try doing batch normalization (using Combat)? TCGA and SRA might be sequenced using different library kits (poly A vs rib zero or vice versa )which also have to be corrected.

ADD REPLY • link 7.5 years ago by Ron ★ 1.2k

0

Entering edit mode

Hello Ron, Sorry for the late reply.yes, i tried Combat,but its giving negative values in the result matrix.How can i give it as an input to edgeR?

ADD REPLY • link 7.4 years ago by swatikrishna113 ▴ 10

0

Entering edit mode

Check out these links: edgeR: Correct pipeline for DE analysis with multiple conditions and batches

http://www.bioconductor.org/help/workflows/rnaseqGene/#batch

Remove Batch Effect From RNAseq with SVAseq and Combat

ADD REPLY • link 7.4 years ago by Ron ★ 1.2k

0

Entering edit mode

I used the below code to use Combat to remove batch effects from normal and cancer expected counts seperately. I replaced the negative values with 0 and continued with edgeR. Is this a right way to proceed?

sample=c("N1","N2","N3","N4","N5","N6", "N7","N8","N9","N10", "N11","N12","N13","N14","N15","N16", "N17","N18","N19","N20", "N21","N22","N23","N24","N25","N26", "N27","N28","N29","N30")

db=c(rep("TCGA",12),rep("SRA",18))

type=c(4,1,4,1,5,4,4,1,5,1,1,4,2,3,3,3,2,3,3,2,2,2,2,2,2,3,2,2,3,3)

batch=data.frame(sample,db,type)

my_DGEList <- DGEList(counts=normal_ec)

normal_norm <- calcNormFactors(my_DGEList, method="upperquartile")

my_data = cpm(normal_norm, log=TRUE, prior.count=2)

combat <- ComBat(dat=my_data, batch=my_batch,par.prior = FALSE, prior.plots = TRUE) # batch file included 5 batches categorized according to the cluster dendrogram plotted for the normal data.

ADD REPLY • link 7.4 years ago by swatikrishna113 ▴ 10