How are the level 3 TCGA raw counts calculated for each sample?
7.1 years ago
pbio ▴ 150

I couldn't find proper documentation of the software used to generate the read counts in the TCGA level 3 data.

I have run a 21 normal vs. 21 tumor sample analysis on TCGA RNASeq level 3 data to find differentially expressed genes with DESeq.

I then took the Illumina Body Map SRA files, aligned them with TopHat, and generated counts with HTSeq. I compared these HTSeq read counts (from the TopHat BAM files) against the 21 tumor samples from the TCGA level 3 data.

I expected the two DESeq comparisons, "Illumina Body Map vs. 21 TCGA tumor samples" and "21 normal vs. 21 tumor samples from TCGA", to share a good fraction of their differentially expressed genes. But very few genes overlap.
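To put a number on "very few", the overlap between the two result lists can be measured directly. This is a minimal sketch, assuming each DESeq result has already been filtered down to a list of significant gene IDs; the gene names below are placeholders, not real results.

```python
def overlap_stats(genes_a, genes_b):
    """Return the shared genes and the Jaccard index of two gene lists."""
    set_a, set_b = set(genes_a), set(genes_b)
    shared = set_a & set_b
    union = set_a | set_b
    # Jaccard index: size of the intersection divided by size of the union
    jaccard = len(shared) / len(union) if union else 0.0
    return shared, jaccard


# Placeholder gene IDs, for illustration only
bodymap_vs_tumor = ["geneA", "geneB", "geneC", "geneD"]
normal_vs_tumor = ["geneA", "geneC", "geneE"]

shared, jaccard = overlap_stats(bodymap_vs_tumor, normal_vs_tumor)
print(sorted(shared), jaccard)
```

Both lists need to use the same gene identifier scheme before they are compared, otherwise the overlap will look smaller than it really is.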

Does this mean there is something wrong with my processing of the Illumina Body Map files, or is it due to differences in the protocol followed for the TCGA data?

Could anyone tell me how the read counts in the TCGA level 3 data are generated, and with which program?

RNA-Seq TCGA illumina body map • 7.4k views

I'm 99% certain that they use RSEM (after MapSplice, I think), though I don't know the version numbers or any options that they specify. I imagine that this could give rather divergent results from tophat2 -> htseq-count...there'd at least be a batch effect.


I downloaded TCGA RNASeqV1, which uses RPKM instead of RSEM; I think RNASeqV2 uses RSEM? And if it is a batch effect, what can be done to get rid of it?

And as far as I know, TCGA RNASeqV1 uses the Tuxedo pipeline (TopHat2 + Cufflinks + Cuffdiff) for the analysis. But if that is the case, how do they generate raw counts?
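For reference, RPKM is a normalized value derived from a read count, not a count itself, which is why it matters here: DESeq expects raw integer counts as input. A minimal sketch of the standard RPKM definition follows; the numbers are made up for illustration and do not come from any TCGA file.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    # The 1e9 factor combines the per-kilobase (1e3) and per-million (1e6) scaling
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Illustrative numbers only: 500 reads on a 2 kb gene, 20 million mapped reads
example_rpkm = rpkm(500, 2000, 20_000_000)
print(example_rpkm)
```

Going back from RPKM to a count requires the gene length and library size used by the original pipeline, so the recovered values are approximations at best.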


Ah, yeah, V1 data is different and I don't know off-hand how that was made. If they did use cufflinks then it's unlikely that they used raw counts at any point (though you can use the merged GTF file with htseq-count to get them).
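As a rough sketch of that last step, the htseq-count call could be assembled like this. The file names are hypothetical, and the strandedness setting is an assumption that has to match the actual library prep; this is not TCGA's own procedure.

```python
import shlex


def htseq_count_cmd(bam, gtf, stranded="no"):
    """Build an htseq-count command line for a position-sorted BAM file."""
    return [
        "htseq-count",
        "--format", "bam",       # input alignments are BAM rather than SAM
        "--order", "pos",        # BAM is sorted by coordinate
        "--stranded", stranded,  # assumption: set to match the library protocol
        "--type", "exon",        # count reads overlapping exon features
        "--idattr", "gene_id",   # summarize counts per gene_id
        bam,
        gtf,
    ]


# Hypothetical file names, for illustration only
cmd = htseq_count_cmd("sample.sorted.bam", "merged.gtf")
print(shlex.join(cmd))
```

The printed command writes one line per gene to standard output, so in practice it would be redirected to a file per sample.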


Comparison across data sets generated by different groups is not something that you should expect to work well. To make matters worse, the data processing for the different sets appears to be quite different.  A lot of folks seem to make the assumption that since "it is all RNA-seq", it should be possible to make comparisons between any two datasets.  Unfortunately, that is generally not true.  The same problems exist as for microarrays.  Batch effect is something that can be minimized, but not ignored.
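One quick sanity check before comparing across datasets is whether batch and condition are completely confounded, i.e. whether every batch contains only one condition. In that situation (for example, all normals from one source and all tumors from another) a batch difference cannot be separated from a biological one. A minimal sketch, with made-up sample labels:

```python
from collections import defaultdict


def batches_confounded(samples):
    """Return True if no batch contains more than one condition.

    `samples` is an iterable of (batch, condition) pairs.
    """
    conditions_per_batch = defaultdict(set)
    for batch, condition in samples:
        conditions_per_batch[batch].add(condition)
    return all(len(conds) == 1 for conds in conditions_per_batch.values())


# Made-up labels: normals from one source, tumors from another
cross_dataset = [("bodymap", "normal")] * 3 + [("tcga", "tumor")] * 3
# Made-up labels: normals and tumors processed within the same source
within_dataset = [("tcga", "normal")] * 3 + [("tcga", "tumor")] * 3

print(batches_confounded(cross_dataset))
print(batches_confounded(within_dataset))
```

When the check returns True, no downstream correction can tell the two effects apart, and reprocessing alone does not fix the design.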


If I could up-vote this more than once I would. It's amazing how many people try to ignore the simple truth that, "Batch effect is something that can be minimized, but not ignored."


I totally agree.  As I mentioned below, we re-processed all of the RNA-Seq data from multiple datasets as part of our pipeline, in order to minimize the batch effect.  You still have differences but at least it can be minimized. TCGA was especially messy due to different aligners used, different genomes used, etc.


How did you solve your problem in the end? I just want to combine TCGA level 3 data with my own RNA-seq htseq-count data to find differentially expressed genes. But I'm not sure which parameters to use to rerun my raw RNA-seq data through MapSplice and RSEM so that it is consistent with the TCGA level 3 data.

7.1 years ago
GenoMax 119k

The TCGA V1 analysis (old) used BWA, and the V2 analysis (new) uses MapSplice. All V1 data was reprocessed as V2.

This file has details about the V1 and V2 pipelines

Note: Since TCGA data has now moved to GDC, the description file has been updated to point to the new resource.

From the description file

V1_BWAtoTranscriptome, V1_RNASeqQuantification: UNC V1 RNA-Seq Workflow - BWA Alignment to Transcriptome
Date: 20101108

And

V2_MapSpliceRSEM: UNC V2 RNA-Seq Workflow - MapSplice genome alignment and RSEM estimation of GAF 2.1
Date: 05-10-2012


Thanks genomax2 for your answer, but they haven't mentioned the software used at the expression quantification step. I also have the BAM files for all tissues from the Illumina Body Map. So, could you please tell me which pipeline was used to generate the Illumina Body Map BAM files?

7.1 years ago
matt.newman ▴ 170

This website has information on the pipeline: https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2.

With our OncoLand data service (http://www.omicsoft.com/oncoland-service), we've processed nearly 20,000 RNA-Seq samples.  We found that we had to use the same pipeline (from aligner to counting algorithm), in order to get good results for downstream analysis.