Question: How are the level 3 TCGA raw counts calculated for each sample?
3
4.0 years ago by
pbio120
United States
pbio120 wrote:

I couldn't find proper documentation of the software used to generate the read counts in the TCGA level 3 data.

I ran a 21 normal vs. 21 tumor sample analysis on TCGA RNASeq level 3 data to find differentially expressed genes using DESeq.

I also took the Illumina Body Map SRA file, aligned it with TopHat, and generated counts with HTSeq. I then compared these HTSeq read counts (from the TopHat BAM files) with the 21 tumor samples from the TCGA level 3 data.

I expected the differentially expressed genes from the two DESeq comparisons, "Illumina Body Map vs. 21 tumor samples" and "21 normal vs. 21 tumor samples from TCGA", to overlap well. But very few genes overlap.

Does this mean there is something wrong with my processing of the Illumina Body Map file, or is it due to differences in the protocol followed for the TCGA data?

Could anyone tell me how the read counts in the TCGA level 3 data are generated, and with which program?

rna-seq tcga illumina body map • 5.6k views
modified 4.0 years ago by genomax69k • written 4.0 years ago by pbio120
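
To put a number on "very few genes overlap", the two DESeq result lists can be compared as sets. This is only a minimal sketch: the function name `overlap_stats` and the gene IDs are hypothetical placeholders, not values from the analysis described in the question.

```python
def overlap_stats(genes_a, genes_b):
    """Summarize the overlap between two differentially expressed gene lists."""
    a, b = set(genes_a), set(genes_b)
    shared = a & b
    union = a | b
    return {
        "shared": sorted(shared),
        "n_shared": len(shared),
        # Jaccard index: shared genes divided by all genes seen in either list.
        "jaccard": len(shared) / len(union) if union else 0.0,
        "frac_of_a": len(shared) / len(a) if a else 0.0,
        "frac_of_b": len(shared) / len(b) if b else 0.0,
    }


# Hypothetical gene IDs, for illustration only.
tcga_normal_vs_tumor = ["GENE1", "GENE2", "GENE3", "GENE4"]
bodymap_vs_tumor = ["GENE3", "GENE4", "GENE5"]

stats = overlap_stats(tcga_normal_vs_tumor, bodymap_vs_tumor)
print(stats["n_shared"], round(stats["jaccard"], 2))  # prints: 2 0.4
```

Reporting the shared fraction relative to each list separately helps when the two lists differ a lot in size, since the Jaccard index alone can look small for that reason.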
2

I'm 99% certain that they use RSEM (after MapSplice, I think), though I don't know the version numbers or any options that they specify. I imagine that this could give rather divergent results from tophat2 -> htseq-count; there'd at least be a batch effect.

modified 4.0 years ago • written 4.0 years ago by Devon Ryan91k
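
One practical consequence if the level 3 values do come from RSEM: RSEM reports expected counts, which can be fractional, whereas htseq-count reports integers and DESeq expects integer input. The sketch below reads a tab-delimited table and rounds the counts. The column names `gene_id` and `raw_count`, the helper name, and the example values are assumptions for illustration; check the header of the actual files. Rounding only makes the values usable as counts; it does not remove any difference between the pipelines.

```python
import csv
import io


def load_integer_counts(handle, id_col="gene_id", count_col="raw_count"):
    """Read a tab-delimited quantification table and round counts to integers.

    The default column names are assumptions, not a documented file layout.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    counts = {}
    for row in reader:
        counts[row[id_col]] = int(round(float(row[count_col])))
    return counts


# Hypothetical table contents, for illustration only.
example = io.StringIO(
    "gene_id\traw_count\n"
    "GENE1\t10.49\n"
    "GENE2\t0.00\n"
    "GENE3\t1532.51\n"
)
counts = load_integer_counts(example)
print(counts)  # {'GENE1': 10, 'GENE2': 0, 'GENE3': 1533}
```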

I have downloaded TCGA RNASeqV1, which uses RPKM instead of RSEM; I think RNASeqV2 uses RSEM? And if it is a batch effect, what can be done to get rid of it?

As far as I know, TCGA RNASeqV1 uses the Tuxedo pipeline (TopHat2 + Cufflinks + Cuffdiff) for the analysis. But if that is the case, how do they generate raw counts?

modified 4.0 years ago • written 4.0 years ago by pbio120

Ah, yeah, V1 data is different and I don't know off-hand how that was made. If they did use cufflinks then it's unlikely that they used raw counts at any point (though you can use the merged GTF file with htseq-count to get them).

written 4.0 years ago by Devon Ryan91k
2

Comparison across data sets generated by different groups is not something that you should expect to work well. To make matters worse, the data processing for the different sets appears to be quite different.  A lot of folks seem to make the assumption that since "it is all RNA-seq", it should be possible to make comparisons between any two datasets.  Unfortunately, that is generally not true.  The same problems exist as for microarrays.  Batch effect is something that can be minimized, but not ignored. 

written 4.0 years ago by Sean Davis25k

If I could up-vote this more than once I would. It's amazing how many people try to ignore the simple truth that, "Batch effect is something that can be minimized, but not ignored."

written 4.0 years ago by Devon Ryan91k

I totally agree.  As I mentioned below, we re-processed all of the RNA-Seq data from multiple datasets as part of our pipeline, in order to minimize the batch effect.  You still have differences but at least it can be minimized. TCGA was especially messy due to different aligners used, different genomes used, etc.

written 4.0 years ago by matt.newman130
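
The batch-effect point can be made concrete for the design in the question. If every normal sample comes from one source and every tumor sample from another, the data source and the condition are perfectly confounded, and no statistical model can separate the two effects. A minimal sketch of that check follows; the function name, the source labels, and the single Body Map sample are hypothetical placeholders.

```python
from collections import defaultdict


def is_fully_confounded(samples):
    """Return True when no batch (data source) contains more than one condition.

    `samples` is an iterable of (condition, batch) pairs.
    """
    conditions_in_batch = defaultdict(set)
    for condition, batch in samples:
        conditions_in_batch[batch].add(condition)
    return all(len(conds) == 1 for conds in conditions_in_batch.values())


# Both conditions come from the same source: source and condition are separable.
tcga_only = [("normal", "TCGA")] * 21 + [("tumor", "TCGA")] * 21

# Normals from one source, tumors from another (sample counts are illustrative).
mixed = [("normal", "BodyMap")] + [("tumor", "TCGA")] * 21

print(is_fully_confounded(tcga_only))  # False
print(is_fully_confounded(mixed))      # True
```

When the check returns True, genes called differentially expressed may reflect the difference between sources as much as the difference between normal and tumor tissue.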

How did you solve your problem in the end? I just want to combine TCGA level 3 data with my RNA-seq htseq-count data to find differentially expressed genes. But I'm not sure what parameters to use to rerun my raw RNA-seq data through MapSplice and RSEM so that it is consistent with the TCGA level 3 data.

written 2.1 years ago by keryruo10
7
4.0 years ago by
genomax69k
United States
genomax69k wrote:

The TCGA V1 analysis (old) used BWA, and the V2 analysis (new) uses MapSplice. All V1 data was reprocessed as V2.

This file has details about the V1 and V2 pipelines

Note: Since TCGA data has now moved to GDC, the description file has been updated to point to the new resource.

From the description file

V1_BWAtoTranscriptome, V1_RNASeqQuantification: UNC V1 RNA-Seq Workflow - BWA Alignment to Transcriptome
Date: 20101108

And

V2_MapSpliceRSEM: UNC V2 RNA-Seq Workflow - MapSplice genome alignment and RSEM estimation of GAF 2.1
Date: 05-10-2012

 

modified 3.0 years ago • written 4.0 years ago by genomax69k

Thanks genomax2 for your answer, but they haven't mentioned the software used at the expression quantification level. I also have the BAM files for all tissues from the Illumina Body Map. So could you please tell me which pipeline was followed to generate the Illumina Body Map BAM files?

modified 4.0 years ago • written 4.0 years ago by pbio120
0
4.0 years ago by
matt.newman130
United States
matt.newman130 wrote:

This website has information on the pipeline: https://wiki.nci.nih.gov/display/TCGA/RNASeq+Version+2.

With our OncoLand data service (http://www.omicsoft.com/oncoland-service), we've processed nearly 20,000 RNA-Seq samples.  We found that we had to use the same pipeline (from aligner to counting algorithm), in order to get good results for downstream analysis.

written 4.0 years ago by matt.newman130