TCGA provides sample preparation and sequencing information in MAGE-TAB format along with the primary data (see here). Does the ICGC provide something similar? I can't seem to find anything like it on the website or through Google.
In particular, I wonder:
- Have variants been (re)-called from raw BAM files for the ICGC simple somatic mutation (SSM) files or were they simply lifted over from the VCF files of the primary projects (like TCGA) they were taken from?
- What normalization strategy was used for the RNASeq data? The RNASeq file I'm now looking at (from the project SKCM-US) links to the 2010 RSEM paper. Can I therefore assume the normalized values to be in TPM/10^6 units?
- Why does the number of features per donor vary in the RNASeq data? Some donors have on the order of 20k features, others exactly twice as many. If some samples were paired-end (PE) sequenced, is there some way to confirm this? I noticed that, for the donors with twice as many features, the correlation between the two sets of feature expression levels is not very high.
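To make the last two points concrete, here is a minimal sketch of the kind of check I mean. It counts rows and distinct features per donor, sums the normalized values per donor (values in TPM/10^6 units should sum to roughly 1 per sample when every feature is present), and correlates the two values of each feature that occurs twice. The column names `icgc_donor_id`, `gene_id` and `normalized_read_count` are assumptions about the expression file layout, the toy table is made up, and the pairing assumes the duplicated rows appear in the same order for every feature, which may not hold in the real file.

```python
import csv
import io
import math
from collections import defaultdict

# Made-up toy table; column names are assumed, not taken from ICGC documentation.
TOY = (
    "icgc_donor_id\tgene_id\tnormalized_read_count\n"
    "DO1\tA\t0.5\nDO1\tB\t0.3\nDO1\tC\t0.2\n"
    "DO2\tA\t0.5\nDO2\tB\t0.3\nDO2\tC\t0.2\n"
    "DO2\tA\t0.4\nDO2\tB\t0.4\nDO2\tC\t0.2\n"
)


def load_expression(handle):
    """Read a tab-separated table into {donor: {gene: [values, ...]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(handle, delimiter="\t"):
        value = float(row["normalized_read_count"])
        table[row["icgc_donor_id"]][row["gene_id"]].append(value)
    return table


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; NaN if either is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else float("nan")


def summarize(table):
    """Per donor: row count, distinct genes, sum of values, duplicate correlation."""
    summary = {}
    for donor, genes in table.items():
        # Genes listed exactly twice; assumes a consistent row order across genes.
        pairs = [v for v in genes.values() if len(v) == 2]
        r = None
        if len(pairs) > 1:
            r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        summary[donor] = {
            "rows": sum(len(v) for v in genes.values()),
            "genes": len(genes),
            "sum": sum(sum(v) for v in genes.values()),
            "duplicate_r": r,
        }
    return summary


if __name__ == "__main__":
    for donor, stats in summarize(load_expression(io.StringIO(TOY))).items():
        print(donor, stats)
```

On a real file, `io.StringIO(TOY)` would be replaced by an open file handle. A per-donor sum near 2 rather than 1 would suggest that each feature was quantified twice for that donor.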
The ICGC data appeals to us because the SSM files include the variant allelic fraction, which we would like to use in our analyses, and because of the website's intuitive interface.
Thanks for your input!