8 months ago by
University College London Cancer Institute
I think that you should be prepared to receive a deluge of information. The TCGA is a very 'rich' resource with literally petabytes of archived data. Even if your proposed dataset is small, there will be a wealth of clinical information to sift through, and you will need to carefully link this clinical information to your germline and mutation data via, preferably, a UUID (universally unique ID). There are many types of IDs, though, and you need to be sure of the exact one you're looking at.
Great care (I can't stress this enough) has to be taken, because what may look like a unique identifier for a sample may not be unique: it may refer to, for example, both the tumour and the paired normal sample from the same patient. You wouldn't want to mix those up! Please read more here:
If you end up with an ID and don't know to what it refers, you can look it up at the GDC Data Portal: https://portal.gdc.cancer.gov/exploration
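To illustrate the tumour/normal point above, here is a minimal sketch in Python that splits a TCGA barcode into its patient part and its sample type. It assumes the standard barcode layout (TCGA-TSS-Participant-SampleVial-PortionAnalyte-Plate-Center) and the usual sample type code ranges (01-09 tumour, 10-19 normal, 20-29 control). The function name `classify_barcode` is my own invention for this example, and it does not handle UUIDs, which have to be looked up instead.

```python
import re

# Assumed layout: TCGA-TSS-Participant-SampleVial-PortionAnalyte-Plate-Center
# Only the first three or four fields are needed to tell tumour from normal.
BARCODE_RE = re.compile(
    r"^TCGA-(?P<tss>[A-Z0-9]{2})-(?P<participant>[A-Z0-9]{4})"
    r"(?:-(?P<sample_type>\d{2})(?P<vial>[A-Z])?)?"
)


def classify_barcode(barcode):
    """Return (patient_barcode, kind) for a TCGA barcode (hypothetical helper)."""
    m = BARCODE_RE.match(barcode.strip().upper())
    if m is None:
        raise ValueError(f"not a recognisable TCGA barcode: {barcode!r}")
    patient = f"TCGA-{m['tss']}-{m['participant']}"
    code = m["sample_type"]
    if code is None:
        # A patient-level barcode covers tumour AND normal samples.
        return patient, "patient"
    n = int(code)
    if 1 <= n <= 9:
        kind = "tumour"
    elif 10 <= n <= 19:
        kind = "normal"
    elif 20 <= n <= 29:
        kind = "control"
    else:
        kind = "unknown"
    return patient, kind
```

With this, `TCGA-02-0001-01C-01D-0182-01` and `TCGA-02-0001-10A-01D-0182-01` both map to the patient `TCGA-02-0001`, but the first is classified as tumour and the second as normal, which is exactly the distinction you lose if you truncate the barcode to its first three fields.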
On the other points, it's difficult for anyone to give specific advice for the following reasons:
- We don't know if you're going to receive the open access (already processed) or restricted (raw) data. You're partly interested in mutation data. The processed version of this data, direct from the TCGA, is in Mutation Annotation Format (MAF); the specification is here: https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification It gets more complicated because, sometimes, the same sample was processed at 2 or more different centers, and you'll therefore find it duplicated. Also, different centers used different somatic variant callers, and some believe that we should therefore not analyse these different datasets together. If you're getting the restricted data, you may have access to the more traditional VCF files, and possibly even the aligned BAM files prior to variant calling, which would make your work a lot easier from my perspective.
- We don't know what your hypotheses are - the type of statistical test(s) will depend on this
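As a rough sketch of how you might spot the duplication mentioned in the first point, the Python below lists samples that appear under more than one center in a MAF file. It assumes a tab-delimited MAF with the standard `Tumor_Sample_Barcode` and `Center` columns and optional header lines starting with `#`; the function names are hypothetical, and what you then do with the duplicates (pick one center, or analyse them separately) is your call.

```python
import csv
from collections import defaultdict


def centers_per_sample(maf_lines):
    """Map each Tumor_Sample_Barcode to the set of centers reporting it."""
    # Skip '#' comment lines (e.g. '#version 2.4') and blank lines.
    rows = (ln for ln in maf_lines if ln.strip() and not ln.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    centers = defaultdict(set)
    for row in reader:
        centers[row["Tumor_Sample_Barcode"]].add(row["Center"])
    return centers


def duplicated_samples(maf_lines):
    """Return {sample: sorted list of centers} for samples seen at 2+ centers."""
    return {
        sample: sorted(found)
        for sample, found in centers_per_sample(maf_lines).items()
        if len(found) > 1
    }
```

Both functions accept any iterable of lines, so you can pass an open file handle directly, e.g. `with open(path) as fh: dups = duplicated_samples(fh)`, where `path` points to your own MAF file.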
That's all that I can think of right now, but I think that I've emphasised some of the critical issues involved in just managing and understanding the TCGA data when you get it.