Question

Association study using germline SNPs, somatic mutations in tumors and clinical data from TCGA

2

6.6 years ago

rGun ▴ 10

Hi all, I'm interested in doing an association study using germline SNPs in specific pathways, tumor mutations and clinical data from a patient population in TCGA. I'm currently in the process of getting access to the datasets. If anyone has experience with this type of study, any advice on how I could go about it would be greatly appreciated.

Thanks in advance!

SNP TCGA Tumor mutations Clinical data • 2.0k views

0

Thanks a lot, Kevin. We have already done a study in which we looked at the association of SNPs and certain haplotypes with increased risk of BrCa. We wanted to see whether our findings would hold in another, similar dataset. Since we are in the process of getting access to restricted TCGA data, I thought of also analyzing the tumor mutations to look for any association with the SNPs/haplotypes of interest.

6.6 years ago by rGun ▴ 10

1

Hey, I have done a lot of research myself on breast cancer, including with TCGA data. If you expect to get access to the restricted data, and assuming that you have some bioinformatics expertise, then I think that it would be useful to re-analyse from the BAM file stage (to produce variant listings as VCFs) in order to ensure that you're calling germline and somatic variants in the same way across all samples. As I mentioned, the publicly-available TCGA mutation data is MAF-formatted (Mutation Annotation Format), and different centers called somatic variants in different ways.

If you have the capacity to do the above at your institute/dept. (including personnel, compute power, etc.), then a very interesting analysis would be to convert the VCF data into PLINK format and to conduct your association analysis there. I recently posted a tutorial about how one can convert VCF to PLINK: Produce PCA for 1000 Genomes Phase III in VCF format

When you get the metadata for the breast cancer samples, you could additionally format it as a phenotype file for PLINK and do all sorts of cool analyses, such as adjusting for different traits and BrCa sub-types.
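As a rough illustration of the phenotype-file idea, here is a minimal Python sketch. It assumes the clinical metadata has already been loaded as a list of dictionaries; the keys `sample_id`, `er_status` and `age` are hypothetical names, not real TCGA field names. It also assumes that FID and IID are both set to the sample ID, which has to match whatever IDs ended up in your PLINK files, and it uses PLINK's usual conventions of 1 = unaffected, 2 = affected and -9 = missing.

```python
MISSING = "-9"  # PLINK's default missing-phenotype code


def plink_pheno_lines(records, trait_columns):
    """Build the lines of a PLINK phenotype file (header: FID IID traits...).

    `records` is a list of dicts of clinical metadata; the key names are
    hypothetical and must be adapted to the real clinical tables.
    """
    lines = [" ".join(["FID", "IID"] + list(trait_columns))]
    for rec in records:
        # Assumption: family ID and individual ID are both the sample ID.
        row = [rec["sample_id"], rec["sample_id"]]
        for col in trait_columns:
            value = rec.get(col)
            row.append(MISSING if value in (None, "") else str(value))
        lines.append(" ".join(row))
    return lines


# Hypothetical example records (not real TCGA data).
example_records = [
    {"sample_id": "S1", "er_status": 2, "age": 54},
    {"sample_id": "S2", "er_status": 1, "age": ""},
]

for line in plink_pheno_lines(example_records, ["er_status", "age"]):
    print(line)
```

The resulting text can be written to a file and passed to PLINK as a phenotype or covariate file.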

There are undoubtedly many types of analyses that one could do. I've just outlined the one that I think matches best what you're aiming to do. One other that may be of interest is to conduct a lasso regression of all mutation data to find the best predictors of an end-point of interest. This has been done and published in the past by the Caldas group at Cambridge, I believe, but they may not have looked at all end-points.
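To make the lasso idea concrete, here is a small pure-Python sketch of lasso regression fitted by coordinate descent. It is only an illustration under stated assumptions: the end-point is treated as continuous (for a binary end-point one would normally use a penalised logistic model in a dedicated package), there is no intercept, and the mutation matrix and values are made up. It is not the method used in any published study.

```python
def soft_threshold(value, threshold):
    """Shrink `value` towards zero by `threshold` (the lasso shrinkage step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=200):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 without an intercept.

    X is a list of rows (e.g. 0/1 mutation indicators per sample),
    y is the end-point per sample, alpha is the penalty strength.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # feature is constant zero, leave its weight at 0
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            if new_w != w[j]:
                delta = new_w - w[j]
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Made-up data: column 0 drives the end-point, column 1 is unrelated.
X_demo = [[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]]
y_demo = [2.0, 2.0, 0.0, 0.0, 2.0, 0.0]
print(lasso_coordinate_descent(X_demo, y_demo, alpha=0.1))
```

Features whose weight is shrunk exactly to zero are dropped by the penalty; the ones left with non-zero weights are the candidate predictors.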

6.6 years ago by Kevin Blighe 87k

0

Thanks for the information! I really appreciate it. I have some experience in analyzing datasets, but this will be my first venture into TCGA data. However, the clinical correlations with the SNPs will be done by our collaborator. I'd really appreciate it if you could direct me to any similar study that you or anyone else has done, so that I can get a better idea.

6.6 years ago by rGun ▴ 10

score 1 · Answer 1 · 2017-09-13

I think that you should be prepared to receive a deluge of information. The TCGA is a very 'rich' resource with literally petabytes of archived data. Even if your proposed dataset is small, there will be a wealth of clinical information to sift through, and you will need to carefully link this clinical information to your germline and mutation data via, preferably, a UUID (universally unique ID). There are many types of IDs, though, and you need to be sure of the exact one you're looking at.

Great care has to be taken because what may look like a unique identifier for a sample may not be unique and may refer to, for example, a paired tumour and normal sample from the same patient. You wouldn't want to mix those up! Please read more here:

TCGA barcodes: https://wiki.nci.nih.gov/display/TCGA/TCGA+barcode (pay close attention to the exact format of the barcode and how it allows us to distinguish a normal from tumour sample)
UUIDs: https://wiki.nci.nih.gov/display/TCGA/Universally+Unique+Identifier
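As a small illustration of how the barcode distinguishes tumour from normal, here is a Python sketch. It assumes the standard barcode layout described on the wiki page above, in which the first two digits of the fourth hyphen-separated field are the sample-type code (01 to 09 tumour, 10 to 19 normal, 20 to 29 control). The function name and returned keys are my own, purely for illustration.

```python
def parse_tcga_barcode(barcode):
    """Split a TCGA barcode and classify the sample as tumour/normal/control."""
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA":
        raise ValueError(f"not a sample-level TCGA barcode: {barcode!r}")
    type_code = parts[3][:2]  # e.g. '01' from '01C'
    code = int(type_code)
    if 1 <= code <= 9:
        sample_class = "tumour"
    elif 10 <= code <= 19:
        sample_class = "normal"
    elif 20 <= code <= 29:
        sample_class = "control"
    else:
        sample_class = "unknown"
    return {
        "participant": "-".join(parts[:3]),  # shared by paired tumour/normal
        "sample_type_code": type_code,
        "sample_class": sample_class,
    }


# The same participant, once as tumour and once as matched normal.
print(parse_tcga_barcode("TCGA-02-0001-01C-01D-0182-01"))
print(parse_tcga_barcode("TCGA-02-0001-10A-01D-0182-01"))
```

Both barcodes share the participant prefix `TCGA-02-0001`, which is exactly why a truncated ID cannot tell a tumour from its matched normal.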

If you end up with an ID and don't know to what it refers, you can look it up at the GDC Data Portal: https://portal.gdc.cancer.gov/exploration

On the other points, it's difficult for anyone to give specific advice for the following reasons:

We don't know whether you're going to receive the open-access (already processed) or restricted (raw) data. You're partly interested in mutation data, and the processed version of this data direct from the TCGA is in MAF (Mutation Annotation Format; see https://wiki.nci.nih.gov/display/TCGA/Mutation+Annotation+Format+(MAF)+Specification ). It gets more complicated because the same sample is sometimes processed at 2 or more different centers, so you'll find it duplicated. Different centers also used different somatic variant callers, and some believe that these datasets should therefore not be analysed together. If you're getting the restricted data, you may have access to the more traditional VCF files, and possibly even the aligned BAM files from before variant calling, which would make your work a lot easier from my perspective.
We don't know what your hypotheses are, and the choice of statistical test(s) will depend on them.
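For the duplication issue in the first point, a quick check along these lines may help. This is a sketch, assuming a tab-delimited MAF with the standard `Tumor_Sample_Barcode` and `Center` columns and optional comment lines starting with `#`. The function names and the example rows are made up.

```python
import csv
import io
from collections import defaultdict


def centers_per_sample(maf_handle):
    """Map each tumour sample (first four barcode fields) to its calling centers."""
    centers = defaultdict(set)
    # Skip comment lines such as a leading '#version' line.
    rows = (line for line in maf_handle if not line.startswith("#"))
    for row in csv.DictReader(rows, delimiter="\t"):
        sample = "-".join(row["Tumor_Sample_Barcode"].split("-")[:4])
        centers[sample].add(row["Center"])
    return centers


def multi_center_samples(centers):
    """Return only the samples reported by more than one center."""
    return {s: sorted(c) for s, c in centers.items() if len(c) > 1}


# Made-up MAF fragment: the first sample appears under two centers.
example_maf = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tCenter\tTumor_Sample_Barcode\n"
    "TP53\tcenterA\tTCGA-A1-A0SB-01A-11D-A142-09\n"
    "PIK3CA\tcenterB\tTCGA-A1-A0SB-01A-11D-A142-08\n"
    "GATA3\tcenterA\tTCGA-A1-A0SD-01A-11D-A10Y-09\n"
)
print(multi_center_samples(centers_per_sample(example_maf)))
```

Samples flagged this way are the ones to resolve (for example by keeping a single center's calls) before pooling mutation data.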

That's all that I can think of right now, but I think that I've covered some of the critical issues in simply managing and understanding the TCGA data once you get it.