I would like some advice on an analysis I am planning with my exome data. I have sequenced tumors from two patients with a specific cancer, along with peripheral blood as the matched control for each patient. I also have exome sequence from tumor iPSCs (we reprogrammed the tumor lines to iPSCs and then sequenced them). We do not have control exome data from normal iPSCs (we did not reprogram normal fibroblasts to generate normal iPSCs as a control for the tumor iPSCs). So the somatic variants for the iPSCs are called from the pair of normal peripheral blood exome and tumor-derived iPSC exome. For each patient I therefore have 4 exome-sequenced samples: 1 normal, 1 tumor and 2 iPSC lines.
My idea is to find the mutational landscape that is conserved from the tumor to its reprogrammed clones. We are not considering dosage effects or the number of passages at which the reprogramming was done, so reprogramming may well give some mutations a selective advantage, and those may come to dominate the iPSC clone. We know that the tumor is polyclonal and that an iPSC line is a single clone, so the iPSC should contain the mutations that are present at the highest frequency across the tumor clones (leaving aside selective advantage and other mutations acquired during reprogramming). Even so, I expect some mutations to pass from the tumor to the iPSC, take precedence there, and show an elevated frequency.
To test this, I used established variant callers to extract somatic variants from my samples and checked to what extent these somatic variants are conserved in the tumor iPSCs. The overlap was not convincing: roughly 44%. Now I want to check these variants against all somatic mutations that I can obtain from TCGA for all tumor types.
I have not worked much with TCGA MAF files, but after reading posts and websites I gather that there is no single comprehensive mutation file that catalogs somatic mutations for all cancer types; the files exist per cancer type. Since my somatic variants do not come from a large cohort, I would like to see whether they are significantly observed as cancer-related mutations across all cancer types, and confirm that I did not obtain them by chance. This would assure me that the mutational burden of the iPSCs, even if it is not an exact mimic of the tumor, is still relevant and tumorigenic. It would give me a first-hand validation of my variants. My question is: how do I obtain a mutation file with somatic mutations across most cancer types, including genomic loci, gene names and read statistics, against which I can interrogate my variant data? Can this be achieved? Or should I take the MAF files for each tumor type separately and interrogate my somatic variants against each of them? That is what I want to achieve for now. I would appreciate input from people here, and any other ideas as well. Which data should I be consulting for this? I am fairly sure it should be the MAF files, but I am a bit lost within the TCGA consortium data. Any leads?
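To make the per-tumor-type lookup concrete, here is a minimal Python sketch of what I have in mind. It assumes the standard MAF column names (Chromosome, Start_Position, Reference_Allele, Tumor_Seq_Allele2, Tumor_Sample_Barcode) and that my calls and the MAF are on the same genome build; the gene names, sample barcodes and coordinates in the demo are made up for illustration.

```python
import csv
import io

def norm_chrom(chrom):
    # Treat "chr1" and "1" as the same contig name.
    return chrom[3:] if chrom.startswith("chr") else chrom

def index_maf(handle):
    """Map (chrom, start, ref, alt) -> set of sample barcodes carrying it."""
    # MAF files may begin with '#' comment lines before the header row.
    rows = (line for line in handle if not line.startswith("#"))
    index = {}
    for rec in csv.DictReader(rows, delimiter="\t"):
        key = (norm_chrom(rec["Chromosome"]), int(rec["Start_Position"]),
               rec["Reference_Allele"], rec["Tumor_Seq_Allele2"])
        index.setdefault(key, set()).add(rec["Tumor_Sample_Barcode"])
    return index

def recurrence(my_variants, index):
    """Number of distinct MAF samples carrying each of my variants."""
    return {v: len(index.get(v, ())) for v in my_variants}

# Hypothetical in-memory MAF; in practice, open one file per tumor type.
demo_maf = io.StringIO(
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tReference_Allele\t"
    "Tumor_Seq_Allele2\tTumor_Sample_Barcode\n"
    "GENE_A\tchr1\t100\tC\tT\tSAMPLE-01\n"
    "GENE_A\t1\t100\tC\tT\tSAMPLE-02\n"
    "GENE_B\t2\t200\tG\tA\tSAMPLE-01\n"
)
my_variants = [("1", 100, "C", "T"), ("3", 300, "A", "G")]
hits = recurrence(my_variants, index_maf(demo_maf))
print(hits)
```

Running index_maf over each tumor type's MAF and merging the resulting dictionaries would give a pan-cancer recurrence count per variant. Whether a given count is more than expected by chance would still need a proper background model, which this sketch does not attempt.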
Thanks and Regards
VD
@Cyriac Kandoth
Thanks a lot for the detailed note on the analysis. I have already used VarScan, MuTect and GATK. GATK did not work out the way I expected, since there is a lot of heterogeneity in my tumor samples. I believe that because my tumor samples contain subclonal mutations there is a lot of noise, and that these mutations then reach a higher cell fraction in the iPSCs. Is there any way to extract them? MuTect found many more mutations than VarScan, but the level of sharing between tumor and iPSCs does not change much. My data is not very deep by the standards of recent exome experiments: the normal and tumor samples are sequenced at 70X and the iPSCs at 35X. We expected this coverage to be enough to extract the mutations and the shared mutational events, but since the subclonal mutations seem to be taking precedence, I would now like to see whether I can get deeper sequencing done on my samples. The cohort is also small: just 2 tumors with their matched normals and 2 iPSC lines per tumor, and the two tumors are of different grades. I will try the callers you advised, see the effect, and also do the matching against TCGA. Thank you for the suggestion.
Regards,
VD
MuTect/VarScan skip calling mutations with insufficient supporting reads. So for all somatic mutations detected in the iPSC, try using tools like samtools mpileup or bam-readcount to find at least a few reads that support the same variant in the tumor sample. Even one or two reads supporting the same variant can be sufficient evidence, provided you can safely rule out that it is germline or a recurrent artifact. You can try fpfilter, a script that runs bam-readcount to collect evidence for or against a given list of variants.
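As a rough illustration of that check, the Python sketch below reads bam-readcount output produced on the tumor BAM and reports how many reads support each iPSC variant allele. It assumes the usual bam-readcount layout (tab-separated chromosome, position, reference base and depth, followed by colon-separated per-base fields that begin with base:count); the function names, the min_reads threshold and the demo lines are hypothetical, and the demo lines are truncated to the first three sub-fields.

```python
def allele_counts(line):
    """Parse one bam-readcount line into (chrom, pos, ref, depth, {base: count})."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, ref, depth = fields[0], int(fields[1]), fields[2], int(fields[3])
    counts = {}
    for entry in fields[4:]:
        parts = entry.split(":")
        counts[parts[0]] = int(parts[1])  # only base and read count are used
    return chrom, pos, ref, depth, counts

def tumor_support(readcount_lines, ipsc_variants, min_reads=1):
    """For each iPSC variant (chrom, pos, alt), report tumor alt reads and depth."""
    # Simplification: one alternate allele per position.
    wanted = {(c, p): alt for c, p, alt in ipsc_variants}
    support = {}
    for line in readcount_lines:
        if not line.strip():
            continue
        chrom, pos, _ref, depth, counts = allele_counts(line)
        alt = wanted.get((chrom, pos))
        if alt is not None:
            n = counts.get(alt, 0)
            support[(chrom, pos, alt)] = (n, depth, n >= min_reads)
    return support

# Made-up tumor read counts at two iPSC variant sites.
demo_lines = [
    "1\t100\tC\t70\t=:0:0.00\tA:0:0.00\tC:68:60.00\tG:0:0.00\tT:2:60.00\tN:0:0.00",
    "2\t200\tG\t65\t=:0:0.00\tA:0:0.00\tC:0:0.00\tG:65:60.00\tT:0:0.00\tN:0:0.00",
]
ipsc_variants = [("1", 100, "T"), ("2", 200, "A")]
support = tumor_support(demo_lines, ipsc_variants)
for key, (alt_reads, depth, supported) in support.items():
    print(key, alt_reads, depth, supported)
```

A low threshold such as one or two reads only makes sense after the germline and recurrent-artifact checks mentioned above; running the same count on the matched normal is one way to spot germline variants.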
@Cyriac Kandoth
I have a problem with the fpfilter output. I can filter it into a tab-delimited list of high-confidence SNVs, but I cannot convert it to VCF with vcf-annotate. Do I have to provide the description text and annotation text to get the desired VCF from the fpfilter output? Is it necessary to pass them at all? I believed it should convert directly from the fpfilter output file. I am sorry to ask in this thread, but this is the immediate downstream analysis of the same samples, and I could not find any help elsewhere. Below are the command I am using and the error. Thanks.
Error: