Question: How to identify non-cancerous files in TCGA database and how to generate SFS
I am trying to generate a site frequency spectrum (SFS) from the TCGA database. However, I am having some trouble: according to my algorithm, every single allele from the cancerous tissue matches that of the non-cancerous tissue (I am only looking at homozygous portions of the data). I think this is because there is an error in my method.

In my current implementation, I am assuming that the VCF files from the TCGA database correspond only to non-cancerous tissue. Here is the specific VCF file I am using. In particular, I am trying to generate a site frequency spectrum for case 001cef41-ff86-4d3f-a140-a647ac4b10a1 in the TCGA breast cancer database.

I was working on this project a while ago and have since forgotten why I thought all of the VCF files referred to non-cancerous tissue. Furthermore, I do not know how to verify whether or not this is the case.

My questions are:

  1. Do the VCF files in this database correspond to non-cancerous tissue?
  2. How can I identify which files correspond to non-cancerous tissue? I am currently only able to do this for the BAMs (Info for BAMs).
  3. If I am wrong about VCF files being non-cancerous, how should I proceed to generate the site frequency spectrum?
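
To make clear what I mean by a site frequency spectrum, here is a minimal Python sketch of the computation. It assumes the genotypes have already been parsed out of the VCF as diploid strings such as `0/1`, and that the reference allele can stand in for the ancestral allele, so the result is an unfolded spectrum of alternate-allele counts. The function names `alt_allele_count` and `site_frequency_spectrum` and the input layout are hypothetical, chosen only for illustration; this is not my actual pipeline and not anything TCGA provides.

```python
def alt_allele_count(genotype):
    """Count non-reference alleles in a genotype string such as '0/1' or '1|1'.

    Returns None when any allele is missing (e.g. './.').
    """
    alleles = genotype.replace("|", "/").split("/")
    if any(a == "." for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def site_frequency_spectrum(sites, n_samples):
    """Build an unfolded SFS from per-site genotype lists.

    sites: list of lists, one genotype string per sample at each site
           (hypothetical layout, assumed already parsed from the VCF).
    n_samples: number of diploid samples expected at every site.

    Returns a list sfs where sfs[k] is the number of sites at which the
    alternate allele is seen k times across all 2 * n_samples chromosomes.
    Sites with a missing genotype or the wrong sample count are skipped.
    """
    sfs = [0] * (2 * n_samples + 1)
    for genotypes in sites:
        counts = [alt_allele_count(g) for g in genotypes]
        if len(counts) != n_samples or None in counts:
            continue
        sfs[sum(counts)] += 1
    return sfs


# Toy input: 4 sites, 3 samples (made-up genotypes, not TCGA data).
example_sites = [
    ["0/0", "0/1", "1/1"],  # alternate allele seen 3 times
    ["0/1", "0/0", "0/0"],  # seen once
    ["./.", "0/1", "0/0"],  # skipped: missing genotype
    ["1/1", "1/1", "1/1"],  # seen 6 times (fixed)
]
print(site_frequency_spectrum(example_sites, n_samples=3))
# prints [0, 1, 0, 1, 0, 0, 1]
```

The open part of my question is what should go into `sites`: which samples in the TCGA VCFs are tumor, which are normal, and whether the calls in those files are the right input for this kind of count at all.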
