I am trying to generate a site frequency spectrum from the TCGA database. However, I am having some trouble: according to my algorithm, every single allele from the cancerous tissue matches that of the non-cancerous tissue (I am only looking at homozygous portions of the data). I suspect this is because there is an error in my method.
In my current implementation, I am assuming that the VCF files from the TCGA database correspond only to non-cancerous tissue. Here is the specific VCF file I am using. In particular, I am trying to generate a site frequency spectrum for case 001cef41-ff86-4d3f-a140-a647ac4b10a1 in the TCGA breast cancer database.
I was working on this project a while ago and have since forgotten why I thought all of the VCF files referred to non-cancerous tissue. Furthermore, I do not know how to verify whether or not this is the case.
My questions are:
- Do the VCF files in this database correspond to non-cancerous tissue?
- How can I identify which files correspond to non-cancerous tissue? I am currently only able to do this for the BAMs. Info for BAMs
- If I am wrong about the VCF files corresponding to non-cancerous tissue, how should I proceed to generate the site frequency spectrum?
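
To make the second question concrete, here is a rough sketch of the kind of check I have in mind. It assumes (I have not confirmed this for my files) that the sample columns of the VCF are named with TCGA barcodes, in which case the two-digit sample-type code in the fourth barcode field should distinguish tumor (01-09) from normal (10-19) samples. The function names and the barcode in the usage line are hypothetical, and if the columns carry generic labels instead of barcodes, this simply reports "unknown".

```python
import gzip


def vcf_sample_names(path):
    """Return the sample column names from the #CHROM header line of a VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first 9 columns are fixed (CHROM ... FORMAT); samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached data lines without finding a header
    return []


def classify_sample(name):
    """Guess tumor/normal from a TCGA barcode; 'unknown' if it is not one."""
    fields = name.split("-")
    if len(fields) < 4 or fields[0] != "TCGA" or not fields[3][:2].isdigit():
        return "unknown"  # assumption: non-barcode names cannot be classified here
    code = int(fields[3][:2])
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


def classify_vcf_samples(path):
    """Map each sample column in the VCF to its guessed tissue type."""
    return {name: classify_sample(name) for name in vcf_sample_names(path)}
```

For example, `classify_sample("TCGA-XX-0000-10A-01D-0000-01")` (a made-up barcode) would be treated as a normal sample, while a column simply called `SAMPLE1` would come back as "unknown".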