I would like to identify ICGC WGS data for my academic project. Our aim is to check specific locations across the WGS variant calls and (hopefully) prove that the chance of an INDEL is higher at these specific mutation sites.
I know that VCF files are currently available to work with. However, I don't know which information I should extract from them. Could you guide me a little bit? (I know about vcftools.)
For example, I am aiming to produce a data frame with one column per patient and one row per location. However, as they stand, the VCF files each cover a different set of locations.
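To make the table I have in mind concrete, here is a minimal sketch in plain Python. It uses made-up toy records, not real ICGC data. The names `read_calls` and `build_matrix` are my own placeholders, and treating a call as an indel whenever REF and ALT differ in length is a simplifying assumption (it ignores symbolic alleles such as `<DEL>`). Locations missing from a patient's file are filled with `.`, which I think is the part I am stuck on.

```python
import io

def read_calls(vcf_text):
    """Return {(chrom, pos): "INDEL" or "SNV"} for one patient's VCF text."""
    calls = {}
    for line in io.StringIO(vcf_text):
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Assumption: any ALT allele whose length differs from REF is an indel.
        is_indel = any(len(a) != len(ref) for a in alt.split(","))
        calls[(chrom, int(pos))] = "INDEL" if is_indel else "SNV"
    return calls

def build_matrix(per_patient, targets=None):
    """Rows are locations, columns are patients; '.' marks no call."""
    if targets is None:
        # Use the union of all locations seen in any patient.
        targets = set().union(*per_patient.values())
    return {
        loc: {pid: calls.get(loc, ".") for pid, calls in per_patient.items()}
        for loc in sorted(targets)
    }

# Toy records (hypothetical), columns: CHROM POS ID REF ALT
patient_a = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t100\t.\tA\tG\nchr1\t200\t.\tAT\tA\n"
patient_b = "#CHROM\tPOS\tID\tREF\tALT\nchr1\t200\t.\tAT\tA\nchr2\t50\t.\tC\tCTT\n"

per_patient = {"A": read_calls(patient_a), "B": read_calls(patient_b)}
matrix = build_matrix(per_patient)

for loc, row in matrix.items():
    print(loc, row)
```

Passing a fixed set of `targets` instead of the union would restrict the rows to the specific locations we care about.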
Could you point me to methods, or maybe papers, on ways to evaluate that much information? Any ideas? I am feeling overwhelmed by the prospect of having too much information and zero results.
Thank you very much for your help,
PS: I apologise for being so general; my PI insisted on keeping the details secret. SAD :(