I want to run an eQTL analysis on TCGA cancer data. I am currently stuck on creating covariate files for the genotype data: I have TCGA somatic MAF files (downloaded from the GDC Data Portal, documentation here: https://docs.gdc.cancer.gov/Data/File_Formats/MAF_Format/) and want to run EIGENSOFT's smartpca on them. First, however, I have to convert the data to the correct input format (see https://reich.hms.harvard.edu/software/InputFileFormats). It seems that I can create most of the files myself from the MAF, but I get stuck at the SNP file, where I have to specify each SNP's genetic position in centiMorgans.
I therefore looked for tools that create these files automatically and came across PLINK. According to related posts on Biostars (A: TCGA SNP data and TCGA SNP to plink), I need to create a PED and a MAP file and then use the --lfile option to create a PLINK object. However, the MAP file also requires a centiMorgan position. Those posts note that the centiMorgan value can be left out, as it is crucial only for particular tasks.
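To make the MAP step concrete, here is a minimal sketch of how the MAP lines could be built from MAF rows. It assumes, as the Biostars posts suggest, that the genetic distance column can be filled with 0 as a placeholder; whether that is acceptable for smartpca is exactly what I am unsure about. The function name `maf_to_map_lines` and the `chrom:pos:ref:alt` variant ID scheme are hypothetical choices for illustration; the column names are the ones listed in the GDC MAF documentation.

```python
import csv
import io


def maf_to_map_lines(maf_text):
    """Build PLINK MAP lines (chromosome, variant ID, genetic distance, bp position).

    Assumption: the genetic distance is unknown and written as the placeholder 0.
    """
    # GDC MAF files start with comment lines beginning with '#'; skip them.
    data_lines = (line for line in io.StringIO(maf_text) if not line.startswith("#"))
    rows = csv.DictReader(data_lines, delimiter="\t")

    seen = set()
    map_lines = []
    for row in rows:
        chrom = row["Chromosome"]
        if chrom.startswith("chr"):
            chrom = chrom[3:]
        pos = row["Start_Position"]
        # Hypothetical variant ID scheme: chrom:pos:ref:alt
        variant_id = f"{chrom}:{pos}:{row['Reference_Allele']}:{row['Tumor_Seq_Allele2']}"
        # The same variant can occur in several samples; keep one MAP line per variant.
        if variant_id in seen:
            continue
        seen.add(variant_id)
        map_lines.append("\t".join([chrom, variant_id, "0", pos]))
    return map_lines
```

The result would be written one line per variant to a .map file. The matching PED file would still have to be built separately from `Tumor_Sample_Barcode` and the allele columns.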
Right now I am confused about how to proceed:
- Do I need to specify centiMorgans at all when using smartpca?
- Is it reasonable to use PLINK to create input files for smartpca when I have to create the MAP and PED files on my own (which already correspond to the snp and indiv files required by EIGENSOFT/smartpca)?
- Will PLINK calculate the centiMorgan positions of my SNPs even if I do not specify them in the MAP file?
I would be grateful for any advice on how to proceed.