Dear Friends,
I am new to TCGA data analysis. I would really appreciate your suggestions on these questions:
a) From TCGA VCF files, I would like to generate Manhattan plots and QQ plots to detect the association of SNPs with traits. I know that generating a Manhattan plot requires the following information:
CHR: chromosome (aliases chr, chromosome)
BP: nucleotide location (aliases bp, pos, position)
SNP: SNP identifier (aliases snp, rs, rsid, rsnum, id, marker, markername)
P: p-value for the association (aliases p, pval, p-value, pvalue, p.value)
"CHR", "BP", and "SNP" are in the VCF files, so where do I get the "P" value from?
And for QQ plots, where do I get the observed and expected p-values?
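On the QQ plot part: the observed p-values are simply the association p-values sorted from smallest to largest, and the expected p-values are the quantiles you would see if every SNP followed the null hypothesis (uniformly distributed p-values). A minimal sketch, assuming the common (rank - 0.5) / n convention for the expected quantiles; the function name and the example p-values are made up for illustration, and a p-value of exactly 0 would need special handling before taking the logarithm:

```python
import math


def qq_points(pvalues):
    """Return (expected, observed) pairs of -log10 p-values for a QQ plot.

    Observed values are the sorted input p-values. Expected values are
    uniform quantiles under the null, using the (rank - 0.5) / n convention
    (an assumption; other plotting-position conventions exist).
    """
    observed = sorted(pvalues)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = (rank - 0.5) / n
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Hypothetical p-values, only to show the shape of the output.
example_pvalues = [0.5, 0.01, 0.2, 0.9]
for expected_logp, observed_logp in qq_points(example_pvalues):
    print(f"{expected_logp:.3f}\t{observed_logp:.3f}")
```

Plotting the first column on the x-axis against the second on the y-axis gives the QQ plot; points far above the diagonal are SNPs with smaller p-values than expected by chance.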
b) What type of plot would best present the number of variants for each tumor in each cancer type in the VCF files? Could you please let me know where to find the tumor and cancer type information for the SNPs in the VCF files?
Thank you very much! DK
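For question (b), whatever plot is chosen needs a per-sample variant count first. A minimal sketch of that tally, assuming an uncompressed, tab-delimited VCF with a GT field in the FORMAT column; the function name and the example records are hypothetical, and the mapping from sample to cancer type is assumed to come from separate metadata rather than from the VCF itself:

```python
from collections import Counter


def count_variants_per_sample(vcf_lines):
    """Count, per sample column, the records with a non-reference genotype."""
    counts = Counter()
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names start at the 10th column
            continue
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue  # no genotype information in this record
        gt_index = format_keys.index("GT")
        for sample, value in zip(samples, fields[9:]):
            genotype = value.split(":")[gt_index]
            alleles = genotype.replace("|", "/").split("/")
            # Count the record if any allele is neither reference nor missing.
            if any(allele not in ("0", ".") for allele in alleles):
                counts[sample] += 1
    return counts


# Hypothetical records, only to illustrate the input layout.
example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\t0/0",
    "1\t200\trs2\tC\tT\t.\tPASS\t.\tGT:DP\t1/1:12\t0|1:8",
    "2\t300\trs3\tG\tA\t.\tPASS\t.\tGT\t./.\t0/0",
]
print(dict(count_variants_per_sample(example_vcf)))
```

The resulting counts can then be grouped by cancer type using whatever sample annotation is available.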
As far as I know, TCGA does not calculate association p-values, although there may be independent resources where that information is available.
Thanks! Can you let me know where the p-values can be obtained for each SNP?
Igor was just saying that such data may exist... somewhere. I have never seen such data, but it could exist. One resource that may have something similar is cBioPortal.
May I ask what you are trying to do? Manhattan plots were mainly used for GWAS, not cancer data. Of course, they can be used to plot anything. I believe that we have already identified the mutational landscape of tumours (?)
Thanks! Yes, I am doing a GWAS study, and I have VCF files with which to perform the above-mentioned analyses. I am now trying to produce a Manhattan plot and a QQ plot to detect the association of SNPs with the traits. Since I am lacking p-values for the SNPs, I am not able to plot them. Please let me know if I am being clear, and whether you know how one can proceed with these plots.
So, you need to know how to perform an association test from the VCF stage? What I would do is convert the data into plink format, and then do the association testing there. I have done this many times in the past, in fact.
Another program, SnpSift CaseControl, can perform the testing and encode the p-values within your VCF, which may be easier for you.
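To make clear where a per-SNP p-value comes from, here is a minimal sketch of a basic allelic case/control test: a 1-degree-of-freedom chi-square test on a 2x2 table of allele counts. This is a standalone illustration, not the code that plink or SnpSift runs; the function name and the allele counts are hypothetical, and a real analysis should use one of those tools.

```python
import math


def allelic_chi2_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 df) on a 2x2 table of allele counts.

    Rows are cases and controls, columns are alternate and reference
    alleles. Returns (chi2, p_value).
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_totals)
    # A zero marginal means the SNP is uninformative for this comparison.
    if 0 in row_totals or 0 in col_totals:
        return 0.0, 1.0
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts for a single SNP.
chi2, p_value = allelic_chi2_test(case_alt=30, case_ref=70, ctrl_alt=10, ctrl_ref=90)
print(f"chi2={chi2:.2f} p={p_value:.2e}")
```

Repeating a test like this for every SNP yields the P column that the Manhattan and QQ plots need.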
Thanks! So I need to use plink to get the p-values of the SNPs from the VCF files, right? Could you please point me to a resource where I can learn the plink steps for this? I am new to this. Thanks for understanding.
Sure, you just need the --vcf flag. See: How unphased VCF is converted into ped file?
However, when doing this, plink apparently distorts the order of the samples in your VCF. So, you should 'fix' the ordering of your samples from the very first step and then supply a custom FAM file for all analyses. I cannot stress enough how important this is, because otherwise you will be comparing sample groups that do not reflect the actual groupings that you want.
What I said may not make much sense right now, but just go step by step and be 100% certain at each step that what you believe is happening is happening. It's easy to convert any VCF to plink, but not easy to maintain sample groupings.
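One way to be certain at this particular step is to compare the sample order in the VCF header with the order in the FAM file. A minimal sketch, assuming the individual ID (second column of the FAM file) is identical to the VCF sample name, which depends on how the conversion was run; the helper names and the example IDs are hypothetical:

```python
def vcf_sample_ids(vcf_lines):
    """Return the sample names from the #CHROM header line of a VCF."""
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            return line.rstrip("\n").split("\t")[9:]
    raise ValueError("no #CHROM header line found")


def fam_sample_ids(fam_lines):
    """Return the individual IDs (second column) from FAM file lines."""
    return [line.split()[1] for line in fam_lines if line.strip()]


def first_order_mismatch(vcf_ids, fam_ids):
    """Return the index of the first position that differs, or None."""
    for index, (vcf_id, fam_id) in enumerate(zip(vcf_ids, fam_ids)):
        if vcf_id != fam_id:
            return index
    if len(vcf_ids) != len(fam_ids):
        return min(len(vcf_ids), len(fam_ids))
    return None


# Hypothetical header and FAM content, only to show the check.
example_vcf_header = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"]
example_fam = ["S1 S1 0 0 0 1", "S3 S3 0 0 0 2", "S2 S2 0 0 0 1"]

mismatch = first_order_mismatch(
    vcf_sample_ids(example_vcf_header), fam_sample_ids(example_fam)
)
if mismatch is None:
    print("sample order matches")
else:
    print(f"sample order differs starting at position {mismatch}")
```

With real files, the same functions can be fed open file handles instead of the in-memory lists.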
See here: linkage disequilibrium analysis