MAF (minor allele frequency) calculation for finding rare variants from TCGA
3
2
7.6 years ago
imagineyd ▴ 70

Hi all,

I want to collect rare variants (MAF below 1% in the population) from TCGA data sets.

For most variants I can find the MAF in dbSNP or on the 1000 Genomes site, but for some I couldn't find any MAF value.

For example, for "rs80357604" there is no MAF information in dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=80357604).

Does this mean these variants are so rare that a MAF can't be calculated, or is there some technical problem?

Thanks so much :)

MAF TCGA snp rare-variant • 7.5k views
2
7.6 years ago

Update: I just realized you're working with genotype data from TCGA SNP arrays, and not the TCGA somatic mutation calls from exome-seq. But the answer below is still relevant, so I'll leave it as is.

TCGA reports only somatic variants seen in tumors, by strictly subtracting germline variants seen in a matched normal control. They have a strict policy of not reporting germline variants in their publicly downloadable mutation annotation files (aka MAFs). Read more about them here. If any TCGA somatic variant has a dbSNP ID, then it is likely to be a somatic mutation inadvertently submitted to dbSNP. Or it may be a germline variant incorrectly called as somatic because of reasons like poor coverage, allele specific amplification, paralog misalignments, etc.

Notice how the submission report for rs80357604 is poorly annotated, but it does mention that it came from clinical sequencing... which means the source tissue could very well have been a tumor.

For a given variant list, you can use Ensembl's VEP to generate MAFs (minor allele frequencies) based on 1000genomes and NHLBI EVS. To run VEP on a TCGA MAF (mutation annotation format) file, look up the maf2maf.pl script available here.

1

Thanks so much :)

Actually, I'm analyzing exome-seq data from normal samples, so these are germline variants.

In my analysis, calling rare variants is very important, so minor allele frequency (MAF) is a key issue.

So I'm curious why so many germline variants from exome-seq data don't have a MAF value.

In that case, how can I select rare variants from exome-seq data? Could you give me some suggestions?

Thanks so much :)

1

Very cool! Not many people have access to consistently generated germline calls from TCGA exome-seq. I know several labs that generated these in-house, including the 3 TCGA GSCs, but NIH policies make it hard to share these variant lists with the public.

If you have many germline calls that are not seen in VCFs from 1000g or NHLBI EVS, then they are either false-positive calls (sequencing artifacts, misaligned paralogs, etc.) or they are truly rare germline variants. According to this paper, every newborn has ~70 de novo variants (fewer in exomes, of course). So even with their 6500+ exomes, NHLBI EVS doesn't have the power to detect all the world's rarest variants.

To rule out false-positives, you can try this tool that applies filters described by GATK or VarScan2. Also see the germline variant calling and filtering method section in this paper where we did germline calling across TCGA exomes.

1
7.6 years ago

IMHO, the best current place to collect population data is the 1000genomes project. In fact, it is the main population source for dbSNP, which is "just" a variation repository that links to population data where available, and not the other way round. 1000genomes was indeed conceived to detect rare variants, so I would suggest playing around with the latest 1000genomes files: you can extract variants with MAF below 1% by filtering on the AF (allele frequency) tag, for instance with bcftools:

# Keep sites whose alternate allele frequency (AF) implies MAF below 1%
for file in ALL.chr*.phase3_shapeit2_mvncall_integrated_v5.20130502.genotypes.vcf.gz; do
    bcftools view -i 'AF<0.01 || AF>0.99' "$file" > "${file/.vcf.gz/.maf0.01.vcf}"
done


Once you have this data, you can intersect it with the TCGA data to get the subset of interest. Keep in mind that TCGA is a biased dataset, since it focuses on cancer genomes, and a generic dataset such as 1000genomes could be missing many interesting findings from TCGA. Never forget what your hypothesis is, what data is available, and how much power that data has to test your hypothesis.
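The intersection step could be sketched roughly as follows. This is only an illustration, not a tested pipeline: the function name `rare_overlap` and the file names in the usage note are made up, and it assumes both inputs are uncompressed, biallelic VCFs that use the same chromosome naming and the same REF/ALT representation (multi-allelic sites would need to be split first, e.g. with `bcftools norm -m -any`).

```shell
# Hypothetical helper: print records from your own VCF whose CHROM:POS:REF:ALT
# also appears in a rare-variant VCF (e.g. one of the .maf0.01.vcf files).
# Assumes uncompressed, biallelic VCFs with matching chromosome names.
rare_overlap() {
    # $1 = rare-variant VCF, $2 = your own calls
    awk -F'\t' '
        /^#/ { next }                                   # skip header lines
        { key = $1 ":" $2 ":" $4 ":" $5 }               # CHROM:POS:REF:ALT
        FILENAME == ARGV[1] { rare[key] = 1; next }     # first file: remember keys
        key in rare                                     # second file: print matches
    ' "$1" "$2"
}
```

For example, `rare_overlap chr17.maf0.01.vcf my_calls.vcf > my_calls.rare.vcf` (both file names are placeholders). Matching on position and alleles rather than on rsID avoids missing variants that have no dbSNP ID.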

0

1000genomes was based on low-pass whole genome sequencing (2x-4x read depth per sample), which does not have the power to detect very rare germline events. If you're primarily interested in just coding regions, then you should be using minor allele frequencies from NHLBI's Exome Variant Server - where they did germline calling across exome-seq of 6500+ samples from mostly healthy individuals.

No need to do a VCF intersect. Ensembl's VEP generates 1000genomes and NHLBI EVS MAFs (minor allele frequencies) for a given variant list. To run VEP on a TCGA MAF (mutation annotation format) file, look up the maf2maf.pl script available here.

0
5.3 years ago
aliexs618 • 0

Thanks for the discussion. I also couldn't find the MAF (1000G_AF) for some novel SNPs in our study (such as rs112951749, rs201758122, rs200011964, rs369300887, etc.) or for variants without an SNP ID.

However, I can find their EXAC_AF (Exome Aggregation Consortium allele frequency), and it looks to me like the two are highly related. Can anyone explain how to compare MAF and EXAC_AF?

Thanks a lot

Sam
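
On comparing the two: fields such as 1000G_AF and EXAC_AF are usually reported as the frequency of the alternate allele, while MAF is the frequency of whichever allele is less common. For a biallelic site that means MAF = min(AF, 1 - AF), which is also what the `AF<0.01 || AF>0.99` filter earlier in this thread relies on. A minimal sketch of that conversion, assuming a biallelic site and a plain decimal AF value (the helper names `maf_from_af` and `is_rare` are made up for illustration):

```shell
# Hypothetical helpers; assume a biallelic site and AF given as a decimal.
maf_from_af() {
    # $1 = alternate allele frequency; prints min(AF, 1 - AF)
    awk -v af="$1" 'BEGIN { maf = (af > 0.5) ? 1 - af : af; printf "%.6g\n", maf }'
}

is_rare() {
    # $1 = alternate allele frequency, $2 = MAF threshold (default 0.01)
    # Succeeds (exit status 0) when the implied MAF is below the threshold.
    awk -v af="$1" -v t="${2:-0.01}" \
        'BEGIN { maf = (af > 0.5) ? 1 - af : af; exit !(maf < t) }'
}
```

For example, `is_rare 0.995 && echo rare` treats a site where the alternate allele is the common one as rare. The two resources are computed on different cohorts, so their values for the same variant need not match exactly.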