Question: Why does the number of MAF files not match the number of cases in Firebrowse?
8 months ago, shoujun.gu (Rockville/MD) wrote:

I found that there are 460 cases in COAD on Firebrowse. But the downloaded Mutation Annotation files contain only 154 MAF files. Why are there significantly fewer MAF files than cases?

The downloaded Raw Mutation Annotation files contain 367 MAF files (still fewer than the number of cases). What is the difference between the Raw Annotation MAFs and the Annotation MAFs? Why are roughly half of the MAF files filtered out of the Annotation MAFs?

Thank you.

Tags: snp, next-gen, genome

Please quote the exact sources of the data that you downloaded. There can be one or more of many reasons for the discrepancies in the numbers. Note that Firebrowse is a 'third party' and is not the NIH: Firebrowse took the data that the NIH produced and then applied its own processing methodologies. Also consider that the numbers may reflect variant calls in both normal and tumour tissues, or somatic variant calls in the tumours with respect to the normals. Replicate tumour and normal samples may also have been combined by some rule.
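To see how tumour, normal, and replicate samples map onto cases, one option is to parse the TCGA barcodes directly. The sketch below relies on the standard barcode layout, in which the first three fields identify the case and the first two digits of the fourth field give the sample type (01-09 tumour, 10-19 normal). The function names and the example barcodes are made up for illustration and do not come from the Firebrowse files.

```python
from collections import defaultdict


def parse_barcode(barcode):
    """Return (case_id, sample_class) for a TCGA sample barcode."""
    parts = barcode.split("-")
    case_id = "-".join(parts[:3])        # project-TSS-participant
    sample_code = int(parts[3][:2])      # "01A" -> 1, "11A" -> 11
    if 1 <= sample_code <= 9:
        sample_class = "tumour"
    elif 10 <= sample_code <= 19:
        sample_class = "normal"
    else:
        sample_class = "other"           # e.g. control samples (20-29)
    return case_id, sample_class


def sample_classes_per_case(barcodes):
    """Group barcodes by case and collect the sample classes seen for each."""
    by_case = defaultdict(set)
    for barcode in barcodes:
        case_id, sample_class = parse_barcode(barcode)
        by_case[case_id].add(sample_class)
    return dict(by_case)


# Hypothetical barcodes: two samples from one case, one sample from another.
example = ["TCGA-AA-0001-01A", "TCGA-AA-0001-10A", "TCGA-AA-0002-01A"]
summary = sample_classes_per_case(example)
print(len(example), "samples from", len(summary), "cases")
```

Several samples can collapse into a single case this way, which is one reason why sample, file, and case counts need not agree.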

Reply written 8 months ago by Kevin Blighe

The Annotation mutation file I downloaded is Mutation_Packager_Calls, with file name: gdac.broadinstitute.org_COAD.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz

The Raw Annotation mutation file I downloaded is Mutation_Packager_Raw_Calls, with file name: gdac.broadinstitute.org_COAD.Mutation_Packager_Raw_Calls.Level_3.2016012800.0.0.tar.gz

Reply written 8 months ago by shoujun.gu

Thanks! What happened with the TCGA data was that, after a certain period of time, they 'froze' the processing of new samples so that they could actually publish the work. Since the publications, many thousands of new samples have been processed by the various TCGA centers, and the data subsequently made available. This is why the TCGA project is still very much ongoing (but I do not know much about the funding picture).

So, the Broad Institute (Firebrowse) continually stayed up to date with all new samples being produced by the TCGA centers. I cannot comment on the naming convention of 'raw MAF', but, in any case, this explains the discrepancy.

There is more information through these links:

If I go to the GDC Data Portal right now, which is the primary 'source' of the TCGA data, I see 4 MAF files that each have >400 cases. Please visit this configured search.

Ultimately, do not get too disheartened by the numbers not agreeing. This always happens with TCGA data. If I need TCGA data, I usually take it from the GDC Data Portal and NOT a third party. The third party providers, I find, do not organise their data very well, and confusion arises a lot.
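When the numbers disagree, it can help to count the files and cases in a downloaded archive yourself. The following is a minimal sketch using only the Python standard library. It assumes that each MAF file inside the tarball has ".maf" in its name and that the file name starts with a TCGA sample barcode. That naming pattern is an assumption and should be checked against the actual archive contents. The function names are hypothetical.

```python
import tarfile


def maf_members(archive_path):
    """List the names of regular files in a .tar.gz that look like MAF files."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return [
            member.name
            for member in tar.getmembers()
            if member.isfile() and ".maf" in member.name.lower()
        ]


def unique_cases(member_names):
    """Collect case IDs (first three barcode fields) from MAF file names.

    Assumes file names begin with a barcode such as TCGA-XX-NNNN-01.
    Names that do not start with "TCGA-" are skipped.
    """
    cases = set()
    for name in member_names:
        base = name.rsplit("/", 1)[-1]       # drop any directory prefix
        if base.startswith("TCGA-"):
            barcode = base.split(".")[0]     # strip extensions such as .maf.txt
            cases.add("-".join(barcode.split("-")[:3]))
    return cases
```

For example, calling maf_members() on gdac.broadinstitute.org_COAD.Mutation_Packager_Calls.Level_3.2016012800.0.0.tar.gz and passing the result to unique_cases() gives a file count and a case count that can be compared with the number of cases shown on the website.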

Reply written 8 months ago by Kevin Blighe

Thank you for your reply!

The reason I looked for the data in Firebrowse is that they provide RNASeq data normalized between cases, which the GDC Data Portal does not have (correct me if I'm wrong).

Reply written 8 months ago by shoujun.gu