TCGA --- MAF files: no values showing for case_id column
22 months ago

Hi,

I have downloaded some MAF files from the GDC using the GDC client, and I need the case_id values so that I can merge each MAF file with another file that also contains case_id values. The GDC documentation says case_id is column 116, but when I try to extract or view that column, I see the header and nothing else. Any advice?

TCGA cancer WXS GDC GDC-client • 535 views
21 months ago

Instead, please use the following columns for the purpose of identifying samples:

• Tumor_Sample_Barcode (#16)
• Matched_Norm_Sample_Barcode (#17)
• Tumor_Sample_UUID (#33)
• Matched_Norm_Sample_UUID (#34)
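
Selecting these columns by header name avoids depending on column numbers. A minimal Python sketch using only the standard library; the helper name `read_maf_rows` and the example row (with placeholder barcodes) are made up for illustration, and it assumes a tab-delimited MAF whose comment lines start with `#`:

```python
import csv
import io


def read_maf_rows(handle):
    """Yield one dict per mutation row, skipping '#' comment lines."""
    data_lines = (line for line in handle if not line.startswith("#"))
    # QUOTE_NONE: treat quote characters in annotation fields as plain text
    yield from csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)


# Made-up two-line example standing in for a real MAF file
example = io.StringIO(
    "#example comment line\n"
    "Hugo_Symbol\tTumor_Sample_Barcode\tMatched_Norm_Sample_Barcode\n"
    "GENE1\tTCGA-XX-0001-01A-11D-0000-08\tTCGA-XX-0001-10A-01D-0000-08\n"
)

for row in read_maf_rows(example):
    print(row["Hugo_Symbol"], row["Tumor_Sample_Barcode"])
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`.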

Kevin


Hi Kevin,

I am trying to match the MAF files to the associated clinical data files so that I can attach the gender information to the MAF records. Unfortunately, the clinical data files do not have any of the columns you listed; they have only case_id and submitter_id. Do you know if I can find either files with the MAF information plus the gender, or a file that contains the gender along with the case or submitter IDs?

Thanks, Sam


Can you please confirm the format of the 'submitter_id'? - these are short TCGA barcodes, right (like this: TCGA-E9-A1NA-11A)? In this case, you should be able to match these to the MAF files.


Here is an example of the submitter_id: "TCGA-4Z-AA7N". I can't find a matching column in the MAF file.

Thanks, Sam


Are the Tumor_Sample_Barcode and Matched_Norm_Sample_Barcode values not similar to this format?


It seems they would only partially match, and since I am trying to use the Linux 'join' command to merge the files on matching columns, I am not sure that would be sufficient.

Thanks, Sam


You can use both of these 'barcodes' to match and to achieve what you want. The longer barcodes start with the same case-level prefix and just append extra information (sample, portion, and so on) that is not required in this situation. Take a look: Meaning letters in TCGA sample barcode field

You do not have to use join to achieve what you want.
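
For example, assuming the usual TCGA barcode layout in which the first three dash-separated fields identify the case, a short Python sketch can cut a long barcode down to the submitter_id form. The function name `case_submitter_id` is made up for illustration:

```python
def case_submitter_id(barcode):
    """Return the first three dash-separated fields of a TCGA barcode."""
    return "-".join(barcode.split("-")[:3])


# Barcodes quoted earlier in this thread
print(case_submitter_id("TCGA-E9-A1NA-11A"))  # TCGA-E9-A1NA
print(case_submitter_id("TCGA-4Z-AA7N"))      # already short, unchanged
```

Applying this to Tumor_Sample_Barcode gives a key that can be compared exactly with the submitter_id in the clinical file, so partial matching is not needed.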


Kevin,

Thanks for your help so far. Without 'join', how could I merge the two files to get all the information matched in one file for analysis? The files have tens of thousands of lines, so I can't match them manually. I am new to this, and I would really appreciate some guidance on how to accomplish that without using join.

When I sort by your suggested columns, the Tumor_Sample_Barcode column ends up with a lot of lines marked as "somatic", so those can't be matched either.


Yes, do not worry. Please tell me the ultimate aim of your work so that I can understand the desired end-format?

The clinical data that you have has, I believe, 1 row per sample, whereas the MAF file has hundreds or thousands of rows per sample (one row per somatic mutation). Do you want a final table that has the same number of rows as the MAF file but with extra columns for the clinical data?
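
If that is the table you want, one way to build it is a dictionary lookup rather than join. This is only a sketch: the function name `add_gender`, the clinical column names `submitter_id` and `gender`, and the example rows (placeholder IDs and values) are assumptions, so check them against the headers of your actual files.

```python
def add_gender(maf_rows, clinical_rows, id_col="submitter_id", gender_col="gender"):
    """Yield each MAF row with a 'gender' key looked up from the clinical rows."""
    # One entry per case: submitter_id -> gender
    gender_by_case = {row[id_col]: row[gender_col] for row in clinical_rows}
    for row in maf_rows:
        # First three barcode fields are the case-level ID
        case = "-".join(row["Tumor_Sample_Barcode"].split("-")[:3])
        # Cases absent from the clinical table are marked "NA"
        yield {**row, "gender": gender_by_case.get(case, "NA")}


# Made-up rows with placeholder IDs
clinical = [{"submitter_id": "TCGA-XX-0001", "gender": "female"}]
maf = [
    {"Hugo_Symbol": "GENE1", "Tumor_Sample_Barcode": "TCGA-XX-0001-01A-11D-0000-08"},
    {"Hugo_Symbol": "GENE2", "Tumor_Sample_Barcode": "TCGA-XX-0002-01A-11D-0000-08"},
]

for row in add_gender(maf, clinical):
    print(row["Hugo_Symbol"], row["gender"])
```

The output keeps one row per mutation, as in the MAF file, and every mutation from the same case receives the same clinical value.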

What are you comfortable using: Python, R, Java, or shell scripting?


Hi Kevin,

Thanks for all your help so far. Roughly, I am trying to figure out what proportion of the mutations is associated with male versus female patients, so ideally I want a final table that has the same number of rows as the MAF file but with extra columns for the clinical data (mainly the gender). I am comfortable with Python and shell scripting.

I am also considering re-downloading the files from the GDC site, filtered by gender, using the GDC client on the command line. Do you think this would be an efficient way of doing it?

Thanks, Sam


Hey Sam, re-downloading it based on sex/gender may indeed be the easiest option in this case.
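
If you would rather work with the files you already have, the tally itself is short once each mutation row carries a gender value. A sketch in Python, assuming the rows are dictionaries with a `gender` key; the function name `gender_proportions` and the example rows are hypothetical:

```python
from collections import Counter


def gender_proportions(rows, gender_col="gender"):
    """Return the fraction of mutation rows observed for each gender value."""
    counts = Counter(row[gender_col] for row in rows)
    total = sum(counts.values())
    # An empty input gives an empty dict, so no division by zero occurs
    return {gender: n / total for gender, n in counts.items()}


# Made-up annotated rows, one per mutation
annotated = [
    {"Hugo_Symbol": "GENE1", "gender": "male"},
    {"Hugo_Symbol": "GENE2", "gender": "male"},
    {"Hugo_Symbol": "GENE3", "gender": "male"},
    {"Hugo_Symbol": "GENE4", "gender": "female"},
]

print(gender_proportions(annotated))
```

Note that these are proportions of mutation rows, so a few heavily mutated samples can dominate the result; counting per case first may be worth considering.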


Thank you very much for taking the time to help me.

Sam