Problem matching file names to patient IDs in TCGA
4.2 years ago

Hi all,

I have downloaded the CNV files for a cancer type from the GDC portal.

I also have the clinical data for all patients, but I cannot map the file names to the submitter IDs.

A file name looks something like "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt", while a submitter ID looks something like "TCGA-DJ-A2QA".

Can anybody tell me how to match these two names?

Nazanin

TCGA CNV problem in matching • 3.9k views
0
Entering edit mode

"AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A02_735476.hg18.seg.txt"

I want to map these files to the clinical file that I downloaded previously.

The clinical file contains only the submitter ID and the patient ID.

0
Entering edit mode

The sample names might be inside the text files. Did you check the headers of the files?

0
Entering edit mode

Hi,

No, the header just includes the results, something like this:

"Sample Chromosome Start End Num_Probes Segment_Mean
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406 1 51598 9250000 4679 0.0076
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406 1 9250070 9324990 55 0.5138"

0
Entering edit mode

Do you have the annotations.txt file that comes with the CNV files?

In this file you will find the entity_id, which can also be found in the clinical files.

0
Entering edit mode

Hi, yes, I have also downloaded the annotation file. However, it does not include the names of the CNV files, so I cannot use it for matching. Below are the header and the first rows of the annotation file:

"category   classification  entity_type created_datetime    annotation_id   case_submitter_id   project/project_id  entity_submitter_id id
Alternate sample pipeline   Notification    case    2012-11-13T00:00:00 29ba39af-b266-547a-b2c9-7795eba2e202    TCGA-AB-2822    TCGA-LAML   TCGA-AB-2822    29ba39af-b266-547a-b2c9-7795eba2e202
History of unacceptable prior treatment related to a prior/other malignancy Notification    case    2014-06-16T00:00:00 3d086829-de62-5d08-b848-ce0724188ff0    TCGA-AG-A014    TCGA-READ   TCGA-AG-A014    3d086829-de62-5d08-b848-ce0724188ff0
Center QC failed    CenterNotification  aliquot 2012-07-20T00:00:00 5cf05f41-ce70-58a3-8ecb-6bfaf6264437    TCGA-13-0913    TCGA-OV TCGA-13-0913-02A-01R-1564-13    5cf05f41-ce70-58a3-8ecb-6bfaf6264437
History of unacceptable prior treatment related to a prior/other malignancy Notification    case    2014-06-16T00:00:00 c53f22b1-677b-5528-a438-39d5390e2c68    TCGA-21-1077    TCGA-LUSC   TCGA-21-1077    c53f22b1-677b-5528-a438-39d5390e2c68
"

0
Entering edit mode

Along with your AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt you have an annotation file that contains an entity_id. I think in this case it is 29ba39af-b266-547a-b2c9-7795eba2e202, which should correspond to the case_id in your clinical file.

This still needs to be checked.

0
Entering edit mode

The problem is that I downloaded the CNV files for 507 patients with TCGA2bed. I know that I can look up the patient or submitter ID via GDC, but I cannot do this manually for all 507 cases, so I am looking for a way to find the corresponding patient or submitter IDs automatically.

In other words, I want to find the patient or submitter ID from a file name such as "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt". The annotation file has no column that contains any part of this name.

0
Entering edit mode

Which commands did you use?

0
Entering edit mode

TCGA2bed is a graphical tool in which you can select between annotation and experiment. After selecting the tumor type, you have to select the type of data: CNV, RNASeq, ...

0
Entering edit mode

As I don't know this tool and it's not open source, I can't really help you further. Your CNV files contain sample names, so you can try to extract a list of them.

I also found this R package (https://cran.r-project.org/web/packages/TCGAretriever/TCGAretriever.pdf), which I think lets you query the TCGA database with your list of sample names.

Or you can try to contact the authors of this publication (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1419-5)
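
For reference, extracting the list of sample names from the CNV files could look like the sketch below. It assumes the seg files are tab-delimited with a "Sample" column, as in the header quoted earlier in this thread; the helper name read_sample_names and the temporary demo file are made up for illustration.

```r
# Write a small demo seg file (rows copied from the example quoted in this
# thread; tab-delimited layout is an assumption).
seg_file <- tempfile(fileext = ".seg.txt")
writeLines(c(
  "Sample\tChromosome\tStart\tEnd\tNum_Probes\tSegment_Mean",
  "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406\t1\t51598\t9250000\t4679\t0.0076",
  "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406\t1\t9250070\t9324990\t55\t0.5138"
), seg_file)

# Hypothetical helper: return the distinct sample names found in one seg file.
read_sample_names <- function(path) {
  seg <- read.delim(path, stringsAsFactors = FALSE)
  unique(seg$Sample)
}

# Apply the helper to every file (here just one) and pool the names.
seg_files <- c(seg_file)
sample_names <- unique(unlist(lapply(seg_files, read_sample_names)))
print(sample_names)
```

With real data, seg_files would be the vector of paths to the downloaded files.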

0
Entering edit mode

Hi,

I'm getting an error:

 Error in UseMethod("filter") :
no applicable method for 'filter' applied to an object of class "c('gdc_files', 'GDCQuery', 'list')"


0
Entering edit mode

Please actually show the code that produced the error
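
Without the code it is hard to say more. One possible cause, which is only an assumption here, is that another attached package (dplyr, for example) masks `filter`, so the pipeline ends up calling a different `filter` than the one from GenomicDataCommons. A quick way to check which package a bare `filter` call resolves to in the current session:

```r
# Report which namespace a bare `filter` call would use in this session.
# In a fresh R session with no extra packages attached this is "stats".
filter_owner <- environmentName(environment(filter))
print(filter_owner)

# If this prints something other than "GenomicDataCommons" after loading the
# packages, one option is to qualify the call inside the pipeline, e.g.
# GenomicDataCommons::filter(...), so the intended version is used.
```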

4
Entering edit mode
4.2 years ago

Edit: original function written by Bioinfo (via Sean Davis' blog) for translating file UUIDs into TCGA barcodes ( C: Sample names for TCGA data from GDC-legacy archive ). This function (below) translates file names into TCGA barcodes.

A manual lookup of 507 samples is not that bad, if the desire is really there to get the work done. I have done manual lookups of >1000 TCGA samples back when there were no automated services.

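library(GenomicDataCommons)
library(magrittr)

# Translate GDC file names into TCGA sample barcodes
TCGAtranslateID = function(file_names, legacy = TRUE) {
  # Query the GDC files endpoint for the given file names and request
  # the sample barcode(s) attached to each file
  info = files(legacy = legacy) %>%
    filter( ~ file_name %in% file_names) %>%
    select('cases.samples.submitter_id') %>%
    results_all()

  # One list item per file, each holding that file's TCGA barcode(s)
  id_list = lapply(info$cases, function(a) {
    a[[1]][[1]][[1]]})

  # Number of barcodes per file, used to repeat the file IDs accordingly
  barcodes_per_file = sapply(id_list, length)

  # Note: row.names = file_names assumes one result row per input name,
  # returned in the same order as the input
  return(
    data.frame(
      file_id = rep(ids(info), barcodes_per_file),
      submitter_id = unlist(id_list),
      row.names = file_names))
}

TCGAtranslateID('AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt')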


## Output

                                 file_id                               submitter_id
AMAZE_p_TCGASNP_b86_87...seg.txt 6352ceaf-99f4-4b74-94a2-dc5e405543f0  TCGA-BJ-A0Z9-01A
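
The barcode returned here (TCGA-BJ-A0Z9-01A) is a sample-level barcode, while the clinical file uses patient-level submitter IDs such as "TCGA-DJ-A2QA". The first 12 characters of a TCGA barcode identify the patient, so a minimal sketch of the last step could look like this. The clinical table below is a made-up stand-in, and its column name submitter_id is an assumption about how the real clinical file is laid out.

```r
# Sample-level barcode as returned by the lookup above
sample_barcodes <- c("TCGA-BJ-A0Z9-01A")

# The first 12 characters (TCGA-XX-XXXX) identify the patient
mapping <- data.frame(
  sample_barcode = sample_barcodes,
  submitter_id   = substr(sample_barcodes, 1, 12),
  stringsAsFactors = FALSE
)

# Hypothetical clinical table with one row per patient
clinical <- data.frame(
  submitter_id = c("TCGA-BJ-A0Z9", "TCGA-DJ-A2QA"),
  stringsAsFactors = FALSE
)

# Keep only the patients for which a CNV sample was found
merged <- merge(mapping, clinical, by = "submitter_id")
print(merged)
```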

0
Entering edit mode

Hi Kevin,

Thank you so much. It worked.

0
Entering edit mode

No problem. This function should also accept an entire vector of filenames, like:

c("filename1", "filename2", "filename3", "filename4",...)
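
Rather than typing the names by hand, the vector can be built from the download directory. A rough sketch, assuming all CNV files end in ".seg.txt"; the directory and file names below are placeholders created only for the demonstration:

```r
# Create a throwaway directory with placeholder files for the demo
seg_dir <- file.path(tempdir(), "cnv_demo")
dir.create(seg_dir, showWarnings = FALSE)
file.create(file.path(seg_dir, c("sampleA.hg18.seg.txt",
                                 "sampleB.hg18.seg.txt",
                                 "notes.md")))

# Collect only the seg files; list.files() returns names without the
# directory part unless full.names = TRUE is set
file_names <- list.files(seg_dir, pattern = "\\.seg\\.txt$")
print(file_names)

# The whole vector can then be passed in a single call, e.g.
# TCGAtranslateID(file_names)
```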

0
Entering edit mode

Hi Kevin, I ran the code successfully.

However, I have run into another problem. I have 1026 file names, but only 1020 IDs were found.

Moreover, the results do not include the file names (AMAZE_p_TCGASNP_b86_87...seg.txt), so I cannot map them back to my original input. The results contain file_id (fda02baa-b6ba-47cd-88d2-20bd14a193a4), submitter_ID (fda02baa-b6ba-47cd-88d2-20bd14a193a4) and a third column (TCGA-BJ-A0Z9-01A).

Do I have to include "file_name" in "return(data.frame(file_id=rep(ids(info),barcodes_per_file), submitter_id=unlist(id_list), row.names=file_names))"?
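
One way to approach both issues, sketched in base R: if the lookup result carries a file_name column (for instance by also requesting 'file_name' in the select() call, which is an untested assumption here), a left join against the input keeps every input file and leaves NA where no barcode was found. The lookup table and file names below are invented for illustration.

```r
# Hypothetical lookup result with a file_name column (made-up values);
# note that the rows are not in input order and one file is missing
lookup <- data.frame(
  file_name    = c("b.seg.txt", "a.seg.txt"),
  file_id      = c("uuid-b", "uuid-a"),
  submitter_id = c("TCGA-XX-0002-01A", "TCGA-XX-0001-01A"),
  stringsAsFactors = FALSE
)
input_files <- c("a.seg.txt", "b.seg.txt", "c.seg.txt")

# Left join on file_name: every input file gets a row, unmatched ones get NA
mapped <- merge(
  data.frame(file_name = input_files, stringsAsFactors = FALSE),
  lookup,
  by = "file_name",
  all.x = TRUE
)
print(mapped)

# Input files for which nothing was returned
missing_files <- setdiff(input_files, lookup$file_name)
print(missing_files)
```

Joining on the file name avoids relying on the order of the returned rows, and the list of missing files shows which of the inputs had no match.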

1
Entering edit mode

I was finally able to get the full description of my files.
