Question: Problem matching file names to patient IDs in TCGA
19 days ago by nazaninhoseinkhan (Iran, Islamic Republic Of) wrote:

Hi all,

I have downloaded all the CNV files for one cancer type from the GDC portal.

I also have the clinical data for all patients. However, I cannot map the file names to the submitter IDs.

The file name is something like "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt", while the submitter ID is something like "TCGA-DJ-A2QA".

Can anybody guide me on how to match these two names?

Thank you in advance

Nazanin

cnv tcga problem in matching • 199 views
ADD COMMENTlink written 19 days ago by nazaninhoseinkhan240

Could you give us an example, for one patient, of what you have downloaded, with links and/or pictures, please?

ADD REPLYlink written 19 days ago by Bastien Hervé1.3k

I downloaded all the CNV files using the TCGA2bed software.

These are some of the CNV files that were downloaded: "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt",

"AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A02_735476.hg18.seg.txt"

I want to map these files to the clinical file that I downloaded previously.

In the clinical file, only the submitter ID and patient ID are available.

ADD REPLYlink written 19 days ago by nazaninhoseinkhan240

Sample names might be inside the text files. Did you check the headers of the files?

ADD REPLYlink written 19 days ago by cpad01126.4k

Hi,

No, the header just includes the results, something like this:

Sample                                                  Chromosome  Start    End      Num_Probes  Segment_Mean
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406  1           51598    9250000  4679        0.0076
AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406  1           9250070  9324990  55          0.5138

ADD REPLYlink written 19 days ago by nazaninhoseinkhan240

Do you have the annotations.txt file that comes with the CNV files?

In this file you will find the entity_id, which can also be found in the clinical files.

ADD REPLYlink written 19 days ago by Bastien Hervé1.3k

Hi, yes, I have also downloaded the annotation file. However, it does not include the names of the CNV files, so I cannot use it for matching. The following is the header of the annotation file:

"category   classification  entity_type created_datetime    annotation_id   case_submitter_id   project/project_id  entity_submitter_id id
Alternate sample pipeline   Notification    case    2012-11-13T00:00:00 29ba39af-b266-547a-b2c9-7795eba2e202    TCGA-AB-2822    TCGA-LAML   TCGA-AB-2822    29ba39af-b266-547a-b2c9-7795eba2e202
History of unacceptable prior treatment related to a prior/other malignancy Notification    case    2014-06-16T00:00:00 3d086829-de62-5d08-b848-ce0724188ff0    TCGA-AG-A014    TCGA-READ   TCGA-AG-A014    3d086829-de62-5d08-b848-ce0724188ff0
Center QC failed    CenterNotification  aliquot 2012-07-20T00:00:00 5cf05f41-ce70-58a3-8ecb-6bfaf6264437    TCGA-13-0913    TCGA-OV TCGA-13-0913-02A-01R-1564-13    5cf05f41-ce70-58a3-8ecb-6bfaf6264437
History of unacceptable prior treatment related to a prior/other malignancy Notification    case    2014-06-16T00:00:00 c53f22b1-677b-5528-a438-39d5390e2c68    TCGA-21-1077    TCGA-LUSC   TCGA-21-1077    c53f22b1-677b-5528-a438-39d5390e2c68
"
ADD REPLYlink modified 19 days ago by genomax50k • written 19 days ago by nazaninhoseinkhan240

Along with your AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt file, you have an annotation file in which you can find an entity_id. I think in this case it is 29ba39af-b266-547a-b2c9-7795eba2e202, which should correspond to the case_id in your clinical file.

This still needs to be checked.

ADD REPLYlink modified 19 days ago • written 19 days ago by Bastien Hervé1.3k

The problem is that I have downloaded the CNV files for 507 patients with TCGA2bed. I know that I can find the patient or submitter ID via the GDC, but I cannot do this manually for all 507 cases, so I am looking for a way to find the corresponding patient or submitter IDs automatically.

In other words, I want to find the patient or submitter ID based on "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt". In the annotation file, there is no column that contains any part of the name "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt".

ADD REPLYlink written 19 days ago by nazaninhoseinkhan240

What commands did you use?

ADD REPLYlink written 19 days ago by Bastien Hervé1.3k

TCGA2bed is a graphical tool in which you can select between annotation and experiment. After selecting the tumor type, you have to select the type of data: CNV, RNASeq, ...

ADD REPLYlink written 19 days ago by nazaninhoseinkhan240

As I don't know this API and it's not open source, I can't really help you further. Your CNV files contain sample names, so you can try to extract a list of them.

Then, I found this R package (https://cran.r-project.org/web/packages/TCGAretriever/TCGAretriever.pdf), which I think you can use to query the TCGA database with your list of sample names.

Or you can try to contact the authors of this publication (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-016-1419-5)

ADD REPLYlink written 19 days ago by Bastien Hervé1.3k
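
The suggestion above, to list the sample names held inside the CNV files, could be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the files are taken to be tab-delimited with a "Sample" column (as in the excerpt quoted earlier in the thread), the temporary file is a toy copy of those two quoted rows, and the helper name `seg_sample_names` is made up for this example.

```r
# Hypothetical helper: return the distinct values of the "Sample" column
# of one tab-delimited .seg.txt file.
seg_sample_names <- function(path) {
  seg <- read.delim(path, stringsAsFactors = FALSE)
  unique(seg$Sample)
}

# Toy file built from the two rows quoted earlier in the thread.
sample_id <- "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406"
toy <- data.frame(
  Sample       = sample_id,
  Chromosome   = 1,
  Start        = c(51598, 9250070),
  End          = c(9250000, 9324990),
  Num_Probes   = c(4679, 55),
  Segment_Mean = c(0.0076, 0.5138),
  stringsAsFactors = FALSE
)
seg_path <- file.path(tempdir(), paste0(sample_id, ".hg18.seg.txt"))
write.table(toy, seg_path, sep = "\t", quote = FALSE, row.names = FALSE)

# Distinct sample names found in the file
samples <- seg_sample_names(seg_path)
print(samples)
```

For a whole folder, the same helper could be applied to every path returned by `list.files(..., full.names = TRUE)` and the results combined with `unlist()`.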
19 days ago by Kevin Blighe (University College London Cancer Institute) wrote:

Edit: original function written by Bioinfo for translating file UUIDs into TCGA barcodes (C: Sample names for TCGA data from GDC-legacy archive ). This function (below) translates file names into TCGA barcodes.

A manual lookup of 507 samples is not that bad, if the desire is really there to get the work done. I have done manual lookups of >1000 TCGA samples back when there were no automated services.

The one solution that I thought would work was this function:

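library(GenomicDataCommons)
library(magrittr)

# Translate GDC file names into TCGA sample barcodes.
TCGAtranslateID = function(file_names, legacy = TRUE)
{
  # Query the GDC files endpoint for the given file names, requesting
  # only the sample-level submitter IDs (barcodes) of the associated cases
  info = files(legacy = legacy) %>%
    filter( ~ file_name %in% file_names) %>%
    select('cases.samples.submitter_id') %>%
    results_all()

  # Pull the barcode(s) out of the nested 'cases' entry of each file
  id_list = lapply(info$cases, function(a)
  {
    a[[1]][[1]][[1]]
  })

  # Number of barcodes per file, used to repeat each file ID accordingly
  barcodes_per_file = sapply(id_list, length)

  # Note: row.names = file_names assumes exactly one result row per input
  # file name, returned in the same order as the input
  return(data.frame(file_id = rep(ids(info), barcodes_per_file),
                    submitter_id = unlist(id_list),
                    row.names = file_names))
}

TCGAtranslateID("AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt")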

Output:

                                 file_id                               submitter_id
AMAZE_p_TCGASNP_b86_87...seg.txt 6352ceaf-99f4-4b74-94a2-dc5e405543f0  TCGA-BJ-A0Z9-01A

However, TCGA appears to have shut down access to its servers from this R package. Please do try it out, though, as you may have success.

If you cannot get that working, please contact the GDC support staff.

ADD COMMENTlink modified 14 days ago • written 19 days ago by Kevin Blighe21k
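
The barcode returned above (TCGA-BJ-A0Z9-01A) is a sample-level identifier, whereas the clinical file in the question is keyed by a patient-level submitter ID of the form TCGA-DJ-A2QA. The patient part of a TCGA barcode is its first three hyphen-separated fields, i.e. the first 12 characters, so one possible way to join the two is sketched below in base R. Both tables are hypothetical toy data: the second file name, the "-10A" suffix, the column names and the clinical values are placeholders invented for this example.

```r
# Hypothetical lookup result: one row per file with a sample-level barcode.
# The first barcode is the one shown in the output above; the second row
# is a made-up placeholder.
lookup <- data.frame(
  file_name      = c("file_A.seg.txt", "file_B.seg.txt"),
  sample_barcode = c("TCGA-BJ-A0Z9-01A", "TCGA-DJ-A2QA-10A"),
  stringsAsFactors = FALSE
)

# Hypothetical clinical table keyed by the patient-level submitter ID.
clinical <- data.frame(
  submitter_id   = c("TCGA-BJ-A0Z9", "TCGA-DJ-A2QA"),
  clinical_value = c("value_1", "value_2"),  # placeholder values
  stringsAsFactors = FALSE
)

# Keep the first 12 characters (project-site-participant) of each barcode
lookup$submitter_id <- substr(lookup$sample_barcode, 1, 12)

# Left join: every file keeps its row even if no clinical record matches
merged <- merge(lookup, clinical, by = "submitter_id", all.x = TRUE)
print(merged)
```

Using `all.x = TRUE` keeps files whose patient is absent from the clinical table, so such cases show up as `NA` rather than silently disappearing.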

Hi Kevin,

Thank you so much. It worked.

I'll never forget your help.

ADD REPLYlink modified 16 days ago • written 16 days ago by nazaninhoseinkhan240

No problem. This function should also accept an entire vector of filenames, like:

c("filename1", "filename2", "filename3", "filename4",...)
ADD REPLYlink written 16 days ago by Kevin Blighe21k
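
As a rough illustration of how such a vector of file names might be assembled rather than typed by hand, the base R sketch below lists every name ending in ".seg.txt" in a folder. The folder `seg_dir` and the empty placeholder files are created here only so that the example is self-contained; in practice the folder would be wherever the CNV files were downloaded.

```r
# Create a few empty placeholder files so the example is self-contained;
# in practice seg_dir would be the folder holding the downloaded CNV files.
seg_dir <- file.path(tempdir(), "cnv_files")
dir.create(seg_dir, showWarnings = FALSE)
created <- file.create(file.path(seg_dir, c(
  "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A01_735406.hg18.seg.txt",
  "AMAZE_p_TCGASNP_b86_87_88_N_GenomeWideSNP_6_A02_735476.hg18.seg.txt",
  "notes.txt"
)))

# Keep only names ending in ".seg.txt"; list.files() returns bare file
# names (no directory part) unless full.names = TRUE is given.
file_names <- list.files(seg_dir, pattern = "\\.seg\\.txt$")
print(file_names)
```

The resulting character vector has the same shape as the `c("filename1", "filename2", ...)` example above.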

Hi Kevin, I ran the code successfully.

However, I have run into another problem. I have 1026 file names, but only 1020 IDs were found.

Moreover, I did not get the file names (AMAZE_p_TCGASNP_b86_87...seg.txt) in the results, so I cannot map them back to my original input. The results include file_id (fda02baa-b6ba-47cd-88d2-20bd14a193a4), submitter_ID (fda02baa-b6ba-47cd-88d2-20bd14a193a4) and a third column (TCGA-BJ-A0Z9-01A).

Do I have to include "file_name" in "return(data.frame(file_id=rep(ids(info),barcodes_per_file), submitter_id=unlist(id_list), row.names=file_names))"?

ADD REPLYlink written 7 days ago by nazaninhoseinkhan240
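
A gap such as 1026 inputs versus 1020 results can be tracked down once each result row carries the file name it belongs to. The base R sketch below assumes that such a `file_name` column is available in the result (the function earlier in the thread does not return one as written) and uses made-up placeholder names and barcodes throughout.

```r
# Hypothetical input: the file names that were submitted for lookup.
file_names <- c("file_1.seg.txt", "file_2.seg.txt", "file_3.seg.txt")

# Hypothetical lookup result that carries a file_name column; file_2 is
# deliberately missing and the rows are in a different order.
result <- data.frame(
  file_name    = c("file_3.seg.txt", "file_1.seg.txt"),
  submitter_id = c("barcode_3", "barcode_1"),
  stringsAsFactors = FALSE
)

# Input file names that have no row in the result
missing_files <- setdiff(file_names, result$file_name)

# Result re-ordered to follow the input; unmatched inputs get NA
aligned <- data.frame(
  file_name    = file_names,
  submitter_id = result$submitter_id[match(file_names, result$file_name)],
  stringsAsFactors = FALSE
)

print(missing_files)
print(aligned)
```

Matching by name rather than by position avoids mislabelling rows when the results come back in a different order or with entries missing. Note that `match()` keeps only the first hit, so a file linked to several barcodes would need `merge()` instead.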

I was finally able to get the full description of my files.

Thank you all for your help and comments.

ADD REPLYlink written 6 days ago by nazaninhoseinkhan240