Question: Duplicate samples in TCGA Breast cancer data. Which one to pick?
Asked 16 months ago by Vasu:

Hi,

I have downloaded TCGA breast cancer data: a total of 1256 FASTQ files. I have the UUIDs, so I used the "Genomics Data Commons" package to get the TCGA barcodes for those UUIDs. However, I see duplicate matching sample names. Which one should I pick for the analysis?

         UUID                               samplenames
5516dd59-3d95-4bc6-84e7-5719b1bbcabf    TCGA-A7-A26F-01B
a907f2d1-92ad-4a1b-b439-20e5a7347d5b    TCGA-A7-A26F-01A
b570a72f-5e6c-4301-923b-9992662409ca    TCGA-A7-A26F-01B
ba22d7e6-3e70-4a43-9dc1-59069b39e8c2    TCGA-A7-A26F-01B
eb068925-2dcc-4e18-838f-903ac8d2b661    TCGA-A7-A26F-01A
Tags: gdc, rna-seq, breast, tcga

See: "Different TCGA file IDs with same the Sample ID" and "Samples with same TCGA barcode in TCGA data"

Reply written 16 months ago by genomax (modified 16 months ago)

Yes, but for the GDC legacy data I don't see any aliquots like those given in the GDC harmonized data.

https://portal.gdc.cancer.gov/legacy-archive/files/a907f2d1-92ad-4a1b-b439-20e5a7347d5b

Reply written 16 months ago by Vasu (modified 16 months ago)

@Sean Davis, could you please help with this? With the "Genomics Data Commons" package I got the submitter IDs for the UUIDs, but there are duplicates. Which one should I pick? I don't even have the plate number to select the samples. Is there a way to get the whole TCGA barcode, like "TCGA-A6-6781-01A-22R-A278-07", from the UUIDs so that I can select based on plate numbers?

Reply written 16 months ago by Vasu

Tagging: Sean Davis

Reply written 16 months ago by genomax

Answer written 16 months ago by Kevin Blighe:

I've just done some further investigating. It's possible to locate the full barcode of these files.

Essentially, as you can already tell, the following 3 UUIDs belong to the same aliquot:

  • 5516dd59-3d95-4bc6-84e7-5719b1bbcabf
  • b570a72f-5e6c-4301-923b-9992662409ca
  • ba22d7e6-3e70-4a43-9dc1-59069b39e8c2

One can tell this by the matched short TCGA barcode (TCGA-A7-A26F-01B), and also by the matched Entity ID and Case ID on their respective GDC Legacy Archive records. The full TCGA barcode of these is: TCGA-A7-A26F-01B-04R-A22O-07

-------------------------------------------------------------

For the other 2 samples:

  • a907f2d1-92ad-4a1b-b439-20e5a7347d5b
  • eb068925-2dcc-4e18-838f-903ac8d2b661

These have the same Case ID as the other samples but a different matched Entity ID and are, thus, a different aliquot. Their full TCGA barcode is: TCGA-A7-A26F-01A-21R-A169-07

-------------------------------------------------

Edit 27th September 2018

In situations where you have a duplicate short TCGA barcode / sample, the Broad Institute recommends taking the sample with the "highest lexicographical sort value" for the plate number - see HERE and HERE. The plate number is the penultimate segment of the full TCGA barcode.
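
As a rough illustration of that rule, here is a minimal Python sketch. It assumes every input is a full seven-segment aliquot barcode and that the candidates passed in have already been identified as duplicates of one sample. The function names `plate_of` and `pick_by_plate` are made up for this example and are not part of any Broad or GDC tool.

```python
def plate_of(barcode: str) -> str:
    """Return the plate segment (penultimate field) of a full TCGA barcode."""
    parts = barcode.split("-")
    if len(parts) != 7:
        raise ValueError(f"not a full aliquot barcode: {barcode!r}")
    return parts[5]


def pick_by_plate(barcodes):
    """Pick the barcode whose plate segment sorts highest lexicographically."""
    return max(barcodes, key=plate_of)


# The two full barcodes discussed in this thread.
candidates = [
    "TCGA-A7-A26F-01A-21R-A169-07",
    "TCGA-A7-A26F-01B-04R-A22O-07",
]
print(pick_by_plate(candidates))
```

Note that for the two barcodes in this thread the plate rule on its own returns the A22O (01B) aliquot, so it is worth checking other sample properties, such as FFPE status, before relying on it.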

Kevin

Answer written 16 months ago by Kevin Blighe (modified 10 months ago)

How do I download the biospecimen XML file? I don't see any download option for this in the GDC legacy archive.

Reply written 16 months ago by Vasu

I edited the final part of my comment since you posted yours.

  1. Go here: https://portal.gdc.cancer.gov/cases/3b7b9c1e-a84c-47ed-983c-9e4b00cbf01a?bioId=2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b
  2. Search for nationwidechildrens.org_biospecimen.TCGA-A7-A26F.xml on the page
  3. Download the XML file and open it
  4. Search for the Entity IDs for your samples
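
Step 4 can also be scripted. The following is a minimal Python sketch, assuming the biospecimen XML stores each UUID as the text of one element and the matching barcode in a sibling element whose tag name ends in "barcode". The element names in the embedded sample (`bcr_aliquot_barcode`, `bcr_aliquot_uuid`), the namespace URL, and the helper `barcode_for_uuid` are assumptions for illustration only; check them against the real file.

```python
import xml.etree.ElementTree as ET

# Made-up miniature of a biospecimen file; the real layout may differ.
SAMPLE_XML = """<bio:aliquots xmlns:bio="http://example.org/bio">
  <bio:aliquot>
    <bio:bcr_aliquot_barcode>TCGA-A7-A26F-01A-21R-A169-07</bio:bcr_aliquot_barcode>
    <bio:bcr_aliquot_uuid>2A4747B5-1EEB-45B1-9E92-0E0E3D7A9C1B</bio:bcr_aliquot_uuid>
  </bio:aliquot>
  <bio:aliquot>
    <bio:bcr_aliquot_barcode>TCGA-A7-A26F-01B-04R-A22O-07</bio:bcr_aliquot_barcode>
    <bio:bcr_aliquot_uuid>1B907925-B33C-4E4A-96E0-65F15B4712B9</bio:bcr_aliquot_uuid>
  </bio:aliquot>
</bio:aliquots>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def barcode_for_uuid(xml_text: str, uuid: str):
    """Return the barcode stored next to the given UUID, or None if absent."""
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    wanted = uuid.strip().lower()
    for elem in root.iter():
        if (elem.text or "").strip().lower() == wanted and elem in parent_of:
            for sibling in parent_of[elem]:
                if local_name(sibling.tag).lower().endswith("barcode"):
                    return (sibling.text or "").strip()
    return None


print(barcode_for_uuid(SAMPLE_XML, "2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b"))
```

For a downloaded file, the text of nationwidechildrens.org_biospecimen.TCGA-A7-A26F.xml would be read from disk and passed in place of `SAMPLE_XML`.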
Reply written 16 months ago by Kevin Blighe

OK. So, for all the duplicate samples I have to download the XML. Also, the link you gave is for the harmonized data, but the data I downloaded is from the GDC legacy archive.

Reply written 16 months ago by Vasu

There is only 1 XML biospecimen file for the TCGA patient whose barcode is TCGA-A7-A26F. If you search for the 2 Entity IDs that you have for your 5 samples in that biospecimen XML, then you'll see the full TCGA barcode.

Further investigation leads me to advise you not to use the 01B samples. Going by the biospecimen data, these are from an FFPE validation that was originally performed. Use the 01A samples and treat them as replicate RNA-seq samples in your study.

Reply written 16 months ago by Kevin Blighe (modified 16 months ago)

I just checked, and it is available in the GDC legacy archive as well. Thank you!!

Reply written 16 months ago by Vasu

Oh, yes, it should be there too. Please read my latest comment too. It appears that the 01B samples are FFPE, so, that's justification enough to not use those.

Reply written 16 months ago by Kevin Blighe

Sure. Thank you very much!!

Reply written 16 months ago by Vasu

Sorry, I'm just confirming, for anyone else coming here, what I am looking at to gauge whether a sample is FFPE or not.

Here are lines from the biospecimen XML for your 2 Entity IDs (note the reference to FFPE):

  • TCGA Barcode: TCGA-A7-A26F-01A-21R-A169-07
  • Entity ID: 2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b
  • File UUIDs: a907f2d1-92ad-4a1b-b439-20e5a7347d5b; eb068925-2dcc-4e18-838f-903ac8d2b661

[Screenshot of the biospecimen lines for the 01A aliquot; image not preserved]


  • TCGA barcode: TCGA-A7-A26F-01B-04R-A22O-07
  • Entity ID: 1b907925-b33c-4e4a-96e0-65f15b4712b9
  • File UUIDs: 5516dd59-3d95-4bc6-84e7-5719b1bbcabf; b570a72f-5e6c-4301-923b-9992662409ca; ba22d7e6-3e70-4a43-9dc1-59069b39e8c2

[Screenshot of the biospecimen lines for the 01B aliquot; image not preserved]

Reply written 16 months ago by Kevin Blighe (modified 16 months ago)

So, from the samples in my question I should select only TCGA-A7-A26F-01A. However, two UUIDs still have the same TCGA barcode, "TCGA-A7-A26F-01A-21R-A169-07". Of these two, UUID "a907f2d1-92ad-4a1b-b439-20e5a7347d5b" has a FASTQ size of 10 GB and the other UUID, "eb068925-2dcc-4e18-838f-903ac8d2b661", has a FASTQ size of 13 GB. Which one should I prefer?

Reply written 16 months ago by Vasu

Check the quality of both files in the non-FFPE sample (01A). File size is no reflection of quality of the reads.

Reply written 16 months ago by Kevin Blighe

You mean I need to take both files for alignment and then check the reads? Is that what you are saying, or something else?

Reply written 16 months ago by Vasu

You can use something like FastQC from the Babraham Institute to look at the FASTQ qualities. You can then also gauge quality post-alignment, for example via the alignment percentage and the number of unique alignments.
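
For a very quick first look before running a full QC tool, the mean base quality of each file can be compared. This is a minimal Python sketch, not a substitute for FastQC. It assumes plain four-line FASTQ records and Phred+33 quality encoding, and the function name `mean_base_quality` is made up for this example.

```python
import io


def mean_base_quality(handle, offset=33):
    """Mean Phred score over all bases in a four-line-per-record FASTQ stream."""
    total = 0
    n_bases = 0
    for i, line in enumerate(handle):
        if i % 4 == 3:  # every fourth line holds the quality string
            qual = line.rstrip("\r\n")
            total += sum(ord(ch) - offset for ch in qual)
            n_bases += len(qual)
    return total / n_bases if n_bases else float("nan")


# Tiny in-memory example: 'I' encodes Phred 40 and '!' encodes Phred 0.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n")
print(mean_base_quality(demo))
```

For real data, a text-mode handle such as `gzip.open(path, "rt")` can be passed in for each of the two files and the resulting values compared.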

Reply written 16 months ago by Kevin Blighe