Duplicate samples in TCGA Breast cancer data. Which one to pick?
4.6 years ago
Vasu

Hi,

I have downloaded the TCGA breast cancer data, a total of 1256 FASTQ files. I have the UUIDs, so I used the "Genomics Data Commons" package to get the TCGA barcodes for those UUIDs. However, I see duplicate matching sample names. Which one should I pick for the analysis?

         UUID                               samplenames
5516dd59-3d95-4bc6-84e7-5719b1bbcabf    TCGA-A7-A26F-01B
b570a72f-5e6c-4301-923b-9992662409ca    TCGA-A7-A26F-01B
ba22d7e6-3e70-4a43-9dc1-59069b39e8c2    TCGA-A7-A26F-01B
eb068925-2dcc-4e18-838f-903ac8d2b661    TCGA-A7-A26F-01A

RNA-Seq tcga breast gdc
Yes, but for the GDC legacy data I don't see any aliquots like those given in the GDC harmonized data.

@Sean Davis Could you please help with this? With the "Genomics Data Commons" package I got the submitter IDs for the UUIDs, but there are duplicates. Which one should I pick? I don't even have the plate number to select the samples. Is there a way to get the whole TCGA barcode, like "TCGA-A6-6781-01A-22R-A278-07", from the UUIDs so that I can select based on plate numbers?

Sean Davis

4.6 years ago

I've just done some further investigating. It's possible to locate the full barcode of these files.

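Essentially, as you can already tell, the following 3 UUIDs belong to the same aliquot:

• 5516dd59-3d95-4bc6-84e7-5719b1bbcabf
• b570a72f-5e6c-4301-923b-9992662409ca
• ba22d7e6-3e70-4a43-9dc1-59069b39e8c2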

One can tell this by the matched short TCGA barcode (TCGA-A7-A26F-01B), and also the matched Entity ID and Case ID on their respective GDC Legacy Archive records. The full TCGA barcode of these is: TCGA-A7-A26F-01B-04R-A22O-07

For the other 2 samples:

• eb068925-2dcc-4e18-838f-903ac8d2b661
• a907f2d1-92ad-4a1b-b439-20e5a7347d5b

These have the same Case ID as the other samples, but a different matched Entity ID, thus, a different aliquot. Their full TCGA barcode is: TCGA-A7-A26F-01A-21R-A169-07

Edit 27th September 2018

In situations where you have a duplicate short TCGA barcode / sample, the Broad Institute recommends taking the sample with the "highest lexicographical sort value" for the plate number - see HERE and HERE. The plate number is the penultimate segment of the full TCGA barcode.
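For anyone who wants to apply the plate-number rule to a list of full barcodes, here is a minimal Python sketch. It only covers the plate comparison described above, not any other criteria the Broad Institute may apply. The function names `parse_barcode` and `pick_highest_plate` are my own, and the second barcode in the example is a made-up duplicate used purely for illustration.

```python
def parse_barcode(barcode):
    """Split a full 7-segment TCGA barcode into the parts used here."""
    full = barcode.strip()
    parts = full.split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 segments, got {len(parts)}: {barcode!r}")
    return {
        "short": "-".join(parts[:4]),  # e.g. TCGA-A7-A26F-01A
        "plate": parts[5],             # penultimate segment
        "full": full,
    }


def pick_highest_plate(barcodes):
    """For each short barcode, keep the full barcode whose plate sorts highest."""
    best = {}
    for barcode in barcodes:
        info = parse_barcode(barcode)
        kept = best.get(info["short"])
        # Plain string comparison gives the lexicographical ordering.
        if kept is None or info["plate"] > kept["plate"]:
            best[info["short"]] = info
    return {short: info["full"] for short, info in best.items()}


barcodes = [
    "TCGA-A6-6781-01A-22R-A278-07",  # barcode quoted in the question
    "TCGA-A6-6781-01A-22R-1928-07",  # hypothetical duplicate, for illustration only
    "TCGA-A7-A26F-01A-21R-A169-07",
]
print(pick_highest_plate(barcodes))
```

Note that the grouping is by the short barcode including the vial letter (01A vs 01B), so vials are never compared against each other here.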

Kevin

I edited the final part of my comment since you posted yours.

1. Go here: https://portal.gdc.cancer.gov/cases/3b7b9c1e-a84c-47ed-983c-9e4b00cbf01a?bioId=2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b
2. Search for nationwidechildrens.org_biospecimen.TCGA-A7-A26F.xml on the page
4. Search for the Entity IDs for your samples
OK. So, for all the duplicate samples I have to download the XML. But the link you gave is for the harmonized data, whereas the data I downloaded is from the GDC legacy archive.

There is only 1 XML biospecimen file for the TCGA patient whose barcode is TCGA-A7-A26F. If you search for the 2 Entity IDs that you have for your 5 samples in that biospecimen XML, then you'll see the full TCGA barcode.
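If you have many duplicates, the search can be scripted instead of done by hand. Below is a rough Python sketch of the idea: find the element that holds the Entity ID and report any sibling element whose text looks like a full TCGA barcode. The toy XML and its tag names (`aliquot`, `barcode`, `uuid`) are invented for the example and are not the real biospecimen schema; the function deliberately ignores tag names so that it does not depend on them.

```python
import re
import xml.etree.ElementTree as ET

# Full TCGA barcode: "TCGA" followed by six more dash-separated segments.
BARCODE_RE = re.compile(r"^TCGA(-[A-Z0-9]+){6}$")

# Toy document with made-up tag names; a real biospecimen XML is laid out differently.
TOY_XML = """
<aliquots>
  <aliquot>
    <barcode>TCGA-A7-A26F-01A-21R-A169-07</barcode>
    <uuid>2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b</uuid>
  </aliquot>
  <aliquot>
    <barcode>TCGA-A7-A26F-01B-04R-A22O-07</barcode>
    <uuid>1B907925-B33C-4E4A-96E0-65F15B4712B9</uuid>
  </aliquot>
</aliquots>
"""


def find_barcodes_for_uuid(xml_text, entity_id):
    """Return full barcodes stored next to the given Entity ID."""
    root = ET.fromstring(xml_text)
    wanted = entity_id.strip().lower()
    hits = []
    for parent in root.iter():
        texts = [(child.text or "").strip() for child in parent]
        # UUIDs may be upper- or lower-case, so compare case-insensitively.
        if any(text.lower() == wanted for text in texts):
            hits.extend(text for text in texts if BARCODE_RE.match(text))
    return hits


print(find_barcodes_for_uuid(TOY_XML, "2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b"))
print(find_barcodes_for_uuid(TOY_XML, "1b907925-b33c-4e4a-96e0-65f15b4712b9"))
```

For a real file you would read the downloaded XML from disk and pass its contents as `xml_text`; check the output against a manual search for at least one sample before trusting it.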

Further investigation leads me to advise you not to use the 01B samples. Going by the biospecimen data, these are from an FFPE validation that was originally performed. Use the 01A samples and treat them as replicate RNA-seq samples in your study.

I just checked: it is available in the GDC legacy archive too. Thank you!!

Oh, yes, it should be there too. Please read my latest comment too. It appears that the 01B samples are FFPE, so, that's justification enough to not use those.

Sure. thank you very much !!

Sorry, just to confirm, for anyone else coming here, what I am looking at to gauge whether a sample is FFPE or not.

Here are lines from the biospecimen for your 2 Entity IDs (note the reference to FFPE):

• TCGA Barcode: TCGA-A7-A26F-01A-21R-A169-07
• Entity ID: 2a4747b5-1eeb-45b1-9e92-0e0e3d7a9c1b

• TCGA barcode: TCGA-A7-A26F-01B-04R-A22O-07
• Entity ID: 1b907925-b33c-4e4a-96e0-65f15b4712b9
• File UUIDs: 5516dd59-3d95-4bc6-84e7-5719b1bbcabf; b570a72f-5e6c-4301-923b-9992662409ca; ba22d7e6-3e70-4a43-9dc1-59069b39e8c2

So, from the samples in my question, I can select only TCGA-A7-A26F-01A. But two UUIDs still have the same TCGA barcode, "TCGA-A7-A26F-01A-21R-A169-07". Of these two, UUID "a907f2d1-92ad-4a1b-b439-20e5a7347d5b" has a 10 GB FASTQ file and the other UUID, "eb068925-2dcc-4e18-838f-903ac8d2b661", has a 13 GB FASTQ file. Which one should I prefer?

Check the quality of both files in the non-FFPE sample (01A). File size is no reflection of the quality of the reads.

Do you mean I need to take both files through alignment and then check the reads? Is that what you are saying, or something else?

You can use something like FastQC from the Babraham Institute to look at the FASTQ qualities. You can then also gauge quality post-alignment, using metrics such as alignment percentage and unique alignments.
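As a quick first look before running a proper QC tool, the two files can be compared with a few lines of Python. This is only a crude sketch and not a replacement for FastQC: it assumes plain 4-line FASTQ records with Phred+33 quality encoding, and `mean_phred` is a name I made up for this example.

```python
import gzip


def mean_phred(path, max_reads=100000):
    """Mean base quality over the first max_reads records of a FASTQ file.

    Assumes 4 lines per record and Phred+33 encoded qualities.
    """
    opener = gzip.open if path.endswith(".gz") else open
    total = 0
    n_bases = 0
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i // 4 >= max_reads:
                break
            if i % 4 == 3:  # fourth line of each record holds the qualities
                qual = line.rstrip("\n")
                total += sum(ord(ch) - 33 for ch in qual)
                n_bases += len(qual)
    return total / n_bases if n_bases else float("nan")
```

You would call it once per file, e.g. `mean_phred("file_a.fastq.gz")` and `mean_phred("file_b.fastq.gz")` (placeholder paths), and compare the two numbers. A single mean hides per-position and per-read problems, which is why a full report is still worth generating.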