Samples with same TCGA barcode in TCGA data
3.5 years ago
Vasu ▴ 600

Hi,

I downloaded TCGA RNA-seq data using the TCGAbiolinks package. In the data I see that there are a few samples with the same barcode. Which sample should I prefer?

For example:

TCGA-A6-2684-01A-01R-1410-07
TCGA-A6-2684-01A-01R-A278-07


From the TCGA barcodes I see that the plate is different (1410 vs A278) but the sample ID (TCGA-A6-2684-01) is the same. Which one should I prefer for the analysis? Do I need to keep both samples? If I go by sample ID, they are duplicate samples.
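To make the comparison concrete, here is a minimal base R sketch that splits the two barcodes into their dash-separated fields and flags barcodes that share a sample ID. The field names (`vial`, `portion`, `analyte`, `plate`, `center`) follow the usual reading of a TCGA aliquot barcode; the data frame and its column names are only illustrative.

```r
# Split TCGA aliquot barcodes into their dash-separated fields.
barcodes <- c("TCGA-A6-2684-01A-01R-1410-07",
              "TCGA-A6-2684-01A-01R-A278-07")

parts <- do.call(rbind, strsplit(barcodes, "-", fixed = TRUE))

fields <- data.frame(
  barcode   = barcodes,
  sample_id = substr(barcodes, 1, 15),   # project-TSS-participant-sample type
  vial      = substr(parts[, 4], 3, 3),
  portion   = substr(parts[, 5], 1, 2),
  analyte   = substr(parts[, 5], 3, 3),
  plate     = parts[, 6],
  center    = parts[, 7],
  stringsAsFactors = FALSE
)

# Flag every barcode whose 15-character sample ID occurs more than once.
dup_ids <- fields$sample_id[duplicated(fields$sample_id)]
fields$is_replicate <- fields$sample_id %in% dup_ids

print(fields)
```

For this pair, only the `plate` column differs between the two rows.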

RNA-Seq tcga gdc • 4.7k views
3.5 years ago

In the example that you have given, they are indeed the same sample and are just different aliquots from the same primary solid tumor (sample type code 01 in the barcode).

To check this, I went to the GDC and put one of the sample names into the search box.

It is for cases like this that I wish the maintainers of these TCGA R packages did more curation of their data, but I admit that it is a lot of data and that, as researchers, we may not even have funding left for such things.

With regard to which one you choose, you could just take the mean value of both, or simply discard one. Either way, reporting this in the methods takes a single line, which reviewers will brush over.
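As a rough illustration of the two options, the base R sketch below works on a toy matrix with made-up values (genes in rows, aliquot barcodes in columns). The object names `counts`, `averaged` and `first_only` are hypothetical and not part of any package.

```r
# Toy count matrix with made-up values: genes in rows, barcodes in columns.
counts <- matrix(
  c(10, 0, 5,
    14, 2, 7,
    3,  8, 1),
  nrow = 3,
  dimnames = list(
    c("geneA", "geneB", "geneC"),
    c("TCGA-A6-2684-01A-01R-1410-07",
      "TCGA-A6-2684-01A-01R-A278-07",
      "TCGA-A6-6650-01A-11R-1774-07")
  )
)

# The first 15 characters of a barcode identify the sample.
sample_id <- substr(colnames(counts), 1, 15)

# Option 1: average the columns that share a sample ID.
ids <- unique(sample_id)
averaged <- sapply(ids, function(id) {
  rowMeans(counts[, sample_id == id, drop = FALSE])
})

# Option 2: keep only the first column seen for each sample ID.
first_only <- counts[, !duplicated(sample_id), drop = FALSE]

print(averaged)
print(first_only)
```

Note that averaging raw counts can produce non-integer values, so you may need to round them if a downstream tool expects integer counts.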

Kevin


Thanks for the reply. I'm thinking of checking the number of genes with zero read counts and, for each duplicated sample, keeping the aliquot that has fewer genes with zero read counts for further analysis. Do you think this is a good idea?
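A minimal sketch of that idea in base R, again on a toy matrix with made-up values; `counts`, `n_zero` and `keep` are illustrative names only. If two aliquots tie, `which.min` simply returns the first one.

```r
# Toy count matrix with made-up values: genes in rows, barcodes in columns.
counts <- matrix(
  c(10, 0, 5, 0,
    14, 2, 7, 0),
  nrow = 4,
  dimnames = list(
    paste0("gene", 1:4),
    c("TCGA-A6-2684-01A-01R-1410-07",
      "TCGA-A6-2684-01A-01R-A278-07")
  )
)

sample_id <- substr(colnames(counts), 1, 15)

# Number of genes with zero reads in each aliquot.
n_zero <- colSums(counts == 0)

# Within each sample ID, keep the barcode with the fewest zero-count genes.
keep <- unlist(
  lapply(split(names(n_zero), sample_id), function(bc) {
    bc[which.min(n_zero[bc])]
  }),
  use.names = FALSE
)

filtered <- counts[, keep, drop = FALSE]
print(n_zero)
print(keep)
```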


Yes, that is also a good idea.


I would also just check some of the other IDs that are like this, just to be sure.


Yes, as a first step I'm checking the sample IDs in cBioPortal, and as the next step I will check the gene counts.

3.5 years ago
svlachavas ▴ 750

Just to add a quick note to Kevin's helpful answer: duplicated samples can turn up in any condition, and there are guidelines from the Broad Institute that you can take into account, as in the following example from the READ dataset.

In detail, imagine you had the samples with the barcodes "TCGA-A6-6650-01A-11R-1774-07" and "TCGA-A6-6650-01A-11R-A278-07".

Following their approach, you should choose the latter, because that aliquot has the later plate number.
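A small base R sketch of this "later plate number" rule, assuming the replicates of a sample differ only in plate, so that the barcode with the highest sort value is also the one with the later plate. The helper `pick_last` is hypothetical. Using `method = "radix"` makes `sort` use C-locale ordering, in which digits come before capital letters, so A278 sorts after 1774.

```r
barcodes <- c("TCGA-A6-6650-01A-11R-1774-07",
              "TCGA-A6-6650-01A-11R-A278-07",
              "TCGA-A6-2674-01A-02R-A278-07",
              "TCGA-A6-2674-01A-02R-0821-07")

sample_id <- substr(barcodes, 1, 15)

# Hypothetical helper: return the barcode with the highest sort value.
# method = "radix" sorts in the C locale, independent of the session locale.
pick_last <- function(bc) sort(bc, method = "radix")[length(bc)]

# Apply the rule within each sample ID.
selected <- unname(vapply(split(barcodes, sample_id), pick_last, character(1)))
print(selected)
```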

You could also remove any FFPE cases: http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html

Hope that helps,

Efstathios


Thanks for the information. Do you mean that in the above example I need to select "TCGA-A6-6650-01A-11R-A278-07", the one with plate number A278?


Yes that is correct


Thanks, but I have a question. Please look at these:

TCGA-A6-2684-01A-01R-1410-07
TCGA-A6-2684-01A-01R-A278-07

TCGA-A6-6650-01A-11R-A278-07
TCGA-A6-6650-01A-11R-1774-07

TCGA-A6-2674-01A-02R-A278-07
TCGA-A6-2674-01A-02R-0821-07

TCGA-A6-3809-01A-01R-A278-07
TCGA-A6-3809-01A-01R-1022-07

TCGA-A6-6780-01A-11R-A278-07
TCGA-A6-6780-01A-11R-1839-07

TCGA-A6-3810-01A-01R-1022-07
TCGA-A6-3810-01A-01R-A278-07

TCGA-A6-5659-01A-01R-1653-07
TCGA-A6-5659-01A-01R-A278-07

TCGA-A6-6781-01A-22R-1928-07
TCGA-A6-6781-01A-22R-A278-07

TCGA-A6-5656-01A-21R-A278-07
TCGA-A6-5656-01A-21R-1839-07


In every one of these pairs there is an aliquot with plate number A278. So do I need to select only those aliquots? Also, when I checked the FireBrowse (Broad Institute) data, none of the samples had plate number A278. What should I do now?


Dear Bioinfo,

You really should not be concerned about this. As far as I can see from the above, each pair of barcodes belongs to a different patient, so you can follow the approach above. Also take a look at the following link, which discusses these issues, including the possibility that specific samples might not be included. The approach they recommend is still valid:

https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon

Also keep in mind, since you use the TCGAbiolinks R package, to perform a quick initial filtering step that first removes any FFPE samples.

For example, take the output of the GDCprepare function for your downloaded dataset, stored as an object named cancer_data:

data_filt <- cancer_data[, !cancer_data$is_ffpe]

Thank you very much for the information.

I downloaded the CNV data with TCGAbiolinks and found no column called is_ffpe. Is there an easy way to apply the rules to multiple aliquots?

Thanks, I wrote a function to solve this.

Thanks! Just to confirm: your function filters out FFPE? Does it do anything else?

This function filters barcodes following the rules provided by the Broad Institute, which have two parts: the Analyte Replicate Filter and the Sort Replicate Filter. I updated the function to also filter samples marked as FFPE by Broad, and tested it. In the example below, the 4th and 5th barcodes are FFPE samples.

> tcga_replicateFilter(tsb = c("TCGA-55-7913-01B-11H-2237-01", "TCGA-55-7913-01B-11R-2237-01", "TCGA-55-7913-01B-11T-2237-01", "TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01"), analyte_target = "RNA")
ooo Filter barcodes successfully!
[1] "TCGA-55-7913-01B-11H-2237-01" "TCGA-44-2656-01B-06D-A273-01"
> tcga_replicateFilter(tsb = c("TCGA-55-7913-01B-11H-2237-01", "TCGA-55-7913-01B-11R-2237-01", "TCGA-55-7913-01B-11T-2237-01", "TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01"), analyte_target = "RNA", filter_FFPE = TRUE, full_barcode = TRUE)
ooo Filter barcodes successfully!
[1] "TCGA-55-7913-01B-11H-2237-01"

BTW, TCGA does not use FFPE samples to generate data for DNA or RNA analyses. TCGA only characterized samples that were frozen soon after surgery, to prevent degradation of the RNA and DNA. FFPE (formalin-fixed paraffin-embedded) samples were not used because of potential changes to the RNA and DNA that may arise from the fixation process. The Broad Institute does list FFPE cases: http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html, but they are all of type FPPP, so there is no need to worry about this.
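For readers who want to see the shape of the two filters, here is a minimal base R sketch. It is not the code of tcga_replicateFilter. The RNA analyte precedence (H before R before T) is an assumption on my part: it agrees with the output above, where the H aliquot is returned, but the thread does not spell the rule out. The names `rna_rank` and `pick_aliquot` are hypothetical, and FFPE filtering is left out.

```r
barcodes <- c("TCGA-55-7913-01B-11H-2237-01",
              "TCGA-55-7913-01B-11R-2237-01",
              "TCGA-55-7913-01B-11T-2237-01",
              "TCGA-44-2656-01B-06D-A271-08",
              "TCGA-44-2656-01B-06D-A273-01")

# ASSUMPTION: RNA analyte precedence H, then R, then T (lower rank wins).
rna_rank <- c(H = 1, R = 2, T = 3)

# Hypothetical helper: choose one aliquot from the barcodes of one sample.
pick_aliquot <- function(bc) {
  # Character 20 of an aliquot barcode is the analyte code.
  rank <- rna_rank[substr(bc, 20, 20)]   # NA for non-RNA analytes
  # Analyte filter: keep only the best-ranked RNA analyte, if any is present.
  if (!all(is.na(rank))) {
    bc <- bc[which(rank == min(rank, na.rm = TRUE))]
  }
  # Sort filter: of what remains, take the highest C-locale sort value.
  sort(bc, method = "radix")[length(bc)]
}

sample_id <- substr(barcodes, 1, 15)
selected <- unname(vapply(split(barcodes, sample_id), pick_aliquot, character(1)))
print(selected)
```

On these five barcodes the sketch keeps the same two aliquots as the first call above, although possibly in a different order.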
3.3 years ago
Shixiang ▴ 80

Following the rules from the Broad Institute, I wrote an R function to solve this problem and tested it using LUAD tumor barcodes. I put it in a GitHub gist, tcga_replicateFilter.R, because of the word limit on Biostars. It may help people who have related questions.

I just checked it on the data @bioinfo showed:

library(readr)  # provides read_tsv

test_data = read_tsv("TCGA-A6-2684-01A-01R-1410-07
TCGA-A6-2684-01A-01R-A278-07
TCGA-A6-6650-01A-11R-A278-07
TCGA-A6-6650-01A-11R-1774-07
TCGA-A6-2674-01A-02R-A278-07
TCGA-A6-2674-01A-02R-0821-07
TCGA-A6-3809-01A-01R-A278-07
TCGA-A6-3809-01A-01R-1022-07
TCGA-A6-6780-01A-11R-A278-07
TCGA-A6-6780-01A-11R-1839-07
TCGA-A6-3810-01A-01R-1022-07
TCGA-A6-3810-01A-01R-A278-07
TCGA-A6-5659-01A-01R-1653-07
TCGA-A6-5659-01A-01R-A278-07
TCGA-A6-6781-01A-22R-1928-07
TCGA-A6-6781-01A-22R-A278-07
TCGA-A6-5656-01A-21R-A278-07
TCGA-A6-5656-01A-21R-1839-07", col_names=FALSE)
tcga_replicateFilter(test_data$X1)


The result is:

> tcga_replicateFilter(test_data$X1)
ooo Filter barcodes successfully!
[1] "TCGA-A6-2684-01A-01R-A278-07" "TCGA-A6-6650-01A-11R-A278-07" "TCGA-A6-2674-01A-02R-A278-07"
[4] "TCGA-A6-3809-01A-01R-A278-07" "TCGA-A6-6780-01A-11R-A278-07" "TCGA-A6-3810-01A-01R-A278-07"
[7] "TCGA-A6-5659-01A-01R-A278-07" "TCGA-A6-6781-01A-22R-A278-07" "TCGA-A6-5656-01A-21R-A278-07"


If this is wrong, please let me know.

I have also put it on GitHub.
