Question: Samples with same TCGA barcode in TCGA data
Vasu wrote:

Hi,

I downloaded TCGA RNA-seq data using the TCGAbiolinks package. In the data I see that there are a few samples that share the same sample ID. Which sample should I prefer?

For example:

TCGA-A6-2684-01A-01R-1410-07
TCGA-A6-2684-01A-01R-A278-07

From the TCGA barcodes I see that the plates are different (1410 vs. A278) but the sample ID (TCGA-A6-2684-01) is the same. Which one should I prefer for the analysis? Do I need to keep both? If I go by sample ID, they are duplicate samples.
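To make the barcode structure concrete, here is a minimal sketch in base R that splits a full aliquot barcode into its fields and flags sample IDs that occur more than once. The helper name `parse_tcga_barcode()` is hypothetical (it is not part of TCGAbiolinks), and the sketch assumes complete 28-character aliquot barcodes like the ones above.

```r
# Split full TCGA aliquot barcodes into their fields.
# parse_tcga_barcode() is a hypothetical helper, not a TCGAbiolinks function.
parse_tcga_barcode <- function(barcodes) {
  parts <- do.call(rbind, strsplit(barcodes, "-", fixed = TRUE))
  data.frame(
    barcode     = barcodes,
    sample_id   = substr(barcodes, 1, 15),   # e.g. TCGA-A6-2684-01
    sample_type = substr(parts[, 4], 1, 2),  # 01 = primary solid tumor
    vial        = substr(parts[, 4], 3, 3),
    portion     = substr(parts[, 5], 1, 2),
    analyte     = substr(parts[, 5], 3, 3),  # R = RNA
    plate       = parts[, 6],
    center      = parts[, 7],
    stringsAsFactors = FALSE
  )
}

bc <- c("TCGA-A6-2684-01A-01R-1410-07", "TCGA-A6-2684-01A-01R-A278-07")
fields <- parse_tcga_barcode(bc)

# Sample IDs that appear more than once correspond to replicate aliquots
dup_ids <- unique(fields$sample_id[duplicated(fields$sample_id)])
print(fields[, c("sample_id", "analyte", "plate", "center")])
print(dup_ids)
```

For the two barcodes in the question, only the plate field differs; the center field is 07 in both.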

rna-seq gdc tcga
modified 13 months ago by Shixiang • written 16 months ago by Vasu

Kevin Blighe wrote:

In the example that you have given, they are indeed the same sample and are just different aliquots of the same primary solid tumor tissue (sample type code 01 in the barcode). [Screenshot of the GDC sample page]

To get that screenshot, I went to the GDC and put one of the sample names into the search box.

It is for cases like this that I wish the maintainers of these TCGA R packages did more curation of their data, but I admit that it is a lot of data and that, as researchers, we may not even have funding left for such things.

With regard to which one you choose, you could just take the mean value of both, or simply discard one. Either way, reporting this in the methods takes a single line, which reviewers will most likely skim over.
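As a rough illustration of both options, here is a minimal base R sketch on a toy matrix (genes in rows, aliquot barcodes as column names). The values are made up, and grouping by the first 15 characters of the barcode is an assumption about what counts as "the same sample".

```r
# Toy expression matrix with made-up values; columns are aliquot barcodes
counts <- matrix(
  c(10, 0, 5,    # column 1
    14, 2, 5,    # column 2
     7, 1, 0),   # column 3
  nrow = 3,
  dimnames = list(
    c("geneA", "geneB", "geneC"),
    c("TCGA-A6-2684-01A-01R-1410-07",
      "TCGA-A6-2684-01A-01R-A278-07",
      "TCGA-A6-6650-01A-11R-1774-07")
  )
)

sample_id <- substr(colnames(counts), 1, 15)

# Option 1: average all columns that share a sample ID
averaged <- sapply(unique(sample_id), function(id) {
  rowMeans(counts[, sample_id == id, drop = FALSE])
})

# Option 2: keep only the first column seen for each sample ID
first_only <- counts[, !duplicated(sample_id), drop = FALSE]

print(averaged)
print(colnames(first_only))
```

Note that averaging raw counts can give non-integer values, which some count-based tools may not accept, so discarding one aliquot may be the simpler route for count data.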

Kevin

modified 10 months ago • written 16 months ago by Kevin Blighe

Thanks for the reply. I'm thinking of checking the number of genes with zero read counts and, for further analysis, selecting the sample that has fewer genes with zero read counts. Do you think this is a good idea?
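A minimal sketch of that selection in base R, assuming a count matrix with genes in rows and aliquot barcodes as column names (the toy values below are made up, and ties are broken by column order):

```r
# Toy count matrix with made-up values; columns are aliquot barcodes
counts <- matrix(
  c(10, 0, 5,
    14, 2, 5,
     7, 1, 0),
  nrow = 3,
  dimnames = list(
    c("geneA", "geneB", "geneC"),
    c("TCGA-A6-2684-01A-01R-1410-07",
      "TCGA-A6-2684-01A-01R-A278-07",
      "TCGA-A6-6650-01A-11R-1774-07")
  )
)

sample_id <- substr(colnames(counts), 1, 15)
n_zero <- colSums(counts == 0)  # zero-count genes per aliquot

# Within each sample ID, keep the column with the fewest zero-count genes
keep <- unlist(lapply(split(seq_len(ncol(counts)), sample_id), function(idx) {
  idx[which.min(n_zero[idx])]
}))
filtered <- counts[, sort(unname(keep)), drop = FALSE]

print(n_zero)
print(colnames(filtered))
```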

written 16 months ago by Vasu

Yes, that is also a good idea.

written 16 months ago by Kevin Blighe

I would also just check some of the other IDs that are like this, just to be sure.

written 16 months ago by Kevin Blighe

Yes, as a first step I'm checking the sample IDs in cBioPortal, and in the next step I will check the gene counts.

written 16 months ago by Vasu

svlachavas wrote:

Just to add a quick note to Kevin's helpful answer: duplicated samples can occur in any condition, and there are guidelines from the Broad Institute that you could take into account, such as the following example from the READ dataset:

http://gdac.broadinstitute.org/runs/sampleReports/latest/READ_Replicate_Samples.html

In detail, imagine you had samples with the barcodes "TCGA-A6-6650-01A-11R-1774-07" and "TCGA-A6-6650-01A-11R-A278-07".

Following their approach, you should choose the latter because it is the aliquot with the later plate number.
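As I understand the Broad rule, "later" here means the barcode with the highest lexicographical sort value among aliquots of the same sample. A minimal base R sketch under that assumption follows; `pick_highest_barcode()` is a hypothetical helper, and `method = "radix"` is used so that sorting happens in the C locale (digits before letters) regardless of the session locale.

```r
# Hypothetical helper: for each sample ID keep the barcode that sorts highest
pick_highest_barcode <- function(barcodes) {
  groups <- split(barcodes, substr(barcodes, 1, 15))  # group by sample ID
  vapply(groups, function(x) {
    # radix sorting uses C-locale order, so "A278" sorts after "1774"
    sort(x, method = "radix", decreasing = TRUE)[1]
  }, character(1), USE.NAMES = FALSE)
}

picked <- pick_highest_barcode(c(
  "TCGA-A6-6650-01A-11R-1774-07",
  "TCGA-A6-6650-01A-11R-A278-07"
))
print(picked)
```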

Also, you could additionally remove any FFPE cases: http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html

Hope that helps,

Efstathios

written 16 months ago by svlachavas

Thanks for the information. You mean that in the above example I need to select "TCGA-A6-6650-01A-11R-A278-07", the one with plate number A278?

written 16 months ago by Vasu

Yes, that is correct.

written 16 months ago by svlachavas

Thanks, but I have a question. Please check these:

TCGA-A6-2684-01A-01R-1410-07    
TCGA-A6-2684-01A-01R-A278-07    

TCGA-A6-6650-01A-11R-A278-07    
TCGA-A6-6650-01A-11R-1774-07    

TCGA-A6-2674-01A-02R-A278-07    
TCGA-A6-2674-01A-02R-0821-07    

TCGA-A6-3809-01A-01R-A278-07    
TCGA-A6-3809-01A-01R-1022-07    

TCGA-A6-6780-01A-11R-A278-07    
TCGA-A6-6780-01A-11R-1839-07    

TCGA-A6-3810-01A-01R-1022-07    
TCGA-A6-3810-01A-01R-A278-07    

TCGA-A6-5659-01A-01R-1653-07    
TCGA-A6-5659-01A-01R-A278-07    

TCGA-A6-6781-01A-22R-1928-07    
TCGA-A6-6781-01A-22R-A278-07    

TCGA-A6-5656-01A-21R-A278-07    
TCGA-A6-5656-01A-21R-1839-07

Every one of these pairs includes an aliquot with plate number A278. So do I need to select only those? Also, when I checked the Firebrowse (Broad Institute) data, none of the samples have plate number A278. What should I do now?

modified 16 months ago • written 16 months ago by Vasu

Dear Bioinfo,

You really should not be concerned about this. As I can see from the above, each pair of barcodes belongs to a different patient, so you can follow the approach above. Also take a look at the following link, which discusses these issues, including the possibility that specific samples might not be included. Still, this approach is valid, as they recommend it:

https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon

Also keep in mind, since you use the TCGAbiolinks R package, to perform a quick initial filtering step that first removes any FFPE samples.

For example, take the output of the GDCprepare function for your downloaded dataset, stored in an object named cancer_data:

data_filt <- cancer_data[,  !cancer_data$is_ffpe]
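The same idea can be sketched defensively on a plain data frame of sample annotations, for cases where the `is_ffpe` flag is missing for some samples or the column is absent altogether. This is only an illustration in base R: the `meta` table and its `is_ffpe` values below are made up, and a real GDCprepare object would need the equivalent check on its sample annotation.

```r
# Made-up sample annotation; the is_ffpe values are for illustration only
meta <- data.frame(
  barcode = c("TCGA-A6-2684-01A-01R-A278-07",
              "TCGA-A6-6650-01A-11R-A278-07",
              "TCGA-A6-2674-01A-02R-A278-07"),
  is_ffpe = c(FALSE, TRUE, NA),
  stringsAsFactors = FALSE
)

if ("is_ffpe" %in% names(meta)) {
  # `%in% TRUE` is FALSE for NA, so samples with a missing flag are kept
  keep <- !(meta$is_ffpe %in% TRUE)
} else {
  # No FFPE annotation available: keep every sample
  keep <- rep(TRUE, nrow(meta))
}
meta_filt <- meta[keep, , drop = FALSE]
print(meta_filt$barcode)
```

Whether samples with a missing flag should be kept or dropped is a judgment call; the sketch keeps them.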
modified 16 months ago • written 16 months ago by svlachavas

Thank you very much for the information.

written 16 months ago by Vasu

I downloaded the CNV data with TCGAbiolinks and found no column called is_ffpe. Is there an easy way to apply the rules to multiple aliquots?

written 13 months ago by Shixiang

Check this if multiple aliquots exist

written 13 months ago by Vasu

Thanks, I wrote a function to solve this.

written 13 months ago by Shixiang

Thanks! Just to confirm: your function filters out FFPE? Does it do anything else?

modified 13 months ago • written 13 months ago by Kevin Blighe

This function filters barcodes following the rules provided by the Broad Institute, which consist of two parts: the Analyte Replicate Filter and the Sort Replicate Filter. I have updated the function so that it can also filter out samples marked as FFPE by the Broad, and I tested it.
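For readers who want the gist of the two filters without opening the full function, here is a much simplified sketch for RNA aliquots only. It is not the `tcga_replicateFilter()` code; `filter_rna_replicates()` is a hypothetical helper, and the analyte precedence (H over R over T) is my reading of the Broad rules, so please check it against their documentation before relying on it.

```r
# Hypothetical helper, not the gist's tcga_replicateFilter()
filter_rna_replicates <- function(barcodes) {
  groups <- split(barcodes, substr(barcodes, 1, 15))  # group by sample ID
  vapply(groups, function(x) {
    analyte <- substr(x, 20, 20)  # analyte letter, e.g. H, R or T
    # Analyte filter (assumed precedence): H preferred over R, R over T
    pref <- c("H", "R", "T")
    best <- pref[pref %in% analyte]
    if (length(best) > 0) x <- x[analyte == best[1]]
    # Sort filter: among what is left, take the highest barcode (C-locale order)
    sort(x, method = "radix", decreasing = TRUE)[1]
  }, character(1), USE.NAMES = FALSE)
}

chosen <- filter_rna_replicates(c(
  "TCGA-55-7913-01B-11H-2237-01",
  "TCGA-55-7913-01B-11R-2237-01",
  "TCGA-55-7913-01B-11T-2237-01"
))
print(chosen)
```

The sketch ignores DNA analytes and FFPE handling entirely; those are covered by the full function.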

The 4th and 5th barcodes are FFPE samples.

> tcga_replicateFilter(tsb = c("TCGA-55-7913-01B-11H-2237-01", "TCGA-55-7913-01B-11R-2237-01", "TCGA-55-7913-01B-11T-2237-01", "TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01"), analyte_target = "RNA")
ooo Filter barcodes successfully!
[1] "TCGA-55-7913-01B-11H-2237-01" "TCGA-44-2656-01B-06D-A273-01"
> tcga_replicateFilter(tsb = c("TCGA-55-7913-01B-11H-2237-01", "TCGA-55-7913-01B-11R-2237-01", "TCGA-55-7913-01B-11T-2237-01", "TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01"), analyte_target = "RNA", filter_FFPE = TRUE, full_barcode = TRUE)
ooo Filter barcodes successfully!
[1] "TCGA-55-7913-01B-11H-2237-01"

BTW,

TCGA does not use FFPE samples to generate data for DNA or RNA analyses.

TCGA only characterized samples that were frozen soon after surgery to prevent degradation of the RNA and DNA. FFPE (formalin fixed paraffin embedded) samples were not used because of potential changes to the RNA and DNA that may arise from the fixation process.

It seems that the Broad Institute does list FFPE cases: http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html, but all of them are of type FPPP, so there is no need to worry about this.

written 13 months ago by Shixiang

Shixiang wrote:

Following the rules from the Broad Institute, I wrote an R function to solve this problem and tested it using LUAD tumor barcodes.

I put it in a GitHub gist, tcga_replicateFilter.R, because of the word limit on Biostars. This function may help people who have related questions.

I just checked the data @bioinfo showed:

library(readr)  # provides read_tsv()
test_data = read_tsv("
                     TCGA-A6-2684-01A-01R-1410-07    
                     TCGA-A6-2684-01A-01R-A278-07    
                     TCGA-A6-6650-01A-11R-A278-07    
                     TCGA-A6-6650-01A-11R-1774-07    
                     TCGA-A6-2674-01A-02R-A278-07    
                     TCGA-A6-2674-01A-02R-0821-07    
                     TCGA-A6-3809-01A-01R-A278-07    
                     TCGA-A6-3809-01A-01R-1022-07    
                     TCGA-A6-6780-01A-11R-A278-07    
                     TCGA-A6-6780-01A-11R-1839-07    
                     TCGA-A6-3810-01A-01R-1022-07    
                     TCGA-A6-3810-01A-01R-A278-07    
                     TCGA-A6-5659-01A-01R-1653-07    
                     TCGA-A6-5659-01A-01R-A278-07    
                     TCGA-A6-6781-01A-22R-1928-07    
                     TCGA-A6-6781-01A-22R-A278-07    
                     TCGA-A6-5656-01A-21R-A278-07    
                     TCGA-A6-5656-01A-21R-1839-07", col_names=FALSE)
tcga_replicateFilter(test_data$X1)

The result is:

> tcga_replicateFilter(test_data$X1)
ooo Filter barcodes successfully!
[1] "TCGA-A6-2684-01A-01R-A278-07" "TCGA-A6-6650-01A-11R-A278-07" "TCGA-A6-2674-01A-02R-A278-07"
[4] "TCGA-A6-3809-01A-01R-A278-07" "TCGA-A6-6780-01A-11R-A278-07" "TCGA-A6-3810-01A-01R-A278-07"
[7] "TCGA-A6-5659-01A-01R-A278-07" "TCGA-A6-6781-01A-22R-A278-07" "TCGA-A6-5656-01A-21R-A278-07"

If it is wrong, please let me know.

I also put it on GitHub.

modified 7 months ago • written 13 months ago by Shixiang