Question

Samples with same TCGA barcode in TCGA data

5

6.1 years ago

Vasu ▴ 770

Hi,

I downloaded TCGA RNA-seq data using the TCGAbiolinks package. In the data I see that there are a few samples that share the same sample ID in their barcodes. Which sample should I prefer?

For example:

TCGA-A6-2684-01A-01R-1410-07
TCGA-A6-2684-01A-01R-A278-07

From the TCGA barcodes I see that the plate is different (1410 vs. A278) but the sample ID (TCGA-A6-2684-01) is the same. Which one should I prefer for the analysis? Do I need to keep both? If I go by sample ID, they will be duplicate samples.

RNA-Seq tcga gdc • 7.9k views

ADD COMMENT • link updated 5.8 years ago by Shixiang ▴ 100 • written 6.1 years ago by Vasu ▴ 770
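
To make the comparison in the question concrete, here is a minimal sketch in base R that splits the two barcodes into their fields and flags aliquots sharing a sample ID. The column names (`participant`, `plate`, and so on) are illustrative labels chosen for this example, not something TCGAbiolinks returns.

```r
# Split TCGA aliquot barcodes into their hyphen-separated fields.
barcodes <- c("TCGA-A6-2684-01A-01R-1410-07",
              "TCGA-A6-2684-01A-01R-A278-07")

parts <- do.call(rbind, strsplit(barcodes, "-", fixed = TRUE))

fields <- data.frame(
  barcode     = barcodes,
  participant = parts[, 3],
  sample_type = substr(parts[, 4], 1, 2),  # 01 = primary solid tumor
  vial        = substr(parts[, 4], 3, 3),
  portion     = substr(parts[, 5], 1, 2),
  analyte     = substr(parts[, 5], 3, 3),  # R = RNA
  plate       = parts[, 6],
  center      = parts[, 7],
  stringsAsFactors = FALSE
)

# The first 15 characters (project-TSS-participant-sample type) identify the sample.
fields$sample_id <- substr(barcodes, 1, 15)

# Flag every aliquot whose sample ID occurs more than once.
dup_ids <- fields$sample_id[duplicated(fields$sample_id)]
fields$replicated <- fields$sample_id %in% dup_ids

print(fields[, c("sample_id", "plate", "center", "replicated")])
```

For these two barcodes only the plate field differs, which is what marks them as replicate aliquots of one sample.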

score 4 · Answer 1 · 2018-04-08

4

6.1 years ago

Kevin Blighe 87k

In the example that you have given, they are indeed the same sample and are just different aliquots from the same primary solid tumor (sample type code 01 in the barcode).

I confirmed this by going to the GDC and putting one of the sample names into the search box.

It is for cases like this that I wish the maintainers of these TCGA R packages did more curation of their data, but I admit that it is a lot of data and that, as researchers, we may not even have funding left for such things.

With regard to which one you choose, you could just take the mean value of both, or simply discard one. Either way, reporting this in the methods takes a single line that reviewers will skim over.

Kevin

ADD COMMENT • link 5.6 years ago by Kevin Blighe 87k
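
The two options from this answer (average the aliquots, or discard one) could be sketched as follows. The matrix `expr` and its values are made up for illustration; in practice it would be the expression matrix extracted from the downloaded data.

```r
# Toy expression matrix: genes in rows, aliquots in columns (made-up values).
expr <- matrix(c(10, 0, 5,
                 14, 2, 5,
                  7, 3, 1),
               nrow = 3,
               dimnames = list(c("geneA", "geneB", "geneC"),
                               c("TCGA-A6-2684-01A-01R-1410-07",
                                 "TCGA-A6-2684-01A-01R-A278-07",
                                 "TCGA-A6-6650-01A-11R-1774-07")))

# The first 15 characters of each barcode give the sample ID.
sample_id <- substr(colnames(expr), 1, 15)

# Option 1: average all aliquots that belong to the same sample.
averaged <- sapply(unique(sample_id), function(id) {
  rowMeans(expr[, sample_id == id, drop = FALSE])
})

# Option 2: keep only the first aliquot encountered for each sample.
first_only <- expr[, !duplicated(sample_id), drop = FALSE]

print(averaged)
print(colnames(first_only))
```

One thing to keep in mind is that averaging raw counts can produce non-integer values, which matters if a downstream tool expects integer counts.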

0

Thanks for the reply. I'm thinking of checking the number of genes with zero read counts and selecting the sample with fewer zero-count genes for further analysis. Do you think this is a good idea?

ADD REPLY • link 6.1 years ago by Vasu ▴ 770
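
The zero-count check proposed in this reply might look like the sketch below: count the genes with zero reads in each aliquot, then keep the aliquot with the fewest zeros per sample. The matrix is a made-up stand-in for real counts, and ties are resolved by taking the first aliquot, which is an arbitrary choice.

```r
# Toy count matrix: genes in rows, aliquots in columns (made-up values).
expr <- matrix(c(10, 0, 5, 0,
                 14, 2, 5, 0,
                  7, 3, 1, 4),
               nrow = 4,
               dimnames = list(paste0("gene", 1:4),
                               c("TCGA-A6-2684-01A-01R-1410-07",
                                 "TCGA-A6-2684-01A-01R-A278-07",
                                 "TCGA-A6-6650-01A-11R-1774-07")))

# Number of genes with zero reads in each aliquot.
zero_genes <- colSums(expr == 0)

# Group aliquot barcodes by sample ID (first 15 characters).
sample_id <- substr(colnames(expr), 1, 15)
groups <- split(colnames(expr), sample_id)

# Within each sample, keep the aliquot with the fewest zero-count genes.
keep <- vapply(groups, function(cols) {
  cols[which.min(zero_genes[cols])]
}, character(1))

filtered <- expr[, keep, drop = FALSE]
print(zero_genes)
print(colnames(filtered))
```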

0

Yes, that is also a good idea.

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

0

I would also just check some of the other IDs that are like this, just to be sure.

ADD REPLY • link 6.1 years ago by Kevin Blighe 87k

0

Yes, as a first step I'm checking the sample ID in cBioPortal, and in the next step I will check the gene counts.

ADD REPLY • link 6.1 years ago by Vasu ▴ 770

score 3 · Answer 2 · 2018-04-08

3

6.1 years ago

svlachavas ▴ 790

Just to add a quick note to Kevin's helpful answer: because duplicated samples can occur under various conditions, there are guidelines from the Broad Institute that you could take into account, like the following example from the READ dataset:

http://gdac.broadinstitute.org/runs/sampleReports/latest/READ_Replicate_Samples.html

In detail, imagine you had the samples with the barcodes "TCGA-A6-6650-01A-11R-1774-07" and "TCGA-A6-6650-01A-11R-A278-07".

Following their approach, you should choose the latter because it is the aliquot with the later plate number.

Also, you could additionally remove any FFPE cases: http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html

Hope that helps,

Efstathios

ADD COMMENT • link 6.1 years ago by svlachavas ▴ 790

0

Thanks for the information. You mean that in the above example I need to select "TCGA-A6-6650-01A-11R-A278-07", with plate number A278?

ADD REPLY • link 6.1 years ago by Vasu ▴ 770

0

Yes, that is correct.

ADD REPLY • link 6.1 years ago by svlachavas ▴ 790
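
The "later plate number" choice confirmed above can be approximated by sorting the barcodes of each sample and keeping the last one. This is a minimal sketch rather than the Broad Institute's own implementation: it assumes the aliquots of a sample differ only in portion or plate, and it uses `sort(method = "radix")` so that the comparison is a plain byte-wise one in which `A278` sorts after `1774`.

```r
barcodes <- c("TCGA-A6-6650-01A-11R-1774-07",
              "TCGA-A6-6650-01A-11R-A278-07",
              "TCGA-A6-2674-01A-02R-A278-07",
              "TCGA-A6-2674-01A-02R-0821-07")

# Group barcodes by sample ID (first 15 characters).
sample_id <- substr(barcodes, 1, 15)

# Sort in C-locale (byte) order and keep the last barcode of the group.
pick_last <- function(x) {
  s <- sort(x, method = "radix")
  s[length(s)]
}

selected <- vapply(split(barcodes, sample_id), pick_last, character(1))
print(selected)
```

The result is a character vector named by sample ID, so it can be used directly to subset the columns of an expression matrix.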

0

Thanks, but I have a question. Please check these:

TCGA-A6-2684-01A-01R-1410-07    
TCGA-A6-2684-01A-01R-A278-07    

TCGA-A6-6650-01A-11R-A278-07    
TCGA-A6-6650-01A-11R-1774-07    

TCGA-A6-2674-01A-02R-A278-07    
TCGA-A6-2674-01A-02R-0821-07    

TCGA-A6-3809-01A-01R-A278-07    
TCGA-A6-3809-01A-01R-1022-07    

TCGA-A6-6780-01A-11R-A278-07    
TCGA-A6-6780-01A-11R-1839-07    

TCGA-A6-3810-01A-01R-1022-07    
TCGA-A6-3810-01A-01R-A278-07    

TCGA-A6-5659-01A-01R-1653-07    
TCGA-A6-5659-01A-01R-A278-07    

TCGA-A6-6781-01A-22R-1928-07    
TCGA-A6-6781-01A-22R-A278-07    

TCGA-A6-5656-01A-21R-A278-07    
TCGA-A6-5656-01A-21R-1839-07

All of these samples include an aliquot with plate number A278. So do I need to select only those aliquots? Also, when I checked the FireBrowse (Broad Institute) data, none of the samples had plate number A278. What should I do now?

ADD REPLY • link 6.1 years ago by Vasu ▴ 770

2

Dear Bioinfo,

You should really not be concerned about this. As I can see from above, each pair of barcodes belongs to a different patient, so you can follow the above approach. Also take a look at the following link, which discusses these issues, including the possibility that specific samples might not be included. Still, this approach is valid, as they recommend it:

https://confluence.broadinstitute.org/display/GDAC/FAQ#FAQ-sampleTypesQWhatTCGAsampletypesareFirehosepipelinesexecutedupon

Also keep in mind, since you use the TCGAbiolinks R package, to perform a quick initial filtering to first remove any FFPE samples.

For example, if the output of the GDCprepare function for your downloaded dataset is an object named cancer_data:

data_filt <- cancer_data[, !cancer_data$is_ffpe]

ADD REPLY • link 6.1 years ago by svlachavas ▴ 790
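
A slightly more defensive variant of the FFPE filter above is sketched here on toy data. The objects `counts` and `meta` and the helper `drop_ffpe()` are hypothetical stand-ins written for this example, not part of TCGAbiolinks. The sketch covers two cases the one-liner does not: a missing `is_ffpe` column and `NA` values in it, since a logical index containing `NA` is not a safe subscript.

```r
# Toy stand-ins: a count matrix and per-aliquot metadata (made-up values).
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))
meta <- data.frame(barcode = c("s1", "s2", "s3"),
                   is_ffpe = c(FALSE, TRUE, NA))

# Hypothetical helper: drop columns flagged as FFPE, if the flag exists.
drop_ffpe <- function(counts, meta) {
  if (!"is_ffpe" %in% colnames(meta)) {
    warning("no 'is_ffpe' column found; returning the data unfiltered")
    return(counts)
  }
  # %in% never returns NA, so aliquots with unknown FFPE status are kept.
  keep <- !(meta$is_ffpe %in% TRUE)
  counts[, keep, drop = FALSE]
}

filtered <- drop_ffpe(counts, meta)
print(colnames(filtered))
```

Whether aliquots with unknown FFPE status should be kept or dropped is a judgment call; this sketch keeps them.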

0

Thank you very much for the information.

ADD REPLY • link 6.1 years ago by Vasu ▴ 770

0

I downloaded the CNV data with TCGAbiolinks and found no column called is_ffpe. Is there an easy way to apply the rules to multiple aliquots?

ADD REPLY • link 5.8 years ago by Shixiang ▴ 100

0

Check this if multiple aliquots exist

ADD REPLY • link 5.8 years ago by Vasu ▴ 770

0

Thanks, I wrote a function to solve this.

ADD REPLY • link 5.8 years ago by Shixiang ▴ 100

0

Thanks! Just to confirm: your function filters out FFPE? Does it do anything else?

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

1

This function filters barcodes following the rules provided by the Broad Institute, which consist of two parts: the Analyte Replicate Filter and the Sort Replicate Filter. I updated the function to also filter samples marked as FFPE by the Broad Institute, and tested it.

The 4th and 5th barcodes are FFPE samples.

> tcga_replicateFilter(tsb = c("TCGA-55-7913-01B-11H-2237-01", "TCGA-55-7913-01B-11R-2237-01", "TCGA-55-7913-01B-11T-2237-01", "TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01"), analyte_target = "RNA")
ooo Filter barcodes successfully!
[1] "TCGA-55-7913-01B-11H-2237-01" "TCGA-44-2656-01B-06D-A273-01"
> tcga_replicateFilter(tsb = c("TCGA-55-7913-01B-11H-2237-01", "TCGA-55-7913-01B-11R-2237-01", "TCGA-55-7913-01B-11T-2237-01", "TCGA-44-2656-01B-06D-A271-08", "TCGA-44-2656-01B-06D-A273-01"), analyte_target = "RNA", filter_FFPE = TRUE, full_barcode = TRUE)
ooo Filter barcodes successfully!
[1] "TCGA-55-7913-01B-11H-2237-01"

BTW, TCGA does not use FFPE samples to generate data for DNA or RNA analyses:

TCGA only characterized samples that were frozen soon after surgery to prevent degradation of the RNA and DNA. FFPE (formalin fixed paraffin embedded) samples were not used because of potential changes to the RNA and DNA that may arise from the fixation process.

It seems that the Broad Institute lists FFPE cases at http://gdac.broadinstitute.org/runs/sampleReports/latest/FPPP_FFPE_Cases.html, but they are all of type FPPP, so there is no need to worry about this.

ADD REPLY • link 5.8 years ago by Shixiang ▴ 100
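
For readers who want to see the two-step idea without opening the gist, here is a rough sketch for RNA aliquots only. It is not the `tcga_replicateFilter` function itself. The analyte preference H over R over T is an assumption inferred from the output shown above, and `pick_rna_aliquot()` is a hypothetical helper written for this example.

```r
barcodes <- c("TCGA-55-7913-01B-11H-2237-01",
              "TCGA-55-7913-01B-11R-2237-01",
              "TCGA-55-7913-01B-11T-2237-01",
              "TCGA-A6-6650-01A-11R-1774-07",
              "TCGA-A6-6650-01A-11R-A278-07")

# Assumed RNA analyte preference, best first.
rna_preference <- c("H", "R", "T")

# Hypothetical helper: choose one RNA aliquot among replicates of a sample.
pick_rna_aliquot <- function(x) {
  analyte <- substr(x, 20, 20)            # analyte letter of the barcode
  rank <- match(analyte, rna_preference)  # NA for non-RNA analytes
  if (all(is.na(rank))) return(NA_character_)
  # Step 1 (analyte filter): keep aliquots with the preferred analyte.
  best <- x[which(rank == min(rank, na.rm = TRUE))]
  # Step 2 (sort filter): keep the last barcode in byte order.
  s <- sort(best, method = "radix")
  s[length(s)]
}

sample_id <- substr(barcodes, 1, 15)
selected <- vapply(split(barcodes, sample_id), pick_rna_aliquot, character(1))
print(selected)
```

Samples with no RNA aliquot come back as `NA` in this sketch, so they can be dropped or inspected separately.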

score 1 · Answer 3 · 2018-07-18

Following the rules from the Broad Institute, I wrote an R function to solve this problem and tested it using LUAD tumor barcodes.

I put it in a GitHub gist, tcga_replicateFilter.R, due to the word limit on Biostars. This function may help people with related questions.

I just checked the data @bioinfo showed:

library(readr)  # provides read_tsv()
test_data = read_tsv("
                     TCGA-A6-2684-01A-01R-1410-07    
                     TCGA-A6-2684-01A-01R-A278-07    
                     TCGA-A6-6650-01A-11R-A278-07    
                     TCGA-A6-6650-01A-11R-1774-07    
                     TCGA-A6-2674-01A-02R-A278-07    
                     TCGA-A6-2674-01A-02R-0821-07    
                     TCGA-A6-3809-01A-01R-A278-07    
                     TCGA-A6-3809-01A-01R-1022-07    
                     TCGA-A6-6780-01A-11R-A278-07    
                     TCGA-A6-6780-01A-11R-1839-07    
                     TCGA-A6-3810-01A-01R-1022-07    
                     TCGA-A6-3810-01A-01R-A278-07    
                     TCGA-A6-5659-01A-01R-1653-07    
                     TCGA-A6-5659-01A-01R-A278-07    
                     TCGA-A6-6781-01A-22R-1928-07    
                     TCGA-A6-6781-01A-22R-A278-07    
                     TCGA-A6-5656-01A-21R-A278-07    
                     TCGA-A6-5656-01A-21R-1839-07", col_names=FALSE)
tcga_replicateFilter(test_data$X1)

The result is:

> tcga_replicateFilter(test_data$X1)
ooo Filter barcodes successfully!
[1] "TCGA-A6-2684-01A-01R-A278-07" "TCGA-A6-6650-01A-11R-A278-07" "TCGA-A6-2674-01A-02R-A278-07"
[4] "TCGA-A6-3809-01A-01R-A278-07" "TCGA-A6-6780-01A-11R-A278-07" "TCGA-A6-3810-01A-01R-A278-07"
[7] "TCGA-A6-5659-01A-01R-A278-07" "TCGA-A6-6781-01A-22R-A278-07" "TCGA-A6-5656-01A-21R-A278-07"

If it is wrong, please let me know.

I also put it on GitHub.