deleting TCGA replicated samples
4.3 years ago
. • 0

Hi everyone, I'm working on TCGA data. I want to have unique samples, but some of my samples are replicated and I don't know how to handle them. I'm not sure whether taking the median of the replicated samples is appropriate, because for solid tumors the two samples I would take the median of might have completely different spatial heterogeneity.

RNA-Seq TCGA aggregate • 1.9k views

Thanks for answering. Because I want to integrate my data with protein data, I have to use only part of the TCGA barcode: the third field, which identifies the participant. For example, in the barcode TCGA-02-0001-01C-01D-0182-01, 0001 is the participant, and that is the part I need.
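A minimal sketch of pulling the participant out of a barcode in base R, using barcodes quoted in this thread. It assumes the barcodes follow the standard dash-separated TCGA layout. The 12-character prefix (project, tissue source site, participant) is shown as well, since it is commonly used as the patient-level key; whether the protein table is keyed the same way is an assumption to check against the TCPA sample IDs.

```r
# Barcodes quoted in this thread
barcodes <- c("TCGA-02-0001-01C-01D-0182-01",
              "TCGA-HZ-A9TJ-01A-11R-A41I-07")

# Split on "-" and take the third field (the participant code)
fields <- strsplit(barcodes, "-", fixed = TRUE)
participant <- vapply(fields, function(x) x[3], character(1))

# First 12 characters: project-TSS-participant
patient_id <- substr(barcodes, 1, 12)

print(data.frame(barcode = barcodes, participant, patient_id))
```

Matching on `patient_id` rather than on `participant` alone avoids relying on the participant code being unique across tissue source sites.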


Specifically what data are you working on? Where do you get the data from? Could you post an example of a duplicated sample id?

4.3 years ago
MatthewP ★ 1.4k

Don't take the median. You need to select one of the samples. You may need the full TCGA barcode, or other information such as is_ffpe, to help you select a single sample.

This page explains the TCGA barcode. You need to download the other related files, such as _MANIFEST.txt_ and the _metadata file_, where you can get more information about your sample/data.
Example full barcode from metadata:

  "associated_entities": [
    {
      "entity_id": "90e6e8a1-98b3-4f38-92ef-df460d78d657", 
      "case_id": "ada19f65-5256-4c79-b3b9-7b9da69be437", 
      "entity_submitter_id": "TCGA-E7-A97Q-01A-11R-A38B-07", 
      "entity_type": "aliquot"
    }
  ],
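One way to select a single sample per participant instead of taking a median could look like the sketch below. It uses the four barcodes the asker reports elsewhere in this thread. The selection rule (keep only primary tumor samples, code `01`, then keep the first barcode per patient after sorting) is an illustrative assumption, not an official TCGA rule; metadata such as is_ffpe could be used as a tie-breaker instead.

```r
# Barcodes reported by the asker in this thread
barcodes <- c("TCGA-HZ-A9TJ-01A-11R-A41I-07",
              "TCGA-HZ-A9TJ-06A-11R-A41B-07",
              "TCGA-H6-A45N-01A-11R-A26U-07",
              "TCGA-H6-A45N-11A-12R-A26U-07")

# Characters 14-15 hold the sample type code; "01" is primary solid tumor
is_primary <- substr(barcodes, 14, 15) == "01"
tumor <- sort(barcodes[is_primary])

# Assumed tie-break: keep the first remaining barcode for each patient prefix
keep <- tumor[!duplicated(substr(tumor, 1, 12))]
print(keep)

# The expression matrix could then be subset by column name,
# e.g. PAADMatrix[, keep] for the matrix built in this thread.
```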
4.3 years ago
. • 0

Hi. I got the mRNA data (the FPKM data) from TCGA with R code, and the protein data from TCPA. Examples of my duplicated data are below:

    TCGA-HZ-A9TJ-01A-11R-A41I-07
    TCGA-HZ-A9TJ-06A-11R-A41B-07

    TCGA-H6-A45N-01A-11R-A26U-07
    TCGA-H6-A45N-11A-12R-A26U-07
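It may help to decode the sample type field of these barcodes before treating them as replicates. The sketch below reads characters 14-15 (the sample type code) and maps them with a small lookup table that covers only the three codes appearing here: `01` primary solid tumor, `06` metastatic, `11` solid tissue normal. The lookup is deliberately partial; the full code table is in the GDC documentation.

```r
barcodes <- c("TCGA-HZ-A9TJ-01A-11R-A41I-07",
              "TCGA-HZ-A9TJ-06A-11R-A41B-07",
              "TCGA-H6-A45N-01A-11R-A26U-07",
              "TCGA-H6-A45N-11A-12R-A26U-07")

# Partial lookup of TCGA sample type codes (only those seen above)
sample_types <- c("01" = "Primary Solid Tumor",
                  "06" = "Metastatic",
                  "11" = "Solid Tissue Normal")

sample_code <- substr(barcodes, 14, 15)
decoded <- data.frame(patient_id  = substr(barcodes, 1, 12),
                      sample_code = sample_code,
                      sample_type = unname(sample_types[sample_code]))
print(decoded)
```

Under this reading, each pair shares a patient but differs in sample type, so filtering by sample type already separates them.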

The R code I used to get the data is below:

    library(TCGAbiolinks)
    library(dplyr)
    library(DT)
    library(SummarizedExperiment)

    # 1

    query1 <- GDCquery(project = "TCGA-PAAD",
                       data.category = "Transcriptome Profiling",
                       data.type = "Gene Expression Quantification",
                       workflow.type = "HTSeq - Counts")

    # workflow.type = "HTSeq - FPKM-UQ"

    df <- GDCprepare(query1, save = TRUE,
                     save.filename = "TCGA-PAAD_dataframe.rda",
                     summarizedExperiment = FALSE)
    write.csv(df, file = "count.csv")

    # 2

    query <- GDCquery(project = "TCGA-PAAD",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      workflow.type = "HTSeq - Counts")

    # Download a list of barcodes with platform IlluminaHiSeq_RNASeqV2
    GDCdownload(query)

    # Prepare expression matrix with geneID in the rows and samples (barcode) in the columns
    # rsem.genes.results as values
    PAADRnaseqSE <- GDCprepare(query)

    PAADMatrix <- assay(PAADRnaseqSE, "HTSeq - Counts") # or PAADMatrix <- assay(PAADRnaseqSE, "raw_count")

    # For gene expression if you need to see a boxplot correlation and AAIC plot to define outliers you can run
    PAADRnaseq_CorOutliers <- TCGAanalyze_Preprocessing(PAADRnaseqSE)

    # quantile filter of genes
    dataFilt <- TCGAanalyze_Filtering(tabDF = PAADRnaseq_CorOutliers,
                                      method = "quantile",
                                      qnt.cut = 0.25)

    # selection of normal samples "NT"
    samplesNT <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                       typesample = c("NT"))

    # selection of tumor samples "TP"
    samplesTP <- TCGAquery_SampleTypes(barcode = colnames(dataFilt),
                                       typesample = c("TP"))

    # Diff.expr.analysis (DEA)
    dataDEGs <- TCGAanalyze_DEA(mat1 = dataFilt[, samplesNT],
                                mat2 = dataFilt[, samplesTP],
                                Cond1type = "Normal",
                                Cond2type = "Tumor",
                                fdr.cut = 0.01,
                                logFC.cut = 1,
                                method = "glmLRT")

    # DEGs table with expression values in normal and tumor samples
    dataDEGsFiltLevel <- TCGAanalyze_LevelTab(dataDEGs, "Tumor", "Normal",
                                              dataFilt[, samplesTP],
                                              dataFilt[, samplesNT])
    write.csv(dataDEGsFiltLevel, file = "DEGs.csv")


Please update your question with the code instead of supplying it as an answer. Also use proper code formatting instead of pasting it as plain text, to ensure readability.
