TCGA data query (GDCquery): some "external_gene_name" values are missing
12 months ago

Hi, I just got some odd output from a TCGA dataset. As the picture below shows, some of the "external_gene_name" values are missing. Could you please help me with this issue? Thank you.


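library(TCGAbiolinks)

# Query HTSeq count files for normal and primary tumor TCGA-BRCA samples
query.seq <- GDCquery(project = "TCGA-BRCA",
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification",
                      sample.type = c("Solid Tissue Normal", "Primary Tumor"),
                      workflow.type = "HTSeq - Counts")

# Download the queried files, then assemble them into a SummarizedExperiment
GDCdownload(query.seq)
seq.brca <- GDCprepare(query = query.seq, summarizedExperiment = TRUE)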

12 months ago
GenoMax 117k

ENSG00000281904 is annotated as a novel gene, which is why it has no official gene name. This gene was manually annotated by Ensembl. Other genes with missing names may be similar cases.
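To see how many genes are affected, one option is to flag the rows with a missing name and fall back to the Ensembl ID as a label. The sketch below uses a small hand-made data frame as a stand-in for the gene annotation (for example, what `rowData(seq.brca)` might contain); the `label` column and the `is_unnamed` variable are illustrative names, not part of TCGAbiolinks.

```r
# Toy annotation table standing in for the real gene annotation (assumption).
gene_info <- data.frame(
  ensembl_gene_id    = c("ENSG00000000003", "ENSG00000281904", "ENSG00000000005"),
  external_gene_name = c("TSPAN6", NA, "TNMD"),
  stringsAsFactors   = FALSE
)

# Flag rows whose gene name is missing (NA) or an empty string.
is_unnamed <- is.na(gene_info$external_gene_name) |
  gene_info$external_gene_name == ""

# Fall back to the Ensembl ID so every row keeps a usable label.
gene_info$label <- ifelse(is_unnamed,
                          gene_info$ensembl_gene_id,
                          gene_info$external_gene_name)

# Show only the genes without an official name.
print(gene_info[is_unnamed, ])
```

Keeping the Ensembl ID as the label means unnamed genes can stay in the analysis without being confused with each other.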


Thank you. So do you mean that I can ignore them in my analysis? Actually, when I used the "gencode.gene.info.v22.csv" file from TCGA, it assigned names to some of them (the highlighted part in the first attached picture).

On the other hand, a friend of mine obtained gene names from "gencode.gene.info.v22.csv" a year ago, and they are not the same as the ones in figure 1. The names appear to be aliases of each other, for example:

RP11-418H16.1 = AC007389.5

CH17-132F21.5 = AC233263.6

So I'm wondering how I can get the same gene names ("AC007389.5", "AC233263.6", ...)?
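One possible approach, assuming the two sets of names come from different annotation releases that share the same Ensembl gene IDs, is to join the old and new annotation tables on the Ensembl ID with its version suffix removed. The sketch below only illustrates the join: the two name pairs are the ones quoted above, but the IDs (`ENSG_EXAMPLE_A`, `ENSG_EXAMPLE_B`), the table layout, and the helper `strip_version` are hypothetical placeholders.

```r
# Older annotation; the gene IDs are hypothetical placeholders, not real Ensembl IDs.
old_annot <- data.frame(
  gene_id   = c("ENSG_EXAMPLE_A.4", "ENSG_EXAMPLE_B.1"),
  gene_name = c("RP11-418H16.1", "CH17-132F21.5"),
  stringsAsFactors = FALSE
)

# Newer annotation carrying the desired names; same placeholder IDs, other versions.
new_annot <- data.frame(
  gene_id   = c("ENSG_EXAMPLE_B.3", "ENSG_EXAMPLE_A.6"),
  gene_name = c("AC233263.6", "AC007389.5"),
  stringsAsFactors = FALSE
)

# Drop the trailing version (".4", ".6", ...) so IDs from different releases match.
strip_version <- function(ids) sub("\\.[0-9]+$", "", ids)

# For each old gene, find its row in the new table by unversioned ID.
idx <- match(strip_version(old_annot$gene_id), strip_version(new_annot$gene_id))
old_annot$new_name <- new_annot$gene_name[idx]

print(old_annot[, c("gene_name", "new_name")])
```

Genes that are absent from the newer table get `NA` in `new_name`, which makes retired or renamed entries easy to spot.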


I mean, are you realistically interested in genes like these? They probably have 0 counts across all of your samples. Unless you are specifically studying lowly expressed predicted genes, it may be simpler to filter them out.
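Filtering out genes with no reads can be sketched as follows. This is a minimal example on a made-up count matrix; the threshold (at least one read in any sample) is an assumption, and a stricter cutoff may suit a real analysis better.

```r
# Toy count matrix (genes x samples); the numbers are made up for illustration.
counts <- matrix(
  c(120, 98, 143,
      0,  0,   0,
      3,  0,   1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC"), c("s1", "s2", "s3"))
)

# Keep genes that have at least one read in any sample.
keep <- rowSums(counts) > 0
filtered <- counts[keep, , drop = FALSE]

print(rownames(filtered))
```

With a SummarizedExperiment such as `seq.brca`, the same logical vector could presumably be computed from `assay(seq.brca)` and used to subset the rows of the object.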


Thanks again. Yes, I do need them in my analysis, provided I can get gene names such as "AC007389.5" instead of "RP11-418H16.1", as I mentioned above.