Question

Correct biomart version for TCGA data

3 months ago

ramiro.barrantes • 0

We would like to do some analyses finding variants in tumor/normal pairs from TCGA. However, I think I am doing something wrong as sometimes the coordinates are not quite accurate. Would appreciate any suggestions.

1. First, we downloaded the reference genome from https://gdc.cancer.gov/about-data/gdc-data-processing/gdc-reference-files
2. We ran sarek to call variants, pointing it to the TCGA genome GRCh38.d1.vd1.fa.tar.gz (fetched from the website in (1)).
3. We built a SnpEff database from the GENCODE annotation used by TCGA, gencode.v36.annotation.gtf.gz, and the same genome as in (2).
4. We then ran SnpEff and annotated the variants with the gene information from the database built in (3).

However, if I try to fetch a sequence, say ENST00000518579.5, using biomart:

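library(biomaRt)
# Connect to Ensembl release 102, human gene dataset
genome <- useEnsembl(biomart = "ensembl", version = 102, dataset = "hsapiens_gene_ensembl")
# Fetch the peptide sequence for the transcript (ID given without its version suffix)
referenceProteinSequence <- biomaRt::getBM(attributes = c("peptide", "hgnc_symbol", "protein_id"),
                                               filters = "ensembl_transcript_id",
                                               values = "ENST00000518579",
                                               mart = genome)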

The variant that is found in Sarek says that the reference is R167, but that amino acid in the reference protein sequence from BioMart is a T:

strsplit(referenceProteinSequence$peptide, "")[[1]][167]
"T"

This does not happen all the time, but sometimes. Do you at all have any kind of tip as to how I can figure out what is happening? I think I am probably pointing to the wrong genome version in biomart? Any help appreciated.

TCGA sarek biomart • 365 views

updated 3 months ago by Maxime Garcia ▴ 340 • written 3 months ago by ramiro.barrantes • 0

score 1 · Answer 1 · 2024-02-03

Hi, we saw your question on the nf-core slack, and one of our nicest contributors looked into it. Since he doesn't have a biostars account (yet), here's his reply:

I've looked into this, and there seems to be some confusion. The title asks for the "correct BioMart version for TCGA", which I don't think makes sense, as TCGA does not enter the picture here except as the attributed source of the genome. The genome itself is a bundle of GRCh38, whose base sequence is stable (i.e. still the same as it was over ten years ago), and various decoys, which are not relevant here.

Then there is the apparent discrepancy between an annotation by snpEff and looking the variant up in the Ensembl system (BioMart is being used, but that is merely an aggregator of information from inside and outside the Ensembl universe; the transcript and other information quoted is all from Ensembl). The poster says that the reference residue at position 167 of the protein translation of the Ensembl transcript he looked up is threonine (T); this appears to be consistent with the current Ensembl release, see https://www.ensembl.org/Homo_sapiens/Transcript/ProtVariations?db=core;g=ENSG00000070501;r=8:42357174-42371752;t=ENST00000518579 and scroll to residue 167. snpEff, however, apparently identifies the residue as arginine (R).

A big question mark here is that this comparison takes place in protein space, i.e. on a translation of the nucleotide sequence. Since most genes have several transcripts, and each transcript potentially has its own translation to an amino acid sequence, a different choice of transcript by the poster and by snpEff can account for this difference in two ways.

First, the different amino acid sequence causes the residue whose codon sits at the same genomic coordinate (where the variant is) to have a different number in the polypeptide chain, so residue 167 in a different protein of the same gene really is a different place in the genome. This is most likely the issue here.

Second, a different transcript could also cause the protein sequence to change completely (due to a frameshift), so while the residue might be at approximately the same position, the translation changed. I believe this is relatively rare.

The poster does not give details about which transcript/protein snpEff used, but since his choice of transcript was one of eight possible ones, and not the "canonical" one, it was most likely different.

The question "which Ensembl/BioMart version corresponds best to my snpEff annotation" would be valid and potentially relevant. The answer is: if you look at the cache version that snpEff used, e.g. "GRCh38.105", the number at the end (here 105) corresponds to the Ensembl release, and on the Ensembl website you can go to the corresponding archive page (or use the API of that version). For well-established genes, though, this is pretty stable and the exact version likely does not matter (as long as the same genome assembly is being used).
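One way to check the transcript-choice explanation is to list the residue at position 167 in every protein-coding transcript of the gene, using the Ensembl release that matches the snpEff database. The following is a minimal sketch in R with biomaRt, not a definitive recipe. The helper names (ensembl_release_from_snpeff, residue_at, compare_transcripts) are made up for illustration, and the sketch assumes the snpEff database name ends in ".<release>" as in "GRCh38.105"; a custom-built database will not follow that pattern, in which case the release has to be supplied by hand.

```r
# Sketch only: the three helpers below are hypothetical names,
# not functions from snpEff or biomaRt.

# Extract the Ensembl release from a snpEff database name such as "GRCh38.105".
# Returns NA when the name has no trailing ".<digits>" (e.g. a custom database).
ensembl_release_from_snpeff <- function(db_name) {
  m <- regmatches(db_name, regexpr("(?<=\\.)[0-9]+$", db_name, perl = TRUE))
  if (length(m) == 0) NA_integer_ else as.integer(m)
}

# Residue at a 1-based position for each peptide in a character vector.
# Positions beyond the end of a peptide give NA.
residue_at <- function(peptides, pos) {
  res <- substr(peptides, pos, pos)
  res[nchar(peptides) < pos] <- NA_character_
  res
}

# Network-dependent part, wrapped in a function so that nothing is
# downloaded until it is called explicitly. Requires the biomaRt package.
compare_transcripts <- function(gene_id, pos, release) {
  mart <- biomaRt::useEnsembl(biomart = "ensembl",
                              dataset = "hsapiens_gene_ensembl",
                              version = release)
  # All transcripts of the gene, with their biotype
  tx <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "transcript_biotype"),
                       filters = "ensembl_gene_id",
                       values = gene_id,
                       mart = mart)
  coding <- tx$ensembl_transcript_id[tx$transcript_biotype == "protein_coding"]
  # Peptide sequence for each protein-coding transcript
  seqs <- biomaRt::getSequence(id = coding,
                               type = "ensembl_transcript_id",
                               seqType = "peptide",
                               mart = mart)
  data.frame(transcript = seqs$ensembl_transcript_id,
             peptide_nchar = nchar(seqs$peptide),  # may count a trailing "*"
             residue = residue_at(seqs$peptide, pos))
}

# Example call (needs internet access); the gene ID is taken from the Ensembl
# link above, and 102 is the release used in the question:
# compare_transcripts("ENSG00000070501", 167, 102)
```

If one of the other transcripts shows R at position 167, that transcript is probably the one snpEff reported against. The transcript ID in the snpEff annotation itself is the more direct thing to compare.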

Do not hesitate to come and ask questions on the nf-core slack; we have a dedicated channel for Sarek (#sarek): https://nf-co.re/join