Question

Why does assigning genes with biomart give me different values than using a transcripts_to_genes.txt file?

0

Entering edit mode

10 weeks ago

bioinfo ▴ 150

Hello,

I run kallisto on my data and I am in the process of assigning gene names to my data. I tried to do this in 2 different ways but I get different results. The first way I tried is shown below using the t2g.py from https://github.com/pachterlab/kallisto-transcriptome-indices/releases:

#Create the transcripts_to_genes file
python t2g.py --use_version <Homo_sapiens.GRCh38.111.gtf> transcripts_to_genesv111.txt

#Assign gene names:

  t2g <- read.delim("/data/transcripts_to_genesv111", sep="", header=FALSE)
  head(t2g)
  txi.kallisto <- tximport(tsv_files, type = "kallisto", tx2gene = t2g[,c(1,3)], ignoreTxVersion = FALSE)

The second way I was using biomart:

mart <- biomaRt::useMart("ensembl", "hsapiens_gene_ensembl", host= "https://jan2024.archive.ensembl.org")
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "external_gene_name", "ensembl_gene_id" ), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id, ext_gene = external_gene_name)

I noticed that the t2g file I created using biomart gives me different results.

For example for gene ZSCAN2 the t2g file created the first way seems to be associated with gene id ENSG00000176371.14. In the t2g file created with biomart is is associated with ENSG00000176371.14 and ENSG00000291625.1 . ENSG00000291625.1 is not in the gtf file from that version but it is in Homo_sapiens.GRCh38.cdna.all.fa.gz as shown below:

ENST00000708196.1 cdna scaffold:GRCh38:HG2280_PATCH:1092832:1112495:1 gene:ENSG00000291625.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ZSCAN2 description:zinc finger and SCAN domain containing 2 [Source:HGNC Symbol;Acc:HGNC:20994]

What is the best way to assign the gene names? Also is is better to assign "external_gene_name" or "ensembl_gene_id" for the output?

Thank you

biomart RNAseq kallisto • 367 views

ADD COMMENT • link 10 weeks ago by bioinfo ▴ 150

2

Entering edit mode

According to HUGO there is only one ZSCAN2 gene and it points to the first Ensembl gene ID (ENSG00000176371). Ensembl is annotating another copy on a scaffold patch so there seems to be some new information here that would need to be resolved over time.