Why does assigning genes with biomart give me different values than using a transcripts_to_genes.txt file?
5 weeks ago
bioinfo ▴ 150


I ran kallisto on my data and am now assigning gene names to the results. I tried two approaches, but they give different results. The first uses t2g.py from https://github.com/pachterlab/kallisto-transcriptome-indices/releases:

#Create the transcripts_to_genes file
python t2g.py --use_version < Homo_sapiens.GRCh38.111.gtf > transcripts_to_genesv111.txt

#Assign gene names:

  t2g <- read.delim("/data/transcripts_to_genesv111", sep="", header=FALSE)
  txi.kallisto <- tximport(tsv_files, type = "kallisto", tx2gene = t2g[,c(1,3)], ignoreTxVersion = FALSE)

The second way uses biomaRt:

mart <- biomaRt::useMart("ensembl", "hsapiens_gene_ensembl", host= "https://jan2024.archive.ensembl.org")
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "external_gene_name", "ensembl_gene_id" ), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id, ext_gene = external_gene_name)

I noticed that the t2g file I created using biomart gives me different results.

For example, in the t2g file created the first way, ZSCAN2 seems to be associated with gene ID ENSG00000176371.14. In the t2g file created with biomart, it is associated with both ENSG00000176371.14 and ENSG00000291625.1. ENSG00000291625.1 is not in the GTF file from that version, but it is in Homo_sapiens.GRCh38.cdna.all.fa.gz, as shown below:

ENST00000708196.1 cdna scaffold:GRCh38:HG2280_PATCH:1092832:1112495:1 gene:ENSG00000291625.1 gene_biotype:protein_coding transcript_biotype:protein_coding gene_symbol:ZSCAN2 description:zinc finger and SCAN domain containing 2 [Source:HGNC Symbol;Acc:HGNC:20994]
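To check where a record like this comes from, the header can be split into its fields. The following is a minimal Python sketch, not part of kallisto or t2g.py. The function name `parse_cdna_header` is hypothetical, and the `_PATCH` suffix test is only a naming heuristic based on this one header, not an official rule.

```python
def parse_cdna_header(header):
    """Split an Ensembl cDNA FASTA header line into a dict of its fields."""
    # The description is free text with spaces, so cut it off first.
    head, _, description = header.lstrip(">").partition(" description:")
    tokens = head.split()
    record = {"transcript_id": tokens[0], "description": description}
    # The third token is the location: coord_system:assembly:seq_region:start:end:strand
    coord_system, assembly, seq_region = tokens[2].split(":")[:3]
    record["coord_system"] = coord_system
    record["seq_region"] = seq_region
    # The remaining tokens are key:value pairs (gene, gene_biotype, gene_symbol, ...).
    for token in tokens[3:]:
        key, _, value = token.partition(":")
        record[key] = value
    # Assumption: patch regions are recognised by their name ending in "_PATCH".
    record["is_patch"] = seq_region.endswith("_PATCH")
    return record


header = (
    "ENST00000708196.1 cdna scaffold:GRCh38:HG2280_PATCH:1092832:1112495:1 "
    "gene:ENSG00000291625.1 gene_biotype:protein_coding "
    "transcript_biotype:protein_coding gene_symbol:ZSCAN2 "
    "description:zinc finger and SCAN domain containing 2 "
    "[Source:HGNC Symbol;Acc:HGNC:20994]"
)
rec = parse_cdna_header(header)
print(rec["gene"], rec["gene_symbol"], rec["seq_region"], rec["is_patch"])
# prints: ENSG00000291625.1 ZSCAN2 HG2280_PATCH True
```

For this header, the sequence region is HG2280_PATCH rather than a chromosome name, so the transcript sits on a patch scaffold.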

What is the best way to assign the gene names? Also, is it better to use "external_gene_name" or "ensembl_gene_id" for the output?

Thank you

biomart RNAseq kallisto • 321 views

According to HUGO, there is only one ZSCAN2 gene, and it points to the first Ensembl gene ID (ENSG00000176371). Ensembl is annotating another copy on a scaffold patch, so there seems to be some new information here that would need to be resolved over time.
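To see how many gene symbols are affected in this way, one option is to list every symbol that maps to more than one gene ID in a transcript-to-gene table. The following is a rough Python sketch. The function name `symbols_with_multiple_genes` and the row layout (transcript ID, gene ID, gene symbol) are assumptions for illustration, and rows marked as placeholders are made up.

```python
from collections import defaultdict


def symbols_with_multiple_genes(rows):
    """Return {symbol: sorted gene IDs} for symbols linked to more than one gene ID.

    rows is an iterable of (transcript_id, gene_id, gene_symbol) tuples.
    """
    genes_by_symbol = defaultdict(set)
    for _transcript_id, gene_id, symbol in rows:
        if symbol:  # skip transcripts without a gene symbol
            genes_by_symbol[symbol].add(gene_id)
    return {
        symbol: sorted(gene_ids)
        for symbol, gene_ids in genes_by_symbol.items()
        if len(gene_ids) > 1
    }


rows = [
    # Transcript and gene taken from the cDNA header quoted in the question.
    ("ENST00000708196.1", "ENSG00000291625.1", "ZSCAN2"),
    # Placeholder transcript ID; the gene ID is the one named in the question.
    ("ENST_PLACEHOLDER_A.1", "ENSG00000176371.14", "ZSCAN2"),
    # Entirely made-up row for a symbol with a single gene ID.
    ("ENST_PLACEHOLDER_B.1", "ENSG_PLACEHOLDER.1", "GENE_X"),
]
duplicates = symbols_with_multiple_genes(rows)
print(duplicates)
# prints: {'ZSCAN2': ['ENSG00000176371.14', 'ENSG00000291625.1']}
```

Summarising to "ensembl_gene_id" keeps such copies as separate rows, while summarising to "external_gene_name" pools their counts under one symbol. Which is preferable depends on the analysis.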


Thank you! I ended up using the transcripts_to_genes instead of biomart.
