Question

Missing gene symbols in bustools output

0

Entering edit mode

4.2 years ago

Barry Digby ★ 1.3k

Hi, quick question,

For the bustools output, is the <prefix>.genes.txt file supposed to contain matching gene symbols in the output? Reading this galaxy tutorial suggests cellranger gene output files have corresponding symbols. It seems logical that bustools would do the same (I mean why not, it has all the information it needs to do so). However, for 2 separate analyses, my file has contained ENSG ID's only.

I have tried a few ways to fix this -- stripping id version from transcripts.txt of kallisto bus output and the transcripts_to_genes.txt file supplied to bustools count did not fix it.

Subsetting a file containing ensembl_gene_id_version and external_gene_name using the <prefix>.genes.txt file in R (bioMart query) resulted in a shorter output file, which scanpy threw an error when attempting to load it into python.

The end goal is to put these files into scanpy for analysis, but I'm not sure at what stage in the pipeline I am supposed to diagnose this.

Thanks in advance,

Barry

bustools scanpy kallisto single-cell scRNA • 2.2k views

ADD COMMENT • link updated 21 months ago by s.koturan • 0 • written 4.2 years ago by Barry Digby ★ 1.3k

0

Entering edit mode

A couple of points on the content of your post:

There is no need to add (scRNA) (with the parentheses) to the title, mentioning it in the body (and adding a tag) will suffice.
The way to format content as code in-line is to use one back-quote, not three. Three backquotes are used to format a block. Please use `text`, not ```text```

ADD REPLY • link 4.2 years ago by Ram 43k

0

Entering edit mode

Hi Barry, did you resolve this issue? I'm facing the same problem- any suggestions? Thanks!

ADD REPLY • link 21 months ago by s.koturan • 0

score 0 · Answer 1 · 2020-01-24

0

Entering edit mode

4.2 years ago

pmelsted ▴ 120

The transcript_to_genes.txt file might contain a third column with the gene symbol, depending on how it was generated.

This script https://github.com/BUStools/BUS_notebooks_python/blob/master/utils/transcript2gene.py can generate a t2g file from a GTF, if you want version names you can use the --use-versions option.

You should use a GTF that matches the FASTA file used to construct the index (this way the versions in both transcript and gene ids match with versions).

The problem with relying solely on gene symbols is that they are not in 1-1 correspondence with gene symbols. For example ATXN7 corresponds to the gene ids ENSG00000285258, and ENSG00000163635 which are overlapping on chromosome 3. In other cases you can have the same gene on a single chromosome and it's representation on alternate haplotypes (if you use the chr_patch_hapl_scaff GTF file from ensembl). A straight forward solution is to aggregate the gene counts reported by gene symbol.

If you are going to use scanpy anyway, there is a simple solution to use var_names_make_unique(), as shown here https://icb-scanpy.readthedocs-hosted.com/en/stable/api/scanpy.datasets.pbmc3k.html#scanpy-datasets-pbmc3k, but it seems you can run the analysis in scanpy using gene ids as well.

ADD COMMENT • link 4.2 years ago by pmelsted ▴ 120

0

Entering edit mode

Thanks for the response Páll.

I'll integrate the py script into my nextflow script and hopefully get to the bottom of this over the weekend.

I've noticed the ensembl GTF file uses version Id's but the ensembl cDNA file does not. Annoying discrepancy. Anyway I'll get to work and post a solution to this page for anyone else who has the same problem. Thanks for your time :)

ADD REPLY • link 4.2 years ago by Barry Digby ★ 1.3k

0

Entering edit mode

I think I may not have been very clear. I am referring specifically to the genes.txt output from bustools count. In the second link you provided me, the dataset has a genes.tsv file, that has both ENSG ID + symbols. In my own attempts using bustools count, my output has only ENSG ID. This makes your suggestion of adata.var_names_make_unique() # this is unnecessary if using 'gene_ids' inapplicable to my output

I have been extremely careful to match the indexed genome with the annotation files. I used gffread on ensembl's gff3 file to construct a transcriptome. This results in a transcriptome fasta file with "transcript_version=2", matching the ensembl gtf file I use for your tx2gene.py script. (using gffread on the gtf file omits gene_id for some reason). Using your tx2gene.py script on the gtf file, I get a transcripts_to_genes.txt with no version ID's, just like the gff, gtf and cDNA file.

I have also ensured the transcripts.txt file has no ID version as the -t option for bustools count.

(conversely; I have also used gencode.v33 references to make sure version ID's were present in all files mentioned above)

Is there some step I am missing to get gene ID's annotated with symbols for downstream analysis?

ADD REPLY • link 4.2 years ago by Barry Digby ★ 1.3k