Question

How does Cellranger handle different versions of the same gene?

2.3 years ago

markddesimone ▴ 60

I am working with data previously processed with Cellranger v3.0, and I do not have the original FASTQ files. For 21 genes, the genes.tsv file contains two versions of the same gene name, e.g. ABCF2 and ABCF2.1, or TMSB15B and TMSB15B.1.

The gtf file used by cellranger (genes.gtf, downloaded from https://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-GRCh38-3.0.0.tar.gz) contains these entries for those genes in the attribute column (simplified here for clarity):

gene_id "ENSG00000285292"; gene_version "1"; gene_name "ABCF2";
gene_id "ENSG00000033050"; gene_version "8"; gene_name "ABCF2";
gene_id "ENSG00000158427"; gene_version "14"; gene_name "TMSB15B";
gene_id "ENSG00000269226"; gene_version "7"; gene_name "TMSB15B";

Note that the names ABCF2.1 and TMSB15B.1 do not appear anywhere in the gtf file.
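To list every gene_name that is shared by more than one gene_id, something like the following could be used. It is a minimal sketch, assuming a standard tab-separated GTF with the attributes in the ninth column; the function names and the placeholder coordinates in the example rows are hypothetical, and only the IDs, versions and names come from the entries above.

```python
import re
from collections import defaultdict

# Matches attribute pairs of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn the ninth GTF column into a dict of key -> value."""
    return dict(ATTR_RE.findall(attr_field))


def duplicated_gene_names(gtf_lines):
    """Return {gene_name: [gene_id, ...]} for names carried by several gene_ids.

    gene_ids are listed in order of first appearance in the input.
    """
    ids_by_name = defaultdict(list)
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(fields[8])
        name, gene_id = attrs.get("gene_name"), attrs.get("gene_id")
        if name and gene_id and gene_id not in ids_by_name[name]:
            ids_by_name[name].append(gene_id)
    return {n: ids for n, ids in ids_by_name.items() if len(ids) > 1}


def make_row(gene_id, version, name):
    """Build an example GTF row; the first eight columns are placeholders."""
    attrs = f'gene_id "{gene_id}"; gene_version "{version}"; gene_name "{name}";'
    return "\t".join(["chrN", "src", "gene", "1", "2", ".", "+", ".", attrs])


example_rows = [
    make_row("ENSG00000285292", 1, "ABCF2"),
    make_row("ENSG00000033050", 8, "ABCF2"),
    make_row("ENSG00000158427", 14, "TMSB15B"),
    make_row("ENSG00000269226", 7, "TMSB15B"),
]
duplicates = duplicated_gene_names(example_rows)
```

On the real file, the same function can be fed an open file handle, e.g. `with open("genes.gtf") as fh: duplicated_gene_names(fh)`.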

The output reports counts for ABCF2, ABCF2.1, TMSB15B, TMSB15B.1 etc., but I can't tell which Ensembl gene_id each of these corresponds to. I want to compare the results with RNA-Seq data processed with STAR/RSEM using the same gtf, so I need to know which ENSG ID each name refers to.

I am wondering whether Cellranger reads the gtf sequentially and appends the suffix '.1' when a second entry with the same gene_name is encountered, such that:

ENSG00000285292 --> ABCF2
ENSG00000033050 --> ABCF2.1
ENSG00000158427 --> TMSB15B
ENSG00000269226 --> TMSB15B.1

or whether gene_version takes priority, with the lower version keeping the unsuffixed name. Note that for TMSB15B, version 14 precedes version 7 in the gtf, so under that rule the TMSB15B assignment would be reversed:

ENSG00000285292 --> ABCF2
ENSG00000033050 --> ABCF2.1
ENSG00000158427 --> TMSB15B.1
ENSG00000269226 --> TMSB15B
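The two candidate rules can be written out as a small sketch. This only illustrates the hypotheses above; it is not confirmed Cellranger behaviour, and the function name `suffix_in_order` is made up for the illustration. The first occurrence of a name keeps it, and later occurrences get `.1`, `.2`, and so on; the only difference between the two rules is the order in which the records are visited.

```python
from collections import defaultdict


def suffix_in_order(records):
    """Label (gene_id, gene_version, gene_name) records in the order given.

    The first record seen for a name keeps the bare name; later records
    with the same name get ".1", ".2", ... appended.
    """
    seen = defaultdict(int)
    labels = {}
    for gene_id, _version, name in records:
        count = seen[name]
        labels[gene_id] = name if count == 0 else f"{name}.{count}"
        seen[name] += 1
    return labels


# Records in the order they appear in the gtf excerpt above.
records = [
    ("ENSG00000285292", 1, "ABCF2"),
    ("ENSG00000033050", 8, "ABCF2"),
    ("ENSG00000158427", 14, "TMSB15B"),
    ("ENSG00000269226", 7, "TMSB15B"),
]

# Hypothesis 1: suffixes follow file order.
by_file_order = suffix_in_order(records)

# Hypothesis 2: within each name, the lower gene_version comes first.
by_version = suffix_in_order(sorted(records, key=lambda r: (r[2], r[1])))
```

The two dictionaries agree for ABCF2 and differ only for TMSB15B, which is why that gene is the informative case for telling the rules apart.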

Can anyone help clarify this?

Thank you

Cellranger • 984 views

Modern Cell Ranger outputs a features.tsv.gz file with three columns: the feature ID (gene_id), the gene name, and the feature type (the assay, e.g. Gene Expression). The gene names are not modified, since each row is already identified by a unique gene_id. Could your genes.tsv have been created in some downstream processing step?
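A quick way to check what a given feature file actually contains is sketched below. It assumes a tab-separated file, optionally gzip-compressed, and treats the first column as an ID column whenever every row has at least two columns, which is a heuristic rather than a guarantee. The helper names and the `ID_A` / `GENE_X` rows are hypothetical placeholders. A name counts as suffixed only if the name without its trailing `.N` is also present, because many Ensembl gene names legitimately end in `.1`.

```python
import csv
import gzip
import io
import re

SUFFIX_RE = re.compile(r"^(.*)\.(\d+)$")


def open_text(path):
    """Open a plain or gzip-compressed TSV file as text."""
    path = str(path)
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, newline="")


def inspect_features(handle):
    """Summarise the columns and any suffixed duplicate names in a feature table."""
    rows = [row for row in csv.reader(handle, delimiter="\t") if row]
    has_ids = all(len(row) >= 2 for row in rows)
    names = [row[1] if has_ids else row[0] for row in rows]
    present = set(names)
    suffixed = []
    for name in names:
        match = SUFFIX_RE.match(name)
        # Flag "X.1" only when the bare name "X" is also in the table.
        if match and match.group(1) in present:
            suffixed.append(name)
    return {
        "column_counts": sorted({len(row) for row in rows}),
        "has_id_column": has_ids,
        "suffixed_names": suffixed,
    }


# Three-column layout: duplicate names stay unmodified, IDs disambiguate.
three_col = io.StringIO(
    "ENSG00000285292\tABCF2\tGene Expression\n"
    "ENSG00000033050\tABCF2\tGene Expression\n"
)
three_col_summary = inspect_features(three_col)

# Hypothetical two-column table whose names were made unique downstream.
two_col = io.StringIO("ID_A\tGENE_X\nID_B\tGENE_X.1\n")
two_col_summary = inspect_features(two_col)
```

For a file on disk, `with open_text("genes.tsv") as fh: inspect_features(fh)` gives the same summary. If the first column does turn out to hold Ensembl IDs, the name-to-ID mapping can be read directly from the file.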

2.3 years ago by rpolicastro 13k

rpolicastro, thank you for that. Unfortunately, the data I have (downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE183904) contains only the output matrix.mtx, barcodes.tsv and genes.tsv for each sample.

The paper associated with the data archive states:

Cellranger v3.0 (https://support.10xgenomics.com/single-cell-gene-expression/software/) was used to align FASTQ sequencing reads to the hg38 reference transcriptome, generating single cell feature counts for each sample.

The cellranger release notes (https://support.10xgenomics.com/single-cell-gene-expression/software/release-notes/3-0) for 3.0 state:

The genes.tsv file has been renamed features.tsv.gz, and contains extra columns indicating the feature_type of each gene / feature.

This implies that the genes.tsv was not produced by cellranger 3.0 itself but was perhaps derived by the authors from features.tsv.gz, as you suggest. I will try to follow up with them.

Thanks again for your help.

2.3 years ago by markddesimone ▴ 60