I am working with data previously processed with Cellranger v3.0; I do not have the original fastq files. For 21 genes, the genes.tsv file contains two versions of the same gene name, e.g. ABCF2 and ABCF2.1, or TMSB15B and TMSB15B.1.
When I look at the gtf file used by cellranger (called genes.gtf and downloaded from https://cf.10xgenomics.com/supp/cell-exp/refdata-cellranger-GRCh38-3.0.0.tar.gz) it contains these entries for those genes in the info column (which I've simplified for clarity here):
gene_id "ENSG00000285292"; gene_version "1"; gene_name "ABCF2"
gene_id "ENSG00000033050"; gene_version "8"; gene_name "ABCF2"
gene_id "ENSG00000158427"; gene_version "14"; gene_name "TMSB15B"
gene_id "ENSG00000269226"; gene_version "7"; gene_name "TMSB15B"
Note that ABCF2.1 and TMSB15B.1 do not appear in the gtf file.
Cellranger reports counts for ABCF2, ABCF2.1, TMSB15B, TMSB15B.1 etc., but I can't tell which Ensembl gene_id each of these corresponds to. I want to compare the results with RNA-Seq data processed with STAR/RSEM using the same gtf, so I'd like to know which ENSG each name corresponds to.
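To make the problem concrete, here is a minimal sketch of how the ambiguous names could be listed from the gtf. It assumes the standard 9-column, tab-separated GTF layout with `key "value";` attributes; the function names are my own and not part of Cellranger.

```python
import re

# Matches attribute pairs such as: gene_id "ENSG00000285292";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(info):
    """Turn the GTF attribute column into a dict of key -> value."""
    return dict(ATTR_RE.findall(info))


def duplicated_gene_names(gtf_lines):
    """Return {gene_name: [gene_id, ...]} for names with more than one gene_id.

    The gene_ids are kept in the order they first appear in the file.
    """
    ids_by_name = {}
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a well-formed GTF record
        attrs = parse_attributes(fields[8])
        name = attrs.get("gene_name")
        gene_id = attrs.get("gene_id")
        if name is None or gene_id is None:
            continue
        ids = ids_by_name.setdefault(name, [])
        if gene_id not in ids:
            ids.append(gene_id)
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
```

It could be run on the reference with something like `with open("genes.gtf") as fh: print(duplicated_gene_names(fh))`, which should report the gene names that need disambiguating together with their gene_ids in file order.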
I am wondering if Cellranger reads the gtf sequentially and appends the suffix '.1' when a second entry with the same gene_name is encountered, such that:
ENSG00000285292 --> ABCF2
ENSG00000033050 --> ABCF2.1
ENSG00000158427 --> TMSB15B
ENSG00000269226 --> TMSB15B.1
Or does gene_version take priority, with the lower version getting the unsuffixed name? Note that for TMSB15B, version 14 precedes version 7 in the gtf, so under that rule the TMSB15B assignment would be reversed:
ENSG00000285292 --> ABCF2
ENSG00000033050 --> ABCF2.1
ENSG00000158427 --> TMSB15B.1
ENSG00000269226 --> TMSB15B
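To spell out the two candidate rules, here is a small sketch that applies each to the four entries above. Both rules are only my guesses and neither is confirmed Cellranger behaviour; the function names are made up for illustration.

```python
# (gene_id, gene_version, gene_name) in the order they appear in the gtf
RECORDS = [
    ("ENSG00000285292", 1, "ABCF2"),
    ("ENSG00000033050", 8, "ABCF2"),
    ("ENSG00000158427", 14, "TMSB15B"),
    ("ENSG00000269226", 7, "TMSB15B"),
]


def label_in_given_order(records):
    """Guess 1: the first entry keeps the plain name, later ones get .1, .2, ..."""
    seen = {}
    labels = {}
    for gene_id, _version, name in records:
        count = seen.get(name, 0)
        labels[gene_id] = name if count == 0 else f"{name}.{count}"
        seen[name] = count + 1
    return labels


def label_by_version(records):
    """Guess 2: the entry with the lowest gene_version keeps the plain name."""
    ordered = sorted(records, key=lambda rec: rec[1])  # stable sort by version
    return label_in_given_order(ordered)


if __name__ == "__main__":
    print(label_in_given_order(RECORDS))
    print(label_by_version(RECORDS))
```

The two functions agree for ABCF2 and disagree for TMSB15B, which is exactly the ambiguity I'm trying to resolve.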
Can anyone help clarify this?