Question

How do I match my transcript ID's from NCBI to the corresponding gene ID's to enable tximport into R?

0

Entering edit mode

2.7 years ago

robeaumont • 0

Hi all,

New to RNA-Seq analysis and I tried to find an answer to this elsewhere. I have performed salmon alignment on my pair-end fastq files which has generated the quant.sf files for all my samples. I now need to import into R with tximport which requires a dataframe with transcript ID's and the corresponding gene ID's. My transcriptomic .fasta reference file is from NCBI which does not contain the gene ID's, only the NM_transcript identifiers.

How do I find the gene ID's I need, and in the correct order? The reference transcriptome I have been working from is Thoroughbred racehorse (EquCab3.0).

Thanks in advance!

Salmon tximport RNA-Seq R • 3.4k views

ADD COMMENT • link updated 21 months ago by GenoMax 141k • written 2.7 years ago by robeaumont • 0

0

Entering edit mode

Gene names appear to be in the fasta headers. In () brackets.

$ zmore GCF_002863925.1_EquCab3.0_rna.fna.gz | grep "^>" | head -10
>NM_001081757.1 Equus caballus 2'-5'-oligoadenylate synthetase 3 (OAS3), mRNA
>NM_001081758.1 Equus caballus myosin heavy chain 7 (MYH7), mRNA
>NM_001081759.1 Equus caballus myosin heavy chain 1 (MYH1), mRNA
>NM_001081760.1 Equus caballus myosin heavy chain 2 (MYH2), mRNA
>NM_001081761.1 Equus caballus sodium voltage-gated channel alpha subunit 4 (SCN4A), mRNA
>NM_001081762.1 Equus caballus eukaryotic translation initiation factor 4 gamma 3 (EIF4G3), mRNA
>NM_001081763.1 Equus caballus ATP binding cassette subfamily C member 1 (ABCC1), mRNA
>NM_001081764.1 Equus caballus collagen type II alpha 1 chain (COL2A1), mRNA
>NM_001081765.1 Equus caballus ATPase sarcoplasmic/endoplasmic reticulum Ca2+ transporting 2 (ATP2A2), mRNA
>NM_001081766.1 Equus caballus solute carrier family 4 member 2 (SLC4A2), mRNA

Edit: Removing the code because of the problem noted by @ATPoint below. Parsing should still be feasible but will need a program.

ADD REPLY • link 2.7 years ago by GenoMax 141k

0

Entering edit mode

Careful with that, I came across situations where RefSeq has brackets in those headers as part of the gene description rather than for delimiting the gene name, e.g. (dummy example):

>NM_123.4 My Organism desoxygenase (fancy β-subunit) synthetase 3 (OAS3), mRNA

and there goes parsing these gene names...

Edit:

Here for example (real data):

>NM_001081881.1 Equus caballus granzyme B (granzyme 2, cytotoxic T-lymphocyte-associated serine esterase 1) (GZMB), mRNA

ADD REPLY • link 2.7 years ago by ATpoint 82k

GenoMax · Answer 1 · 2021-08-16

1

Entering edit mode

2.7 years ago

h.mon 35k

One way is to use makeTxDbFromGFF() (from the package GenomicFeatures) to build a TxDb from the EquCab gff3 or gtf annotation, then proceed as indicated in the tximport documentation:

library( GenomicFeatures )
txdb <- makeTxDbFromGFF( file = "/path/to/GCF_002863925.1_EquCab3.0_genomic.gtf.gz", format = "gtf" )
k <- keys( txdb, keytype = "TXNAME" )
tx2gene <- select( txdb, k, "GENEID", "TXNAME" )

ADD COMMENT • link 2.7 years ago by h.mon 35k

0

Entering edit mode

Hi h.mon, thanks for the suggestion, this worked nicely for me. :) I downloaded the RefSeq transcript fasta and RefSeq gff file from NCBI (https://www.ncbi.nlm.nih.gov/genome/guide/human/); did my Salmon quantification with the RefSeq fasta and created the tx2gene file with the RefSeq gff, as proposed by you. After running tximport

txi <- tximport(files, type="salmon", tx2gene=tx2gene, ignoreTxVersion= TRUE, countsFromAbundance="lengthScaledTPM")

I get an error:

reading in files with read_tsv 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15  removing duplicated transcript rows from tx2gene Error in .local(object, ...) :    None of the transcripts in the quantification files are present   in the first column of tx2gene. Check to see that you are using   the same annotation for both.

Example IDs (file): [NM_000014, NM_000015, NM_000016, ...]

Example IDs (tx2gene): [NR_046018.2, XR_007065314.1, NR_036051.1, ...]

  This can sometimes (not always) be fixed using 'ignoreTxVersion' or 'ignoreAfterBar'.

If I use ignoreAfterBar instead of ignoreTxVersion, I don´t get the error anymore, but "transcripts missing from tx2gene: 78557". Would you have an explanation, how this happens or what I can do to prevent it? The annotation file should have all transcript accessions, or not?

Thanks a lot in advance :)

ADD REPLY • link updated 21 months ago by GenoMax 141k • written 21 months ago by ella • 0

score 1 · Answer 2 · 2021-08-16

1

Entering edit mode

2.7 years ago

ATpoint 82k

Usually one parses such a table from a reference GTF file. With your genome I'd do something like:

library(rtracklayer)

#/ download: wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/863/925/GCF_002863925.1_EquCab3.0/GCF_002863925.1_EquCab3.0_genomic.gtf.gz
gtf <- rtracklayer::import("~/GCF_002863925.1_EquCab3.0_genomic.gtf.gz")

#/ get a unique transcript2gene mapping table.
tx2gene <- unique(data.frame(gtf[gtf$type=="exon"])[,c("transcript_id", "gene_id")])

> head(tx2gene)


|   |transcript_id  |gene_id      |
|:--|:--------------|:------------|
|1  |XM_005602135.3 |SYCE1        |
|14 |XM_005602137.3 |SYCE1        |
|26 |XM_023636208.1 |LOC111772506 |
|33 |XM_023636216.1 |LOC111772506 |
|39 |XM_023636230.1 |LOC111772506 |
|45 |XM_023636202.1 |LOC111772506 |

ADD COMMENT • link 2.7 years ago by ATpoint 82k

0

Entering edit mode

Super minor point but you could filter for 'transcript' type to avoid having to call unique.

ADD REPLY • link 2.7 years ago by rpolicastro 13k

0

Entering edit mode

I know, but this type does not exist in this GTF, that is why I use the exon workaround :)

> unique(gtf$type)
[1] gene        exon        CDS         start_codon stop_codon 
Levels: gene exon CDS start_codon stop_codon

ADD REPLY • link 2.7 years ago by ATpoint 82k

0

Entering edit mode

Haha, figures. 'transcript' is supposed to be defined in GTF files but everyone sort of does their own thing.

ADD REPLY • link 2.7 years ago by rpolicastro 13k

0

Entering edit mode

It's an NCBI thing apparently, same with these gene names in brackets rather than using unique delimiters in their fasta files as GENCODE (|) does.