How do I match my transcript ID's from NCBI to the corresponding gene ID's to enable tximport into R?
2
0
Entering edit mode
5 months ago
robeaumont • 0

Hi all,

New to RNA-Seq analysis and I tried to find an answer to this elsewhere. I have performed salmon alignment on my pair-end fastq files which has generated the quant.sf files for all my samples. I now need to import into R with tximport which requires a dataframe with transcript ID's and the corresponding gene ID's. My transcriptomic .fasta reference file is from NCBI which does not contain the gene ID's, only the NM_transcript identifiers.

How do I find the gene ID's I need, and in the correct order? The reference transcriptome I have been working from is Thoroughbred racehorse (EquCab3.0).

Thanks in advance!

Salmon tximport RNA-Seq R • 725 views
ADD COMMENT
0
Entering edit mode

Gene names appear to be in the fasta headers. In () brackets.

$ zmore GCF_002863925.1_EquCab3.0_rna.fna.gz | grep "^>" | head -10
>NM_001081757.1 Equus caballus 2'-5'-oligoadenylate synthetase 3 (OAS3), mRNA
>NM_001081758.1 Equus caballus myosin heavy chain 7 (MYH7), mRNA
>NM_001081759.1 Equus caballus myosin heavy chain 1 (MYH1), mRNA
>NM_001081760.1 Equus caballus myosin heavy chain 2 (MYH2), mRNA
>NM_001081761.1 Equus caballus sodium voltage-gated channel alpha subunit 4 (SCN4A), mRNA
>NM_001081762.1 Equus caballus eukaryotic translation initiation factor 4 gamma 3 (EIF4G3), mRNA
>NM_001081763.1 Equus caballus ATP binding cassette subfamily C member 1 (ABCC1), mRNA
>NM_001081764.1 Equus caballus collagen type II alpha 1 chain (COL2A1), mRNA
>NM_001081765.1 Equus caballus ATPase sarcoplasmic/endoplasmic reticulum Ca2+ transporting 2 (ATP2A2), mRNA
>NM_001081766.1 Equus caballus solute carrier family 4 member 2 (SLC4A2), mRNA

Edit: Removing the code because of the problem noted by @ATPoint below. Parsing should still be feasible but will need a program.

ADD REPLY
0
Entering edit mode

Careful with that, I came across situations where RefSeq has brackets in those headers as part of the gene description rather than for delimiting the gene name, e.g. (dummy example):

>NM_123.4 My Organism desoxygenase (fancy β-subunit) synthetase 3 (OAS3), mRNA

and there goes parsing these gene names...

Edit:

Here for example (real data):

>NM_001081881.1 Equus caballus granzyme B (granzyme 2, cytotoxic T-lymphocyte-associated serine esterase 1) (GZMB), mRNA
ADD REPLY
1
Entering edit mode
5 months ago
ATpoint 57k

Usually one parses such a table from a reference GTF file. With your genome I'd do something like:

library(rtracklayer)

#/ download: wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/863/925/GCF_002863925.1_EquCab3.0/GCF_002863925.1_EquCab3.0_genomic.gtf.gz
gtf <- rtracklayer::import("~/GCF_002863925.1_EquCab3.0_genomic.gtf.gz")

#/ get a unique transcript2gene mapping table.
tx2gene <- unique(data.frame(gtf[gtf$type=="exon"])[,c("transcript_id", "gene_id")])

> head(tx2gene)


|   |transcript_id  |gene_id      |
|:--|:--------------|:------------|
|1  |XM_005602135.3 |SYCE1        |
|14 |XM_005602137.3 |SYCE1        |
|26 |XM_023636208.1 |LOC111772506 |
|33 |XM_023636216.1 |LOC111772506 |
|39 |XM_023636230.1 |LOC111772506 |
|45 |XM_023636202.1 |LOC111772506 |
ADD COMMENT
0
Entering edit mode

Super minor point but you could filter for 'transcript' type to avoid having to call unique.

ADD REPLY
0
Entering edit mode

I know, but this type does not exist in this GTF, that is why I use the exon workaround :)

> unique(gtf$type)
[1] gene        exon        CDS         start_codon stop_codon 
Levels: gene exon CDS start_codon stop_codon
ADD REPLY
0
Entering edit mode

Haha, figures. 'transcript' is supposed to be defined in GTF files but everyone sort of does their own thing.

ADD REPLY
0
Entering edit mode

It's an NCBI thing apparently, same with these gene names in brackets rather than using unique delimiters in their fasta files as GENCODE (|) does.

ADD REPLY
0
Entering edit mode

Great, thanks! I'll try it out.

ADD REPLY
0
Entering edit mode
5 months ago
h.mon 33k

One way is to use makeTxDbFromGFF() (from the package GenomicFeatures) to build a TxDb from the EquCab gff3 or gtf annotation, then proceed as indicated in the tximport documentation:

library( GenomicFeatures )
txdb <- makeTxDbFromGFF( file = "/path/to/GCF_002863925.1_EquCab3.0_genomic.gtf.gz", format = "gtf" )
k <- keys( txdb, keytype = "TXNAME" )
tx2gene <- select( txdb, k, "GENEID", "TXNAME" )
ADD COMMENT

Login before adding your answer.

Traffic: 2317 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6