Generating a transcript-to-gene mapping from an unhelpful GFF3 file
0
0
11 weeks ago
thomas.welch ▴ 50

Hi there,

I am wondering if anyone could offer a tip on generating the transcript-to-gene map file needed for using salmon to quantify RNA-seq data against a reference transcriptome?

I would like to do this with the QUT Nicotiana benthamiana reference transcriptome. However, the way the GFF3 annotation file is constructed makes this impossible with the BUSpaRse package, and there is no GTF file in which "transcript_id" and "gene_id" are helpfully specified.

In the attributes column of the GFF3 file, it's not obvious to me which tag denotes the transcript and which denotes the gene. But I'm guessing that (for my purposes at least) "Nbv5tr6198039.mrna1", for example, may be considered the transcript ID, while "Nbv5tr6198039" may be considered the gene ID. Please see some example lines from the GFF3 file below.

Nbv0.5scaffold4004  Nbdbv05 gene    109116  109315  .   -   .   ID=Nbv5tr6198039.path1;Name=not determined by homology or low homology during annotation
Nbv0.5scaffold4004  Nbdbv05 mRNA    109116  109315  .   -   .   ID=Nbv5tr6198039.mrna1;Name=Nbv5tr6198039;Parent=Nbv5tr6198039.path1;coverage=100.0;identity=100.0
Nbv0.5scaffold4004  Nbdbv05 CDS 109168  109314  100 -   0   ID=Nbv5tr6198039.mrna1.cds1;Name=Nbv5tr6198039;Parent=Nbv5tr6198039.mrna1;Target=Nbv5tr6198039 2 148 +


Thanks in advance for any help.

salmon RNAseq transcriptome gff3 • 259 views
1

Yes, the value of the ID attribute of the gene feature can be considered the gene_id, and the value of the ID attribute of the mRNA feature can be considered the transcript_id.
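
In case it helps, here is a minimal Python sketch of that idea: it takes each mRNA feature and pairs its ID (transcript) with its Parent (gene). It assumes the real file is tab-delimited with nine columns, as GFF3 normally is, and that every mRNA line carries both ID and Parent. The function names are my own and do not come from any package.

```python
def parse_attributes(field):
    """Split a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def tx2gene_from_gff3(lines):
    """Yield (transcript_id, gene_id) pairs taken from mRNA features."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments/directives
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue  # only mRNA features define transcripts here
        attrs = parse_attributes(cols[8])
        if "ID" in attrs and "Parent" in attrs:
            yield attrs["ID"], attrs["Parent"]


# The gene and mRNA lines from the question, rebuilt with tab separators.
example = [
    "\t".join(["Nbv0.5scaffold4004", "Nbdbv05", "gene", "109116", "109315",
               ".", "-", ".",
               "ID=Nbv5tr6198039.path1;Name=not determined by homology "
               "or low homology during annotation"]),
    "\t".join(["Nbv0.5scaffold4004", "Nbdbv05", "mRNA", "109116", "109315",
               ".", "-", ".",
               "ID=Nbv5tr6198039.mrna1;Name=Nbv5tr6198039;"
               "Parent=Nbv5tr6198039.path1;coverage=100.0;identity=100.0"]),
]

pairs = list(tx2gene_from_gff3(example))
for transcript_id, gene_id in pairs:
    print(f"{transcript_id}\t{gene_id}")
```

For a real file you would pass an open file handle instead of the `example` list and write the pairs out as a two-column, tab-separated table.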

If you need a GTF file, you can convert your GFF file using one of these tools: https://agat.readthedocs.io/en/latest/gff_to_gtf.html

0

Thank you very much for your help. I have now converted the GFF3 file into GTF format using gffread. However, before I use this GTF file to generate my transcript-to-gene mapping file (for use with salmon and eventual spliceosomal analysis), I think it may need further modification.

Lines in the GTF file currently look like this:

Nbv0.5scaffold427   Nbdbv05 transcript  69130   80827   .   +   .   transcript_id "Nbv5tr6227715.mrna1"; gene_id "Nbv5tr6227715.path1";
Nbv0.5scaffold51    Nbdbv05 transcript  325541  326260  .   +   .   transcript_id "Nbv5tr6227715.mrna2"; gene_id "Nbv5tr6227715.path2";


The gene_id is given a different extension (".pathX") depending on the transcript. Am I right in thinking this should not be the case, and that different transcript IDs of the same gene should map to exactly the same gene_id? If so, should I strip the ".pathX" extension from the gene_id values in the file?

TL;DR: should I modify my newly generated GTF so that the lines above look more like the lines below?

Nbv0.5scaffold427   Nbdbv05 transcript  69130   80827   .   +   .   transcript_id "Nbv5tr6227715.mrna1"; gene_id "Nbv5tr6227715";
Nbv0.5scaffold51    Nbdbv05 transcript  325541  326260  .   +   .   transcript_id "Nbv5tr6227715.mrna2"; gene_id "Nbv5tr6227715";

1

I would not make this modification; otherwise you risk merging different genes together. Judging by the locations of mrna1 and mrna2 (they sit on different scaffolds, Nbv0.5scaffold427 and Nbv0.5scaffold51), I don't think they are part of the same gene.
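
If you want to see how often this would happen across the whole file, a rough Python sketch along these lines could help. It groups transcript lines by the gene_id with ".pathN" stripped and reports every stripped ID whose transcripts land on more than one scaffold. It assumes a tab-delimited GTF with "transcript" features and quoted attributes like the lines you posted; the function names are made up for illustration.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def scaffolds_by_stripped_gene(gtf_lines):
    """Map each gene_id (with its .pathN suffix removed) to its scaffolds."""
    groups = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "transcript":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        gene_id = attrs.get("gene_id")
        if gene_id is None:
            continue
        stripped = re.sub(r"\.path\d+$", "", gene_id)
        groups[stripped].add(cols[0])
    return groups


def multi_scaffold_genes(gtf_lines):
    """Return stripped gene IDs whose transcripts span several scaffolds."""
    groups = scaffolds_by_stripped_gene(gtf_lines)
    return {g: sorted(s) for g, s in groups.items() if len(s) > 1}


# The two transcript lines from the thread, rebuilt with tab separators.
example = [
    "\t".join(["Nbv0.5scaffold427", "Nbdbv05", "transcript", "69130", "80827",
               ".", "+", ".",
               'transcript_id "Nbv5tr6227715.mrna1"; '
               'gene_id "Nbv5tr6227715.path1";']),
    "\t".join(["Nbv0.5scaffold51", "Nbdbv05", "transcript", "325541", "326260",
               ".", "+", ".",
               'transcript_id "Nbv5tr6227715.mrna2"; '
               'gene_id "Nbv5tr6227715.path2";']),
]

conflicts = multi_scaffold_genes(example)
print(conflicts)
```

Any ID that shows up in `conflicts` is one where stripping the suffix would merge loci from different scaffolds into a single gene_id. This only checks the scaffold name, so two paths on the same scaffold but far apart would not be flagged.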