StringTie Error: no valid ID found for GFF record
11 months ago
1234gingko ▴ 10

Hi, I successfully aligned and analyzed my RNA-Seq data using HISAT2, StringTie and DESeq2, with the La_Amiga3_1 genome (white lupin) from NCBI to map transcripts. Beginner's luck. Now I am trying to do the exact same thing using the CNRS_Lalb genome (also white lupin, on NCBI), and when I get to the first StringTie step I get "Error: no valid ID found for GFF record". I have looked at the GTF files for both genomes, and the first field (chromosome ID) looks fine in each (cut -f 1 *.gtf | sort | uniq); the two assemblies name their chromosomes differently, but nothing looks wrong. I don't think that is the problem, so I am looking for more hints as to what this error means. I did read the StringTie manual but need more help. Thanks very much, K

RNA-Seq • 1.5k views

Omg, thanks so much. This enabled me to find a prior post, "Ensembl GTF format: isn't the tag "transcript_id" mandatory?",
in which Ensembl explains the evolution of the GTF format and suggests exactly what you suggest:
"I would recommend removing the gene lines from the gtf file". This gets me back on track so fast, I appreciate it! - Karen


Can you please post a couple of lines of the GTF file?


sure, thanks:

head -50 CN*/*.gtf
#gtf-version 2.2
#!genome-build CNRS_Lalb_1.0
#!genome-build-accession NCBI_Assembly:GCA_009771035.1
WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; 
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; 
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    1131    1283    .   -   .   gene_id "Lalb_Chr00c40g0409291"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409291"; 
WOCE01000065.1  Genbank exon    1131    1283    .   -   .   gene_id "Lalb_Chr00c40g0409291"; transcript_id "Lalb_Chr00c40g0409291"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409291"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    1698    1816    .   -   .   gene_id "Lalb_Chr00c40g0409301"; transcript_id ""; gbkey "Gene"; gene_biotype "rRNA"; locus_tag "Lalb_Chr00c40g0409301"; note "5s_rRNA"; 
WOCE01000065.1  Genbank exon    1698    1816    .   -   .   gene_id "Lalb_Chr00c40g0409301"; transcript_id "Lalb_Chr00c40g0409301"; gbkey "rRNA"; locus_tag "Lalb_Chr00c40g0409301"; product "5S ribosomal RNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2001    2152    .   -   .   gene_id "Lalb_Chr00c40g0409311"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409311"; 
WOCE01000065.1  Genbank exon    2001    2152    .   -   .   gene_id "Lalb_Chr00c40g0409311"; transcript_id "Lalb_Chr00c40g0409311"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409311"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2330    2481    .   -   .   gene_id "Lalb_Chr00c40g0409321"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409321"; 
WOCE01000065.1  Genbank exon    2330    2481    .   -   .   gene_id "Lalb_Chr00c40g0409321"; transcript_id "Lalb_Chr00c40g0409321"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409321"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2659    2810    .   -   .   gene_id "Lalb_Chr00c40g0409331"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409331";
11 months ago

My guess is it's those lines with transcript_id "": they don't contain a valid ID, and so StringTie is complaining. It's always a bit of a worry to work out what to do with the transcript_id field on gene lines in a GTF file. The original GTF format didn't contain gene lines, but they appear to have crept in at some point. The Ensembl files just don't have a transcript_id field on their gene lines, but I bet that trips StringTie up as well.
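
One way to check this guess before changing anything is to count which feature types carry an empty transcript_id. The sketch below is only an illustration: it builds a small hypothetical sample (shortened gene IDs "g1" and "g2", a temporary file) modelled on the lines posted above, and assumes the real file is tab-separated with the attributes in column 9. To run it on a real annotation, point the gtf variable at that file instead.

```shell
# Hypothetical four-record sample modelled on the posted lines (tab-separated).
tmpdir=$(mktemp -d)
gtf="$tmpdir/sample.gtf"
{
  printf '#gtf-version 2.2\n'
  printf 'WOCE01000065.1\tGenbank\tgene\t90\t241\t.\t-\t.\tgene_id "g1"; transcript_id ""; gbkey "Gene";\n'
  printf 'WOCE01000065.1\tGenbank\texon\t90\t241\t.\t-\t.\tgene_id "g1"; transcript_id "g1"; gbkey "ncRNA";\n'
  printf 'WOCE01000065.1\tGenbank\tgene\t417\t575\t.\t-\t.\tgene_id "g2"; transcript_id ""; gbkey "Gene";\n'
  printf 'WOCE01000065.1\tGenbank\texon\t417\t575\t.\t-\t.\tgene_id "g2"; transcript_id "g2"; gbkey "ncRNA";\n'
} > "$gtf"

# For every record whose attribute column (field 9) has an empty transcript_id,
# tally the feature type (field 3) and print one "feature count" line per type.
empty_by_feature=$(awk -F'\t' '$9 ~ /transcript_id "";/ { n[$3]++ } END { for (f in n) print f, n[f] }' "$gtf")
echo "$empty_by_feature"
```

If only gene records show up in the tally, as they do for this sample, that is consistent with the gene lines being what StringTie rejects.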

As for what to do: I recommend just removing the gene lines. They are not necessary anyway. Something like:

awk '$3 != "gene" ' my_annotation.gtf > my_annotation_no_genes.gtf
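
To confirm the filter did what was intended, it may help to check that no empty transcript_id remains afterwards. This is a minimal sketch on a hypothetical sample file (shortened IDs, temporary paths), not the real annotation. It uses -F'\t' so that awk splits on tabs only, which is a small assumption-driven variation on the command above and should behave the same on a well-formed GTF.

```shell
# Hypothetical sample modelled on the posted lines (tab-separated).
tmpdir=$(mktemp -d)
in_gtf="$tmpdir/sample.gtf"
out_gtf="$tmpdir/sample_no_genes.gtf"
{
  printf '#gtf-version 2.2\n'
  printf 'WOCE01000065.1\tGenbank\tgene\t90\t241\t.\t-\t.\tgene_id "g1"; transcript_id ""; gbkey "Gene";\n'
  printf 'WOCE01000065.1\tGenbank\texon\t90\t241\t.\t-\t.\tgene_id "g1"; transcript_id "g1"; gbkey "ncRNA";\n'
  printf 'WOCE01000065.1\tGenbank\tgene\t417\t575\t.\t-\t.\tgene_id "g2"; transcript_id ""; gbkey "Gene";\n'
  printf 'WOCE01000065.1\tGenbank\texon\t417\t575\t.\t-\t.\tgene_id "g2"; transcript_id "g2"; gbkey "ncRNA";\n'
} > "$in_gtf"

# Drop gene records; header lines starting with '#' have no third field and are kept.
awk -F'\t' '$3 != "gene"' "$in_gtf" > "$out_gtf"

# Count records that still carry an empty transcript_id.
# grep -c exits non-zero when the count is 0, hence the '|| true'.
empty_after=$(grep -c 'transcript_id "";' "$out_gtf" || true)
echo "records with empty transcript_id after filtering: $empty_after"
```

A count of 0 suggests the filtered file no longer contains the records that triggered the error; any other value would mean some non-gene feature also lacks a transcript ID and needs a separate look.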