StringTie error: no valid ID found for GFF record
3.2 years ago
1234gingko ▴ 50

Hi, I successfully aligned and analyzed my RNA-Seq data using HISAT2, StringTie, and DESeq2, with the La_Amiga3_1 genome (white lupin) from NCBI to map transcripts. Beginner's luck. Now I am trying to do exactly the same thing with the CNRS_Lalb genome (also white lupin, on NCBI), and at the first StringTie step I get "Error: no valid ID found for GFF record". I have looked at the GTF files for both genomes, and the first field (chromosome ID) looks fine in each (cut -f 1 *.gtf | sort | uniq). The two assemblies name their chromosomes differently, but I don't think that is the problem. I am looking for more hints as to what this error means. I did read the StringTie manual but need more help. Thanks very much, K

RNA-Seq • 7.6k views

OMG, thanks so much. This enabled me to find a prior post, "Ensembl GTF format: isn't the tag "transcript_id" mandatory?",
in which Ensembl explains the evolution of the GTF format and suggests exactly what you suggest:
"I would recommend removing the gene lines from the gtf file". This gets me back on track so fast, I appreciate it! - Karen


Can you please post a couple of lines of the GTF file?


Sure, thanks:

head -50 CN*/*.gtf
#gtf-version 2.2
#!genome-build CNRS_Lalb_1.0
#!genome-build-accession NCBI_Assembly:GCA_009771035.1
WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; 
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; 
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    1131    1283    .   -   .   gene_id "Lalb_Chr00c40g0409291"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409291"; 
WOCE01000065.1  Genbank exon    1131    1283    .   -   .   gene_id "Lalb_Chr00c40g0409291"; transcript_id "Lalb_Chr00c40g0409291"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409291"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    1698    1816    .   -   .   gene_id "Lalb_Chr00c40g0409301"; transcript_id ""; gbkey "Gene"; gene_biotype "rRNA"; locus_tag "Lalb_Chr00c40g0409301"; note "5s_rRNA"; 
WOCE01000065.1  Genbank exon    1698    1816    .   -   .   gene_id "Lalb_Chr00c40g0409301"; transcript_id "Lalb_Chr00c40g0409301"; gbkey "rRNA"; locus_tag "Lalb_Chr00c40g0409301"; product "5S ribosomal RNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2001    2152    .   -   .   gene_id "Lalb_Chr00c40g0409311"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409311"; 
WOCE01000065.1  Genbank exon    2001    2152    .   -   .   gene_id "Lalb_Chr00c40g0409311"; transcript_id "Lalb_Chr00c40g0409311"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409311"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2330    2481    .   -   .   gene_id "Lalb_Chr00c40g0409321"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409321"; 
WOCE01000065.1  Genbank exon    2330    2481    .   -   .   gene_id "Lalb_Chr00c40g0409321"; transcript_id "Lalb_Chr00c40g0409321"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409321"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    2659    2810    .   -   .   gene_id "Lalb_Chr00c40g0409331"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409331";
3.2 years ago

My guess is that it's those lines with transcript_id "": they don't contain a valid ID, so StringTie complains. It's always a bit of a worry to work out what to do with the transcript_id field on gene lines in a GTF file. The original GTF format didn't contain gene lines, but they appear to have crept in at some point. The Ensembl files just don't have a transcript_id field on their gene lines, but I bet that trips StringTie up as well.
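
A quick way to check this guess is to count, per feature type, the records whose transcript_id is empty. The following is a minimal sketch, assuming a tab-separated GTF like the one posted above; the file name sample.gtf and the function name count_empty_tx are hypothetical, and the two records are trimmed copies of the ones in the question.

```shell
# Build a tiny tab-separated sample (hypothetical file name: sample.gtf).
# printf reuses the 9-field format for each group of 9 arguments.
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  WOCE01000065.1 Genbank gene 90 241 . - . \
  'gene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene";' \
  WOCE01000065.1 Genbank exon 90 241 . - . \
  'gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA";' \
  > sample.gtf

# Count records with an empty transcript_id, grouped by feature type (column 3).
# Comment lines (starting with '#') are skipped.
count_empty_tx() {
  awk -F'\t' '!/^#/ && $9 ~ /transcript_id ""/ { n[$3]++ }
              END { for (f in n) print f, n[f] }' "$1"
}

count_empty_tx sample.gtf   # prints: gene 1
```

On the real annotation, pass the path of the downloaded GTF to count_empty_tx instead of sample.gtf. If only gene records are reported, that is consistent with the explanation above.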

As for what to do: I recommend just removing the gene lines. They are not necessary anyway. Something like:

awk '$3 != "gene" ' my_annotation.gtf > my_annotation_no_genes.gtf
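
A slightly more defensive variant of the command above splits on tabs only, keeps header lines explicitly, and then checks the result. This is a sketch rather than a tested pipeline step: the input built here is a hypothetical three-line file with trimmed copies of the records posted in the thread, and the file names simply follow the ones used above.

```shell
# Hypothetical input: one header line, one gene record, one exon record.
{
  printf '#gtf-version 2.2\n'
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    WOCE01000065.1 Genbank gene 90 241 . - . \
    'gene_id "Lalb_Chr00c40g0409271"; transcript_id "";' \
    WOCE01000065.1 Genbank exon 90 241 . - . \
    'gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271";'
} > my_annotation.gtf

# Keep header lines and every record whose feature column is not "gene".
awk -F'\t' '/^#/ || $3 != "gene"' my_annotation.gtf > my_annotation_no_genes.gtf

# Sanity checks: the exon count should be unchanged and no gene records should remain.
count_feature() {
  awk -F'\t' -v f="$1" '$3 == f { n++ } END { print n + 0 }' "$2"
}
exons_before=$(count_feature exon my_annotation.gtf)
exons_after=$(count_feature exon my_annotation_no_genes.gtf)
genes_after=$(count_feature gene my_annotation_no_genes.gtf)
echo "exons before: $exons_before, after: $exons_after, gene records left: $genes_after"
```

If the exon counts differ, or gene records remain, the file is probably not tab-separated in the way assumed here.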
2.0 years ago
bio • 0

Hi! I have the same problem, and I don't know how to fix it.

2.0 years ago
Juke34 8.5k

You can try AGAT

Input:

WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; 
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA"; exon_number "1"; 
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id ""; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; 
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA"; exon_number "1"

Remove the transcript_id attribute from gene features:
agat_sp_manage_attributes.pl --gff test.gtf -p gene --att transcript_id -o test.gff

Output:

##gff-version 3
WOCE01000065.1  Genbank gene    90  241 .   -   .   ID=nbis-gene-1;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409271;locus_tag=Lalb_Chr00c40g0409271
WOCE01000065.1  Genbank RNA 90  241 .   -   .   ID=Lalb_Chr00c40g0409271;Parent=nbis-gene-1;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409271;locus_tag=Lalb_Chr00c40g0409271
WOCE01000065.1  Genbank exon    90  241 .   -   .   ID=exon-1;Parent=Lalb_Chr00c40g0409271;exon_number=1;gbkey=ncRNA;gene_id=Lalb_Chr00c40g0409271;locus_tag=Lalb_Chr00c40g0409271;product=hypothetical ncRNA;transcript_id=Lalb_Chr00c40g0409271
WOCE01000065.1  Genbank gene    417 575 .   -   .   ID=nbis-gene-2;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409281;locus_tag=Lalb_Chr00c40g0409281
WOCE01000065.1  Genbank RNA 417 575 .   -   .   ID=Lalb_Chr00c40g0409281;Parent=nbis-gene-2;gbkey=Gene;gene_biotype=ncRNA;gene_id=Lalb_Chr00c40g0409281;locus_tag=Lalb_Chr00c40g0409281
WOCE01000065.1  Genbank exon    417 575 .   -   .   ID=exon-2;Parent=Lalb_Chr00c40g0409281;exon_number=1;gbkey=ncRNA;gene_id=Lalb_Chr00c40g0409281;locus_tag=Lalb_Chr00c40g0409281;product=hypothetical ncRNA;transcript_id=Lalb_Chr00c40g0409281

Convert into GTF:
agat_convert_sp_gff2gtf.pl --gff test.gff -o test_clean.gtf

Output:

##gtf-version 3
WOCE01000065.1  Genbank gene    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; ID "nbis-gene-1"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271";
WOCE01000065.1  Genbank transcript  90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; ID "Lalb_Chr00c40g0409271"; Parent "nbis-gene-1"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; original_biotype "rna";
WOCE01000065.1  Genbank exon    90  241 .   -   .   gene_id "Lalb_Chr00c40g0409271"; transcript_id "Lalb_Chr00c40g0409271"; ID "exon-1"; Parent "Lalb_Chr00c40g0409271"; exon_number "1"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409271"; product "hypothetical ncRNA";
WOCE01000065.1  Genbank gene    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; ID "nbis-gene-2"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281";
WOCE01000065.1  Genbank transcript  417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; ID "Lalb_Chr00c40g0409281"; Parent "nbis-gene-2"; gbkey "Gene"; gene_biotype "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; original_biotype "rna";
WOCE01000065.1  Genbank exon    417 575 .   -   .   gene_id "Lalb_Chr00c40g0409281"; transcript_id "Lalb_Chr00c40g0409281"; ID "exon-2"; Parent "Lalb_Chr00c40g0409281"; exon_number "1"; gbkey "ncRNA"; locus_tag "Lalb_Chr00c40g0409281"; product "hypothetical ncRNA";