Question: gffread error when extracting transcript sequences from gtf, coordinates exceed sequence length
gravatar for jjrin
3.0 years ago by
jjrin20 wrote:

Hello, I am trying to extract some transcript sequences from a stringtie merged gtf using gffread and am getting the following errors:

Error (GFaSeqGet): subsequence cannot be larger than 227
Error getting subseq for MSTRG.17.1 (1..229)!

This error happens for many of gene entries, some but not all transcripts end up getting extracted. The gtf file was created in StringTie using the same reference genome file as everything else (where my bam files are from, the reference annotation gtf file). This annotation is from running StringTie first and merging the gtf annotation files. The main problem seems to be that the end coordinates exceed sequence length when extracting transcripts even though I am sure that they were made correctly. If there is a way to just skip that particular gene that could work as well.

This is the command I used: gffread /Users/Desktop/StringTie_comp_annotation_new.gtf -g /Users/Desktop/Xla.v91.repeatMasked.fa -w transcripts.fasta

ADD COMMENTlink modified 3.0 years ago by victor.gambarini30 • written 3.0 years ago by jjrin20
gravatar for victor.gambarini
3.0 years ago by
victor.gambarini30 wrote:

Hi, jjrin.

I'm having a similar situation here. I have a BAM file from Bowtie2 and a BAM file from BWA. The Bowtie2 BAM file works fine but when I run gffread with the BWA BAM file I got the same error. I found in other threads that it may be related with soft-clipped reads extending beyond the contig.

Anyway, I got a workarround by just replacing the end position of the GTF file by the sequence size on my reference FASTA file. If this solution is a valid one to you, I let the script here:

ADD COMMENTlink written 3.0 years ago by victor.gambarini30

Thank you so much for this. I tried to run it but I get a key error which is probably due to my gtf file.

Traceback (most recent call last):
  File "/Users/margaretsaha/Desktop/", line 23, in <module>
    transcript_size = int(transcript[header])
KeyError: 'Scaffold100'

My gtf file includes scaffolds as well as chromosomes which may complicate this, however my fasta file also has these scaffolds. Is there a workaround where I can just skip the scaffolds or include them?

Here is the first line of my gtf file:

Scaffold100 StringTie   transcript  65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
Scaffold100 StringTie   exon    65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "1";
ADD REPLYlink written 3.0 years ago by jjrin20

It looks like the sequence names in the fasta file are different. Can I see the head of your fasta file?

ADD REPLYlink written 3.0 years ago by victor.gambarini30

My fasta file looks like this:


I have both a linearized and a line separated version of the fasta files.

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by jjrin20

Ok, you may try downloading the updated code and running it again.

Please, let me know if it works, I'm really curious about that.

ADD REPLYlink written 3.0 years ago by victor.gambarini30

It works great, thank you so much! The fixed annotation looks good (you have to remove mtDNA but I was going to do that anyways) and running through gffread is seamless. Thanks again for the help!

ADD REPLYlink modified 3.0 years ago • written 3.0 years ago by jjrin20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 699 users visited in the last hour