Question: gffread error when extracting transcript sequences from gtf, coordinates exceed sequence length
0
gravatar for jjrin
20 months ago by
jjrin10
jjrin10 wrote:

Hello, I am trying to extract some transcript sequences from a stringtie merged gtf using gffread and am getting the following errors:

Error (GFaSeqGet): subsequence cannot be larger than 227
Error getting subseq for MSTRG.17.1 (1..229)!

This error happens for many of gene entries, some but not all transcripts end up getting extracted. The gtf file was created in StringTie using the same reference genome file as everything else (where my bam files are from, the reference annotation gtf file). This annotation is from running StringTie first and merging the gtf annotation files. The main problem seems to be that the end coordinates exceed sequence length when extracting transcripts even though I am sure that they were made correctly. If there is a way to just skip that particular gene that could work as well.

This is the command I used: gffread /Users/Desktop/StringTie_comp_annotation_new.gtf -g /Users/Desktop/Xla.v91.repeatMasked.fa -w transcripts.fasta

ADD COMMENTlink modified 20 months ago by victor.gambarini30 • written 20 months ago by jjrin10
2
gravatar for victor.gambarini
20 months ago by
victor.gambarini30 wrote:

Hi, jjrin.

I'm having a similar situation here. I have a BAM file from Bowtie2 and a BAM file from BWA. The Bowtie2 BAM file works fine but when I run gffread with the BWA BAM file I got the same error. I found in other threads that it may be related with soft-clipped reads extending beyond the contig.

Anyway, I got a workarround by just replacing the end position of the GTF file by the sequence size on my reference FASTA file. If this solution is a valid one to you, I let the script here: gtf_fixer_togffread.py

ADD COMMENTlink written 20 months ago by victor.gambarini30

Thank you so much for this. I tried to run it but I get a key error which is probably due to my gtf file.

Traceback (most recent call last):
  File "/Users/margaretsaha/Desktop/gtf_fixer_to_gffread.py", line 23, in <module>
    transcript_size = int(transcript[header])
KeyError: 'Scaffold100'

My gtf file includes scaffolds as well as chromosomes which may complicate this, however my fasta file also has these scaffolds. Is there a workaround where I can just skip the scaffolds or include them?

Here is the first line of my gtf file:

Scaffold100 StringTie   transcript  65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
Scaffold100 StringTie   exon    65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "1";
ADD REPLYlink written 20 months ago by jjrin10

It looks like the sequence names in the fasta file are different. Can I see the head of your fasta file?

ADD REPLYlink written 20 months ago by victor.gambarini30

My fasta file looks like this:

>Scaffold100
gttcaaactttcataacctgccaaattttgtaaaatgaacatggtgtggccacaaaaatggtcgtggtcaaaaaattcac
tgcgcgcaagtttttttgtccctctttttatttccaaaatgttgggaggtATttagacatttcacatttatattctcata
taccacactttccacattacacaATCTTCAGCCACCTTATAAATGGAACCTGCCGGTGCTACTGCTTTACATTGCTCTCT
CTTTATATAAATGAATGAGGCAATGTGATAGTGGAAATGTGTCCGATATAATGTAATTATAAGGTTCGACACTTAGAATA
AGAGAAGGAAAACTCTTCTGCATGTGATGAGAATCCTGCAGTGTGAGATGTGGCCTCAAGGGGGAGACATTTCCTTCACC
AAATTATCCCAGAGAAAGATCATTTTTATTCTGAGTTGGAAGCTGAGGAACGCGTTTCGTGTTTAATTCCATTTAAAAAA
AAAAACTTTTAAAAGGCGATAAAACGAATCTCTTGCAAATTTCCTATGTCCAACTGCCCTGTGATTCTATTGGGGGGGTG
TCCTCCCATGGGAACTGCCTGGCGACAGATTGAGGACGAGGTTTTTCATGGGGAATTTAAAACAATTTATCAACAGAAAC
AAAAGTGTGAACGTGTTATTTTTAGTAAACCTGCATTATAGGGGTggttcacctttaattaaactagtatgctatagaat

I have both a linearized and a line separated version of the fasta files.

ADD REPLYlink modified 20 months ago • written 20 months ago by jjrin10
1

Ok, you may try downloading the updated code and running it again.

Please, let me know if it works, I'm really curious about that.

ADD REPLYlink written 20 months ago by victor.gambarini30

It works great, thank you so much! The fixed annotation looks good (you have to remove mtDNA but I was going to do that anyways) and running through gffread is seamless. Thanks again for the help!

ADD REPLYlink modified 20 months ago • written 20 months ago by jjrin10
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 789 users visited in the last hour