gffread error when extracting transcript sequences from gtf, coordinates exceed sequence length
1
1
Entering edit mode
4.2 years ago
jjrin ▴ 40

Hello, I am trying to extract some transcript sequences from a stringtie merged gtf using gffread and am getting the following errors:

Error (GFaSeqGet): subsequence cannot be larger than 227
Error getting subseq for MSTRG.17.1 (1..229)!

This error happens for many of gene entries, some but not all transcripts end up getting extracted. The gtf file was created in StringTie using the same reference genome file as everything else (where my bam files are from, the reference annotation gtf file). This annotation is from running StringTie first and merging the gtf annotation files. The main problem seems to be that the end coordinates exceed sequence length when extracting transcripts even though I am sure that they were made correctly. If there is a way to just skip that particular gene that could work as well.

This is the command I used: gffread /Users/Desktop/StringTie_comp_annotation_new.gtf -g /Users/Desktop/Xla.v91.repeatMasked.fa -w transcripts.fasta

cufflinks stringtie rnaseq transcript gffread • 2.9k views
ADD COMMENT
4
Entering edit mode
4.2 years ago

Hi, jjrin.

I'm having a similar situation here. I have a BAM file from Bowtie2 and a BAM file from BWA. The Bowtie2 BAM file works fine but when I run gffread with the BWA BAM file I got the same error. I found in other threads that it may be related with soft-clipped reads extending beyond the contig.

Anyway, I got a workarround by just replacing the end position of the GTF file by the sequence size on my reference FASTA file. If this solution is a valid one to you, I let the script here: gtf_fixer_togffread.py

ADD COMMENT
0
Entering edit mode

Thank you so much for this. I tried to run it but I get a key error which is probably due to my gtf file.

Traceback (most recent call last):
  File "/Users/margaretsaha/Desktop/gtf_fixer_to_gffread.py", line 23, in <module>
    transcript_size = int(transcript[header])
KeyError: 'Scaffold100'

My gtf file includes scaffolds as well as chromosomes which may complicate this, however my fasta file also has these scaffolds. Is there a workaround where I can just skip the scaffolds or include them?

Here is the first line of my gtf file:

Scaffold100 StringTie   transcript  65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; xloc "XLOC_000001"; class_code "u"; tss_id "TSS1";
Scaffold100 StringTie   exon    65415   65755   .   +   .   transcript_id "MSTRG.5.1"; gene_id "MSTRG.5"; exon_number "1";
ADD REPLY
0
Entering edit mode

It looks like the sequence names in the fasta file are different. Can I see the head of your fasta file?

ADD REPLY
0
Entering edit mode

My fasta file looks like this:

>Scaffold100
gttcaaactttcataacctgccaaattttgtaaaatgaacatggtgtggccacaaaaatggtcgtggtcaaaaaattcac
tgcgcgcaagtttttttgtccctctttttatttccaaaatgttgggaggtATttagacatttcacatttatattctcata
taccacactttccacattacacaATCTTCAGCCACCTTATAAATGGAACCTGCCGGTGCTACTGCTTTACATTGCTCTCT
CTTTATATAAATGAATGAGGCAATGTGATAGTGGAAATGTGTCCGATATAATGTAATTATAAGGTTCGACACTTAGAATA
AGAGAAGGAAAACTCTTCTGCATGTGATGAGAATCCTGCAGTGTGAGATGTGGCCTCAAGGGGGAGACATTTCCTTCACC
AAATTATCCCAGAGAAAGATCATTTTTATTCTGAGTTGGAAGCTGAGGAACGCGTTTCGTGTTTAATTCCATTTAAAAAA
AAAAACTTTTAAAAGGCGATAAAACGAATCTCTTGCAAATTTCCTATGTCCAACTGCCCTGTGATTCTATTGGGGGGGTG
TCCTCCCATGGGAACTGCCTGGCGACAGATTGAGGACGAGGTTTTTCATGGGGAATTTAAAACAATTTATCAACAGAAAC
AAAAGTGTGAACGTGTTATTTTTAGTAAACCTGCATTATAGGGGTggttcacctttaattaaactagtatgctatagaat

I have both a linearized and a line separated version of the fasta files.

ADD REPLY
1
Entering edit mode

Ok, you may try downloading the updated code and running it again.

Please, let me know if it works, I'm really curious about that.

ADD REPLY
0
Entering edit mode

It works great, thank you so much! The fixed annotation looks good (you have to remove mtDNA but I was going to do that anyways) and running through gffread is seamless. Thanks again for the help!

ADD REPLY
0
Entering edit mode

Dear Victor I'm using your pyton script to fix my gtf but I got the following error: Traceback (most recent call last): File "/home/uva_dpvrf_1/uva_dpvrf_1_1/bin/gtf_fixer_to_gffread.py", line 23, in <module> transcript_end = int(line_list[4]) IndexError: list index out of range I have no idea how to program in python so I'm not able to understand what to change in the script... could you help me? any ideas? Thank you so much in advance!

ADD REPLY
0
Entering edit mode

It is already fixed, my gtf had an awful 2-line header that I had to remove. Now it is working perfectly, thank you for your script!

ADD REPLY

Login before adding your answer.

Traffic: 1007 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6