Hello, I am trying to extract some transcript sequences from a stringtie merged gtf using gffread and am getting the following errors:
Error (GFaSeqGet): subsequence cannot be larger than 227
Error getting subseq for MSTRG.17.1 (1..229)!
This error happens for many of gene entries, some but not all transcripts end up getting extracted. The gtf file was created in StringTie using the same reference genome file as everything else (where my bam files are from, the reference annotation gtf file). This annotation is from running StringTie first and merging the gtf annotation files. The main problem seems to be that the end coordinates exceed sequence length when extracting transcripts even though I am sure that they were made correctly. If there is a way to just skip that particular gene that could work as well.
This is the command I used: gffread /Users/Desktop/StringTie_comp_annotation_new.gtf -g /Users/Desktop/Xla.v91.repeatMasked.fa -w transcripts.fasta
Thank you so much for this. I tried to run it but I get a key error which is probably due to my gtf file.
My gtf file includes scaffolds as well as chromosomes which may complicate this, however my fasta file also has these scaffolds. Is there a workaround where I can just skip the scaffolds or include them?
Here is the first line of my gtf file:
It looks like the sequence names in the fasta file are different. Can I see the head of your fasta file?
My fasta file looks like this:
I have both a linearized and a line separated version of the fasta files.
Ok, you may try downloading the updated code and running it again.
Please, let me know if it works, I'm really curious about that.
It works great, thank you so much! The fixed annotation looks good (you have to remove mtDNA but I was going to do that anyways) and running through gffread is seamless. Thanks again for the help!
Dear Victor I'm using your pyton script to fix my gtf but I got the following error: Traceback (most recent call last): File "/home/uva_dpvrf_1/uva_dpvrf_1_1/bin/gtf_fixer_to_gffread.py", line 23, in <module> transcript_end = int(line_list[4]) IndexError: list index out of range I have no idea how to program in python so I'm not able to understand what to change in the script... could you help me? any ideas? Thank you so much in advance!
It is already fixed, my gtf had an awful 2-line header that I had to remove. Now it is working perfectly, thank you for your script!