Given a genomic fasta file and a GFF3 file like this (this is just one transcript):
groupI AUGUSTUS gene 24128 27467 1 + . ID=g2 groupI AUGUSTUS transcript 24128 27467 . + . ID=g2.t1;Parent=g2 groupI AUGUSTUS start_codon 24128 24130 . + 0 Parent=g2.t1 groupI AUGUSTUS CDS 24128 24311 . + 0 ID=g2.t1.cds;Parent=g2.t1 groupI AUGUSTUS CDS 25287 25354 . + 2 ID=g2.t1.cds;Parent=g2.t1 groupI AUGUSTUS CDS 27267 27467 . + 0 ID=g2.t1.cds;Parent=g2.t1 groupI AUGUSTUS stop_codon 27465 27467 . + 0 Parent=g2.t1
Is there a simple way to extract a fasta of the CDS (or of the translated sequence) for each transcript?
I have already seen:
- Extract Cds Fastas From A Gff Annotation + Reference Sequence where some people have posted and amended perl scripts to do it
- http://biopython.org/wiki/Gene_predictions_to_protein_sequences a script base that uses a non-standard part of Biopython
- also I have tried to use cufflinks's gffread like this:
gffread file.gff3 -g genome.fa -y proteins.fabut the proteins sequence are nonsense, although the lengths are consistent.. (Am I doing something wrong?)
- edit: as well as bedtools getfasta, but I get each element's sequence separately (see my comments below)
Isn't there something more simple and reliable than the above custom scripts?