Given a genomic fasta file and a GFF3 file like this (this is just one transcript):
groupI AUGUSTUS gene 24128 27467 1 + . ID=g2
groupI AUGUSTUS transcript 24128 27467 . + . ID=g2.t1;Parent=g2
groupI AUGUSTUS start_codon 24128 24130 . + 0 Parent=g2.t1
groupI AUGUSTUS CDS 24128 24311 . + 0 ID=g2.t1.cds;Parent=g2.t1
groupI AUGUSTUS CDS 25287 25354 . + 2 ID=g2.t1.cds;Parent=g2.t1
groupI AUGUSTUS CDS 27267 27467 . + 0 ID=g2.t1.cds;Parent=g2.t1
groupI AUGUSTUS stop_codon 27465 27467 . + 0 Parent=g2.t1
Is there a simple way to extract a fasta of the CDS (or of the translated sequence) for each transcript?
I have already seen:
- Extract Cds Fastas From A Gff Annotation + Reference Sequence where some people have posted and amended perl scripts to do it
- http://biopython.org/wiki/Gene_predictions_to_protein_sequences a script base that uses a non-standard part of Biopython
- also I have tried to use cufflinks's gffread like this:
gffread file.gff3 -g genome.fa -y proteins.fa
but the proteins sequence are nonsense, although the lengths are consistent.. (Am I doing something wrong?) - edit: as well as bedtools getfasta, but I get each element's sequence separately (see my comments below)
Isn't there something more simple and reliable than the above custom scripts?
Did you try bedtools getfasta ?
I did, see my answer to WouterDeCoster below. I cannot get per-transcript CDS sequences, I just get each element (gene, transcript, start/end codons, and each CDS fragment) separately. Is this the expected result?
I see what you mean. This might help you A: Gencode Gtf To Bed12 Or Psl