also I have tried to use cufflinks's gffread like this: gffread file.gff3 -g genome.fa -y proteins.fa but the proteins sequence are nonsense, although the lengths are consistent.. (Am I doing something wrong?)
edit: as well as bedtools getfasta, but I get each element's sequence separately (see my comments below)
Isn't there something more simple and reliable than the above custom scripts?
Are you happy with R/Bioconductor? There is a solution here which could solve your problem. Look for item #6 of 'Alignments (and genomic annotations)'. I will transcribe the relevant (but incomplete) bits of code here for convenience, but all credits go to Martin Morgan.
## cds for all transcripts
cdsByTx <- cdsBy(TxDb.Hsapiens.UCSC.hg19.knownGene, "tx")
## reference genome
dna <- extractTranscriptSeqs(BSgenome.Hsapiens.UCSC.hg19, cdsByTx)
protein <- translate(dna)
The xtractore program from the AEGeAn Toolkit should work for this. Set --type=CDS on the command line. For CDSs and other discontiguous features, xtractore pastes the fragments together correctly, taking into account strand, etc.
You can also get complete sequences of (converted) proteins using prot instead of using CDS or cDNAs using cDNA or genes using gene.
When you get the sequence for each CDS separately for instance with bedtools getfasta, first of all you get each piece of a CDS separately rather than getting the CDS sequences of a gene combined, and second for genes with isoforms, you can get the same region more than once in overlapping sequences which is not good if you are trying to analyze the coding region of a genome. As far as I managed to understand, this script gets the longest isoform version possible for an exon, and merges them for each gene.
The script is designed for the EVidenceModeler pipeline, so it doesn't have to work with the gffs coming from all different annotation pipelines, but so far it was working well for me, see this post if you also run into the error that I just faced with one gff from RefSeq: