I want to use htseq-count (http://www-huber.embl.de/users/anders/HTSeq/doc/count.html) to get gene counts (RPKM) to analyze DE genes in an RNA-seq experiment. I'm NOT interested in alternative splicing, only in getting RPKM values for downstream analyses with DEGseq, an R package. htseq-count requires a GFF file, but I only have my reference.fasta. Is there any way I can use the FASTA file directly, or convert it to GFF?
I do not think it's wise to use RPKM values for differential gene expression (DGE) analysis in the edgeR or DEGseq R packages, because RPKM values are not raw counts: they have already been normalized.
If you want to do this, get the script from https://github.com/vsbuffalo/sam2counts and use it to count the raw reads that map to features in a SAM file.
Then choose edgeR over DEGseq, because with edgeR you can normalize these raw counts to account for library size and so on.
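To make the difference between raw counts and RPKM concrete, here is a minimal sketch of the usual RPKM formula (reads per kilobase of feature per million mapped reads). The gene names, counts, lengths and library size are made-up numbers for illustration only, not values from any real experiment.

```python
def rpkm(count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical raw counts, feature lengths (bp) and library size.
raw_counts = {"geneA": 500, "geneB": 500}
lengths_bp = {"geneA": 1000, "geneB": 2000}
total_mapped_reads = 10_000_000

rpkm_values = {
    gene: rpkm(raw_counts[gene], lengths_bp[gene], total_mapped_reads)
    for gene in raw_counts
}

for gene, value in rpkm_values.items():
    print(gene, raw_counts[gene], value)
```

Both genes have the same raw count, but the RPKM values differ because the formula already divides by feature length and library size. That is why the raw counts, not the RPKM values, are what you would hand to a package that does its own normalization.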
Each file type was invented to represent a certain type of information. A FASTA file is meant to store sequences; a GFF file is meant to represent genomic features (intervals). In general there is no way to convert directly between the two.
As daler points out above, if your FASTA file happens to store each gene as a separate record and also lists extra information about the coordinates in the header, then we could give you a parser that generates a GFF from it (post the header). Another option: if you know the gene sequences, you could align them to the genome and thus create your own annotations.
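For the simplest case, here is a rough sketch of such a parser. It assumes (and this is only an assumption, since the header has not been posted) that every FASTA record is one gene or transcript and that the reads were aligned to those records, so each feature simply spans its whole record. The function names, the "fasta2gff" source label, the "exon" feature type, the "+" strand and the GTF-style gene_id attribute are all illustrative choices, not something htseq-count prescribes for your data.

```python
def fasta_lengths(lines):
    """Yield (sequence_id, length) for each record in an iterable of FASTA lines."""
    seq_id, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, length
            # Use the first whitespace-delimited token of the header as the ID
            # (assumes the header is not empty).
            seq_id, length = line[1:].split()[0], 0
        else:
            length += len(line)
    if seq_id is not None:
        yield seq_id, length


def fasta_to_gff(lines, source="fasta2gff", feature="exon"):
    """Yield one tab-separated annotation line per FASTA record.

    Each feature covers the entire record (1..length); the strand is
    arbitrarily set to '+', which is an assumption.
    """
    for seq_id, length in fasta_lengths(lines):
        attributes = f'gene_id "{seq_id}";'
        yield "\t".join(
            [seq_id, source, feature, "1", str(length), ".", "+", ".", attributes]
        )


# Tiny in-memory example; with a real file, pass open("reference.fasta") instead.
demo = [">geneA some description", "ACGT", "AC", ">geneB", "GGG"]
for row in fasta_to_gff(demo):
    print(row)
```

If the FASTA is a genome rather than a set of gene sequences, this will not give you anything useful: the features would just be whole chromosomes or contigs, and you would still need real annotations, for example from aligning known gene sequences to the genome.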