I've done a rather large transcriptome experiment with a non-model organism. Although it's not published yet, I did have access to a reference genome and gff annotation file. I'm at the point where I have lists of DE genes for different contrasts and want to find out what these genes are and what they do. Because the genome isn't published, I can't just use the gene IDs in BLAST. Instead I have to use them to find the corresponding protein sequences in the gff file, make a fasta file with those sequences, and then BLAST it. Across all the contrasts I've done I probably have upwards of 5,000 genes to BLAST, and I'm wondering what the most efficient way to do this is.
Because the protein sequence info in the gff file isn't really listed in its own cell, I can't figure out a way to just pull out the sequences for the genes I want. The best I can do is convert the gff into a text file and use FIND to locate the gene IDs, then cut and paste the sequences as I build the fasta file that will eventually get BLASTed. With this many genes, that will take a very long time, so I just want to know if there's an easier way, ideally without having to do any coding (which I unfortunately am not proficient in). I've tried to use R to pull the rows I want, but like I said, the protein sequences don't even show up when I convert the gff file to a table. I'll do it by hand if I have to, but I want to make sure I'm not overlooking a faster way.
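
For reference, the find-and-copy step described above could be sketched as a short Python script. This is only a sketch and rests on an assumption I can't confirm from the file description alone: that the gff uses an AUGUSTUS-style layout, where each gene block opens with a comment line like `# start gene g1` and the protein is stored in comment lines beginning `# protein sequence = [` and ending with `]` (possibly wrapped over several `#` lines). That would also explain why the sequences vanish when the file is read as a table, since comment lines are usually skipped. The function names and the file paths passed to `run` are made up for illustration.

```python
import re


def extract_proteins(gff_lines):
    """Map gene ID -> protein sequence.

    ASSUMPTION: AUGUSTUS-style comments ('# start gene <id>' and
    '# protein sequence = [...]'). If a gene has several transcripts,
    the last protein seen for that gene ID wins.
    """
    proteins = {}
    gene_id = None
    buffer = None  # None means we are not inside a protein block
    for raw in gff_lines:
        line = raw.strip()
        start = re.match(r"#\s*start gene\s+(\S+)", line)
        if start:
            gene_id = start.group(1)
            buffer = None
            continue
        if line.startswith("# protein sequence = ["):
            # First line of the protein block: keep what follows '['.
            buffer = line.split("[", 1)[1]
        elif buffer is not None and line.startswith("#"):
            # Continuation line of a wrapped sequence.
            buffer += line.lstrip("#").strip()
        else:
            continue
        if "]" in buffer:
            # Closing bracket reached: store the finished sequence.
            if gene_id is not None:
                proteins[gene_id] = buffer.split("]", 1)[0]
            buffer = None
    return proteins


def to_fasta(proteins, wanted_ids, width=60):
    """Build FASTA text for the wanted gene IDs, skipping IDs not found."""
    records = []
    for gid in wanted_ids:
        seq = proteins.get(gid)
        if seq is None:
            continue
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        records.append(f">{gid}\n{wrapped}\n")
    return "".join(records)


def run(gff_path, ids_path, out_path):
    """Read a gff and a one-ID-per-line gene list, write a FASTA file."""
    with open(gff_path) as handle:
        proteins = extract_proteins(handle)
    with open(ids_path) as handle:
        wanted = [line.strip() for line in handle if line.strip()]
    with open(out_path, "w") as handle:
        handle.write(to_fasta(proteins, wanted))
```

Calling something like `run("annotation.gff", "de_genes.txt", "de_proteins.fasta")` (hypothetical file names) would then write one fasta file for a whole gene list, which could be BLASTed in a single batch. If the gff stores its sequences differently, the two patterns in `extract_proteins` would need to be changed to match.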