Question: Creating fasta files to BLAST- what's that fastest way
gravatar for cadeans
2.4 years ago by
cadeans10 wrote:


I've done a rather large transcriptome experiment with a non-model organism. Although its not published yet, I did have access to a reference genome and gff annotation file. I'm at the point where I have lists of DE genes for different contrasts and want to find out what these genes are/do. Because the genome isn't published I can't just use the gene IDs in BLAST. Instead I have to use them to find the protein sequences they correspond to in the gff file and make a fasta file with this sequence info and then BLAST it. Across all the contrasts I've done I probably have upwards of 5,000 genes to BLAST and I'm wondering what the most efficient way to do this is.

Because the protein sequence info in the gff file isn't really listed in its own cell, I can't figure out a way to to just pull out the sequences for the genes I want. The best I can do is convert the gff into a text file and use FIND to locate the gene IDs, then cut/paste the sequences as I make the fasta file that will eventually get BLASTed. With this many genes, it will take a very long time to do this and I just want to know if there's an easier way--- without having to do any coding (which I unfortunately am not proficient in). I've tried to use R to pull the rows I want, but like I said, the protein sequences don't even show up when I convert the gff file to a table. I'll do it by hand if I have to but I just want to make sure I'm not overlooking a faster way.

thanks, carrie

blast gff fasta • 918 views
ADD COMMENTlink modified 2.4 years ago by Carlo Yague4.1k • written 2.4 years ago by cadeans10

It would help if you could show a few lines of the gff file.

ADD REPLYlink written 2.4 years ago by Carlo Yague4.1k

Sure. Here's a link to a screen shot of the gff file in galaxy.

ADD REPLYlink written 2.4 years ago by cadeans10

I'm not sure I understand you completely. Is this a reference based transcriptome assembly (e.g. cufflinks)?

The GFF shouldn't have sequences in it. Typically they're used to mark features of genes in a genome (UTRs, exons, etc).

The GFF will have the locations of features in your genome. You'll have to use the transcript or CDS features extract the transcripts , from there you'll have to predict your protein sequences from the transcripts. There are several scripts posted on here and other places for extracting sequences from a GFF. For the peptide predictions I would use TransDecoder.

ADD REPLYlink written 2.4 years ago by pld4.7k

I used TopHat with a reference genome. It's just that the genome is very new and the lab that provided me with it hasn't published it yet. I guess my issue is, now that I have done my DE analysis and have lists of DE gene IDs, how do I assemble the sequences that correspond to these IDs into a fasta file to BLAST them (without have to build the fasta files by hand, e.g. cutting/pasting the sequences one by one for each DE gene).

thanks, carrie

ADD REPLYlink written 2.4 years ago by cadeans10

There's a tool in the TopHat suite just for this purpose:

See the gffread utility. Not sure how it will treat the protein sequences in the GFF file. Where did these protein sequences come from? Did the lab you got the genome from provide you with annotations for the genome?

ADD REPLYlink written 2.4 years ago by pld4.7k
gravatar for Carlo Yague
2.4 years ago by
Carlo Yague4.1k
Carlo Yague4.1k wrote:

Ok so from your question and the picture, I understand that you just want to parse your gff file to retrieve the sequences as fasta for the genes that are differentially expressed.

You can do that using the terminal with grep, sed and awk. For this I imagine that your genes ID you are interested in are saved in a file with one ID per line (myIDs.txt). The grep command will extract the lines of your GFF (myGFF.gff) file that match the IDs, then the result is sent to sed that will change the delimiters to faciliate parsing with awk and save the results in myFasta.fasta

grep -F -f myIDs.txt myGFF.gff | sed "s/\=/\;/g" | sed "s/\[/\;/g" | sed "s/]/\;/g" | awk -F ";" '{print ">"$2"\n"$7}' > myFasta.fasta

At least, it works with my toy example :

echo -e "scaff\tScipio\tgene\t8180\t86694\t.\t-\t.\tID=1710089;Name=Hz5G2000001;protein sequence=[ADADADJIJAODZD]"  | sed "s/\=/\;/g" | sed "s/\[/\;/g" | sed "s/]/\;/g" | awk -F ";" '{print ">"$2"\n"$7}'
ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Carlo Yague4.1k
gravatar for mastal511
2.4 years ago by
mastal5111.9k wrote:

Have you looked at the R/Bioconductor packages IRanges and GenomicRanges?

ADD COMMENTlink written 2.4 years ago by mastal5111.9k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 573 users visited in the last hour