Creating fasta files to BLAST- what's that fastest way
2
0
Entering edit mode
8.1 years ago
cadeans ▴ 10

Hello,

I've done a rather large transcriptome experiment with a non-model organism. Although its not published yet, I did have access to a reference genome and gff annotation file. I'm at the point where I have lists of DE genes for different contrasts and want to find out what these genes are/do. Because the genome isn't published I can't just use the gene IDs in BLAST. Instead I have to use them to find the protein sequences they correspond to in the gff file and make a fasta file with this sequence info and then BLAST it. Across all the contrasts I've done I probably have upwards of 5,000 genes to BLAST and I'm wondering what the most efficient way to do this is.

Because the protein sequence info in the gff file isn't really listed in its own cell, I can't figure out a way to to just pull out the sequences for the genes I want. The best I can do is convert the gff into a text file and use FIND to locate the gene IDs, then cut/paste the sequences as I make the fasta file that will eventually get BLASTed. With this many genes, it will take a very long time to do this and I just want to know if there's an easier way--- without having to do any coding (which I unfortunately am not proficient in). I've tried to use R to pull the rows I want, but like I said, the protein sequences don't even show up when I convert the gff file to a table. I'll do it by hand if I have to but I just want to make sure I'm not overlooking a faster way.

thanks, carrie

blast fasta gff • 2.4k views
ADD COMMENT
1
Entering edit mode

It would help if you could show a few lines of the gff file.

ADD REPLY
0
Entering edit mode

Sure. Here's a link to a screen shot of the gff file in galaxy. http://imgur.com/TQu4qZK

ADD REPLY
0
Entering edit mode

I'm not sure I understand you completely. Is this a reference based transcriptome assembly (e.g. cufflinks)?

The GFF shouldn't have sequences in it. Typically they're used to mark features of genes in a genome (UTRs, exons, etc). http://www.sequenceontology.org/gff3.shtml

The GFF will have the locations of features in your genome. You'll have to use the transcript or CDS features extract the transcripts , from there you'll have to predict your protein sequences from the transcripts. There are several scripts posted on here and other places for extracting sequences from a GFF. For the peptide predictions I would use TransDecoder.

ADD REPLY
0
Entering edit mode

I used TopHat with a reference genome. It's just that the genome is very new and the lab that provided me with it hasn't published it yet. I guess my issue is, now that I have done my DE analysis and have lists of DE gene IDs, how do I assemble the sequences that correspond to these IDs into a fasta file to BLAST them (without have to build the fasta files by hand, e.g. cutting/pasting the sequences one by one for each DE gene).

thanks, carrie

ADD REPLY
0
Entering edit mode

There's a tool in the TopHat suite just for this purpose:

http://cole-trapnell-lab.github.io/cufflinks/file_formats/

See the gffread utility. Not sure how it will treat the protein sequences in the GFF file. Where did these protein sequences come from? Did the lab you got the genome from provide you with annotations for the genome?

ADD REPLY
1
Entering edit mode
8.1 years ago

Ok so from your question and the picture, I understand that you just want to parse your gff file to retrieve the sequences as fasta for the genes that are differentially expressed.

You can do that using the terminal with grep, sed and awk. For this I imagine that your genes ID you are interested in are saved in a file with one ID per line (myIDs.txt). The grep command will extract the lines of your GFF (myGFF.gff) file that match the IDs, then the result is sent to sed that will change the delimiters to faciliate parsing with awk and save the results in myFasta.fasta

grep -F -f myIDs.txt myGFF.gff | sed "s/\=/\;/g" | sed "s/\[/\;/g" | sed "s/]/\;/g" | awk -F ";" '{print ">"$2"\n"$7}' > myFasta.fasta

At least, it works with my toy example :

echo -e "scaff\tScipio\tgene\t8180\t86694\t.\t-\t.\tID=1710089;Name=Hz5G2000001;protein sequence=[ADADADJIJAODZD]"  | sed "s/\=/\;/g" | sed "s/\[/\;/g" | sed "s/]/\;/g" | awk -F ";" '{print ">"$2"\n"$7}'
>1710089
ADADADJIJAODZD
ADD COMMENT
0
Entering edit mode
8.1 years ago
mastal511 ★ 2.1k

Have you looked at the R/Bioconductor packages IRanges and GenomicRanges?

ADD COMMENT

Login before adding your answer.

Traffic: 2442 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6