Question

Creating fasta files to BLAST- what's that fastest way

0

Entering edit mode

8.1 years ago

cadeans ▴ 10

Hello,

I've done a rather large transcriptome experiment with a non-model organism. Although its not published yet, I did have access to a reference genome and gff annotation file. I'm at the point where I have lists of DE genes for different contrasts and want to find out what these genes are/do. Because the genome isn't published I can't just use the gene IDs in BLAST. Instead I have to use them to find the protein sequences they correspond to in the gff file and make a fasta file with this sequence info and then BLAST it. Across all the contrasts I've done I probably have upwards of 5,000 genes to BLAST and I'm wondering what the most efficient way to do this is.

Because the protein sequence info in the gff file isn't really listed in its own cell, I can't figure out a way to to just pull out the sequences for the genes I want. The best I can do is convert the gff into a text file and use FIND to locate the gene IDs, then cut/paste the sequences as I make the fasta file that will eventually get BLASTed. With this many genes, it will take a very long time to do this and I just want to know if there's an easier way--- without having to do any coding (which I unfortunately am not proficient in). I've tried to use R to pull the rows I want, but like I said, the protein sequences don't even show up when I convert the gff file to a table. I'll do it by hand if I have to but I just want to make sure I'm not overlooking a faster way.

thanks, carrie

blast fasta gff • 2.4k views

ADD COMMENT • link updated 8.1 years ago by Carlo Yague 8.7k • written 8.1 years ago by cadeans ▴ 10

1

Entering edit mode

It would help if you could show a few lines of the gff file.

ADD REPLY • link 8.1 years ago by Carlo Yague 8.7k

0

Entering edit mode

Sure. Here's a link to a screen shot of the gff file in galaxy. http://imgur.com/TQu4qZK

ADD REPLY • link 8.1 years ago by cadeans ▴ 10

0

Entering edit mode

I'm not sure I understand you completely. Is this a reference based transcriptome assembly (e.g. cufflinks)?

The GFF shouldn't have sequences in it. Typically they're used to mark features of genes in a genome (UTRs, exons, etc). http://www.sequenceontology.org/gff3.shtml

The GFF will have the locations of features in your genome. You'll have to use the transcript or CDS features extract the transcripts , from there you'll have to predict your protein sequences from the transcripts. There are several scripts posted on here and other places for extracting sequences from a GFF. For the peptide predictions I would use TransDecoder.

ADD REPLY • link 8.1 years ago by pld 5.1k

0

Entering edit mode

I used TopHat with a reference genome. It's just that the genome is very new and the lab that provided me with it hasn't published it yet. I guess my issue is, now that I have done my DE analysis and have lists of DE gene IDs, how do I assemble the sequences that correspond to these IDs into a fasta file to BLAST them (without have to build the fasta files by hand, e.g. cutting/pasting the sequences one by one for each DE gene).

thanks, carrie

ADD REPLY • link 8.1 years ago by cadeans ▴ 10

0

Entering edit mode

There's a tool in the TopHat suite just for this purpose:

http://cole-trapnell-lab.github.io/cufflinks/file_formats/

See the gffread utility. Not sure how it will treat the protein sequences in the GFF file. Where did these protein sequences come from? Did the lab you got the genome from provide you with annotations for the genome?

ADD REPLY • link 8.1 years ago by pld 5.1k

score 1 · Answer 1 · 2016-03-25

Ok so from your question and the picture, I understand that you just want to parse your gff file to retrieve the sequences as fasta for the genes that are differentially expressed.

You can do that using the terminal with grep, sed and awk. For this I imagine that your genes ID you are interested in are saved in a file with one ID per line (myIDs.txt). The grep command will extract the lines of your GFF (myGFF.gff) file that match the IDs, then the result is sent to sed that will change the delimiters to faciliate parsing with awk and save the results in myFasta.fasta

grep -F -f myIDs.txt myGFF.gff | sed "s/\=/\;/g" | sed "s/\[/\;/g" | sed "s/]/\;/g" | awk -F ";" '{print ">"$2"\n"$7}' > myFasta.fasta

At least, it works with my toy example :

echo -e "scaff\tScipio\tgene\t8180\t86694\t.\t-\t.\tID=1710089;Name=Hz5G2000001;protein sequence=[ADADADJIJAODZD]"  | sed "s/\=/\;/g" | sed "s/\[/\;/g" | sed "s/]/\;/g" | awk -F ";" '{print ">"$2"\n"$7}'
>1710089
ADADADJIJAODZD

score 0 · Answer 2 · 2016-03-24

0

Entering edit mode

8.1 years ago

mastal511 ★ 2.1k

Have you looked at the R/Bioconductor packages IRanges and GenomicRanges?

ADD COMMENT • link 8.1 years ago by mastal511 ★ 2.1k