How to extract exon sequences from annotated genome
2
0
Entering edit mode
5.9 years ago
mforthman ▴ 40

I have a .gff3 file for an insect genome, which looks like the following:

> Scaffold1 I5K gene    30267   30861   .   +   .   ID=OFAS000001;Name=OfasTmpM000001;
> Scaffold1 I5K mRNA    30267   30861   .   +   .   ID=OFAS000001-RA;Name=OfasTmpM000001-RA;Parent=OFAS000001;
> Scaffold1 I5K exon    30267   30428   .   +   .   ID=OFAS000001-RA-EXON01;Parent=OFAS000001-RA;
> Scaffold1 I5K exon    30763   30861   .   +   .   ID=OFAS000001-RA-EXON02;Parent=OFAS000001-RA;
> Scaffold1 I5K CDS 30267   30428   .   +   0   ID=OFAS000001-RA-CDS01;Parent=OFAS000001-RA;
> Scaffold1 I5K CDS 30763   30861   .   +   0   ID=OFAS000001-RA-CDS02;Parent=OFAS000001-RA;
> Scaffold1 I5K gene    401906  407738  .   +   .   ID=OFAS000002;Name=OfasTmpA000002;
> Scaffold1 I5K mRNA    401906  407738  .   +   .   ID=OFAS000002-RA;Name=OfasTmpA000002-RA;Parent=OFAS000002;Dbxref=Coils:Coil;
> Scaffold1 I5K exon    401906  401926  .   +   .   ID=OFAS000002-RA-EXON01;Parent=OFAS000002-RA;
> Scaffold1 I5K exon    402939  407738  .   +   .   ID=OFAS000002-RA-EXON02;Parent=OFAS000002-RA;


I have the corresponding genome and CDS fasta files. What I need from all of this is the sequence of each exon in one fasta file. I only want individual exons, nothing else. I've seen this: Extract Cds Fastas From A Gff Annotation + Reference Sequence which seeks to extract CDS sequences only, but I have no clue how to adapt this for my purpose (I'm not skilled in scripting). I've also searched for some scripts for my purpose, but everything I've seen deals with obtaining CDS sequences from .gff files. Any help would be most appreciated.

annotation genome exon • 6.4k views
1
Entering edit mode

Here is an idea of how to do this: Extract the lines (which say exon) and then extract the Scaffold name, start, stop positions from those lines in a BED file. Then you can use getfasta from BEDTools to get the sequence of the exons.

If someone does not provide code for you by later today I will take a look.

Edit: @igor and I posted basically the same idea at the same time. You can follow his since he has the code in there.

0
Entering edit mode

Thank y'all!

How can I also get the ID names included in the sequence headings? I stumbled upon Galaxy to get exon sequences after filtering exons using Excel sorting function. However, the Galaxy output headings for the sequences read as "Scaffold[some number]_[start position]_[end position]_[strand direction]", for example:

>Scaffold8389_2908_2948_+
AACTTAGTGCAAATCCAAGTTTCTATAATCGAGTACTCTTT
>Scaffold8389_4377_4550_+
GCTTTCCCTAATGATAAAGTAAAAGAAGTTGAAGAAAAGATGTTAAATTC
ACTTAATCCCAAGTTAATACAGCTCTTCCCGTCATCTGTAATGCATATGT
TTGATCCTTTGGCTTCCAGTTCAGTTCGTGAGTACCCTATTTTCTCTGAT
AAAATCTTAGAAAACCAAGCCTTT


While this is great, having the ID of the corresponding sequence would be more helpful. This is what my .gff3 file now looks like:

Scaffold1   30267   30428   .   +   .   ID=OFAS000001-RA-EXON01;
Scaffold1   30763   30861   .   +   .   ID=OFAS000001-RA-EXON02;
Scaffold1   401906  401926  .   +   .   ID=OFAS000002-RA-EXON01;
Scaffold1   402939  407738  .   +   .   ID=OFAS000002-RA-EXON02;
Scaffold1   674114  674596  .   +   .   ID=OFAS000003-RA-EXON01;
Scaffold1   707416  707522  .   +   .   ID=OFAS000004-RA-EXON01;
Scaffold1   709746  709854  .   +   .   ID=OFAS000004-RA-EXON02;
Scaffold1   710078  712192  .   +   .   ID=OFAS000005-RA-EXON01;
Scaffold1   771803  771959  .   -   .   ID=OFAS000006-RA-EXON01;
Scaffold1   777418  777433  .   -   .   ID=OFAS000006-RA-EXON02;

0
Entering edit mode

bedtools getfasta has a -name flag you can add to "use the “name” column in the BED file for the FASTA headers in the output FASTA file". I am not sure what it will do in your case, but it's worth a try.

0
Entering edit mode

Bedtools worked but took some reformatting of my gff file for it to work, which Excel is very useful to do!

3
Entering edit mode
5.9 years ago
igor 12k

You can use bedtools to convert GFF to FASTA (bedtools is not just for BED files):

bedtools getfasta -fi genome.fa -bed file.gff -fo exons.fa .

You may want to filter it for exons only first (grep -w "exon" file.gff) if you want just exons.

0
Entering edit mode

the exon ID must the exactly the same in both (genome.fa and file.gff) input files? thanks

0
Entering edit mode

The exon info is only in the GTF/GFF file. The FASTA file is just the sequence.

If you are asking about the chromosome (or contig) names, then those should match between both files. That is the only link between the two.

0
Entering edit mode
5.9 years ago
EagleEye 7.3k

This might be helpful, pass the exon coordinates to fastscmd

A: Retrieve a subset of FASTA from large multi-FASTA file