Question: How to get fasta sequences of features from a GFF3 file?
Lina F (Boston, MA) wrote, 20 months ago:

Hi all,

I am interested in pulling out "CDS" fasta sequences from a GFF3 file. My GFF3 file lists one or more fasta sequences at the end.

I looked into using the gffutils Python package, but according to its documentation it does not handle fasta sequences listed at the end of the GFF3 file.

Am I stuck writing my own parser, or is there a better solution?

Thanks for any suggestions!

EDITED to add a snippet of my GFF file:

##gff-version 3
##sequence-region Consensus_10_consensus_sequence_1 1 645439
Consensus_10_consensus_sequence_1   feature gene    551 2104    .   +   .   name=gene
Consensus_10_consensus_sequence_1   feature CDS 551 2104    .   +   0   name=YALI0C06512p CDS
...
Consensus_9_consensus_sequence_4    feature mRNA    9625    9891    .   +   .   name=mRNA
Consensus_9_consensus_sequence_4    feature CDS 9625    9891    .   +   0   name=YALI0A21351p CDS
##FASTA
>Consensus_10_consensus_sequence_1 <unknown description>
TGCCACCTCCAAATTAACTCTCGCTTATTTCTTGTACCTGTCATATCACGTGATGTAGCT
TCCCAATCAAGAGCGGATCCTGCCTGTTTGGCTGCGTGGGTTTGCGTCTTCTTTCCGTTT
GAAGCAGTGGTATTATTCCCCCATTGTGCCAAAAGTAATGCTGAAAAGATGCCCACGAAT
>Consensus_10_consensus_sequence_2 <unknown description>
CCGAAACCACAGCCATGATCAGAGTCACTCCTATTCCAACCCCGCCAACTGCCGTGGGGT
ACACGTTATATTGGGTAGGTGTGTATCCTTGAGACTTGAGCCAGGAGATCATGGAAGGTT
GGGTGCTGGCAATACAGGTGTTATTGTAGCAAAGAAAGATGAGAGAGAAGAAATAAATGT
GCCAGGTTTTCAAAGAGCGTTTCAAAACTTGCAAGAACGGCTCTTCTTTGTTACCGGTAT
CGCCGATTTGTGCTCTTCTCTCCTTGGCTATAGCAATGTCTTCCTTGGTGAAGTACCAGC
TGGTAGTTGTCTCCGGTGTGTTTGGGTTGACGAACATGGTGTAAAGTGCCACTGGAAATG
>Consensus_10_consensus_sequence_2 <unknown description>
CACTCTCAAGCATTTAGGAACTTGTCAAGAGGTTCAAAGGTTGGAACTTGCAACTGAACT
GATCGCAACATAATCACCGCTATTAAACCCTGATTACAAGTGCTTCTGATTGCATCCACA
GTTCATTTCCATGGGCTAGGCTATACGAAAATACAAGGATTAGAAACTATATACAATTGA
CTCTGCAATCTTTCCCGCTAAACGGTGGTGTGGTTATGACCTGGCTCGTGTTCATGGCCG
...

"My GFF3 file lists one or more fasta sequences at the end."

What does that mean?

written 20 months ago by genomax

Please post a snippet of your GFF3 file.

written 20 months ago by Alex Reynolds
Matt Shirley (Cambridge, MA) wrote, 20 months ago:

You should just split your file into two separate files, a GFF and FASTA. Then you can do:
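For the split itself, a minimal Python sketch (not necessarily what this answer originally showed), assuming the sequence section starts at the `##FASTA` directive as in the snippet in the question. The function names and file names are hypothetical.

```python
def split_gff3(lines):
    """Split combined GFF3 lines into (annotation_lines, fasta_lines).

    Assumption: everything after the first '##FASTA' line is FASTA.
    The directive line itself is dropped from both outputs.
    """
    gff, fasta = [], []
    in_fasta = False
    for line in lines:
        if not in_fasta and line.strip() == "##FASTA":
            in_fasta = True
            continue
        (fasta if in_fasta else gff).append(line)
    return gff, fasta


def split_gff3_file(combined_path, gff_out, fasta_out):
    """Read a combined file and write the two parts to separate files."""
    with open(combined_path) as src:
        gff, fasta = split_gff3(src)
    with open(gff_out, "w") as handle:
        handle.writelines(gff)
    with open(fasta_out, "w") as handle:
        handle.writelines(fasta)
```

Calling `split_gff3_file("combined.gff3", "features.gff3", "genome.fa")` would then give one annotation file and one sequence file that other tools can take as separate inputs.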


This approach seemed easiest, thanks!

written 19 months ago by Lina F
prasundutta87 wrote, 20 months ago:

As you have the coordinates of your sequence of interest, you can use samtools faidx.

First build an index of your reference genome with samtools faidx, then run the same command with a region to extract your sequence of interest. The usage message that samtools faidx prints on the command line is a helpful guide.
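To make the coordinate handling concrete, here is a pure-Python sketch of the same idea. It is not samtools and does no indexing; it only illustrates pulling a region out of a sequence. It assumes 1-based, inclusive start and end, as in GFF3 columns 4 and 5, and reverse-complements minus-strand features. The function names are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA lines into {name: sequence}.

    The name is the first word after '>'. Assumption: names are unique;
    a repeated header replaces the earlier record.
    """
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract(seqs, seqid, start, end, strand="+"):
    """Return seqs[seqid] from start to end (1-based, inclusive)."""
    sub = seqs[seqid][start - 1:end]
    if strand == "-":
        # Reverse complement for features on the minus strand.
        sub = sub.translate(_COMPLEMENT)[::-1]
    return sub
```

For the first CDS in the question's snippet, the call would be `extract(seqs, "Consensus_10_consensus_sequence_1", 551, 2104, "+")`. Note that the snippet repeats one FASTA header, which this sketch would treat as a replacement of the earlier record.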


If the FASTA sequence is already in the attributes field, it may be a simple parsing step to excise that info.

written 20 months ago by Alex Reynolds
bartosovic.marek wrote, 20 months ago:

bedtools getfasta can do that.

Use -s to force strandedness.

http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

bedtools getfasta -fi yourgenome.fa -fo out.fa -bed yourgff.bed -s
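Since the question asks only for "CDS" features, the annotation rows may need filtering before being passed to a command like the one above. A minimal Python sketch, assuming a tab-separated GFF3 with the feature type in column 3; the function name is hypothetical.

```python
def cds_lines(gff_lines, feature_type="CDS"):
    """Keep only feature rows whose type column (column 3) matches.

    Assumptions: columns are tab-separated, and anything after a
    '##FASTA' directive is sequence data and can be ignored.
    """
    kept = []
    for line in gff_lines:
        if line.strip() == "##FASTA":
            break
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 9 and cols[2] == feature_type:
            kept.append(line)
    return kept
```

Writing the returned lines to a file gives an annotation file containing only CDS rows. The snippet in the question appears space-separated after pasting, so this assumes the real file uses tabs.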
