Question: How to get fasta sequences of features from a GFF3 file?
1
gravatar for Lina F
2.4 years ago by
Lina F160
Boston, MA
Lina F160 wrote:

Hi all,

I am interested in pulling out "CDS" fasta sequences from a GFF3 file. My GFF3 file lists one or more fasta sequences at the end.

I looked into using the gffutils Python package, but according to its documentation it does not handle FASTA sequences listed at the end of a GFF3 file.

Am I stuck writing my own parser, or is there a better solution?

Thanks for any suggestions!

EDITED to add a snippet of my GFF file:

##gff-version 3
##sequence-region Consensus_10_consensus_sequence_1 1 645439
Consensus_10_consensus_sequence_1   feature gene    551 2104    .   +   .   name=gene
Consensus_10_consensus_sequence_1   feature CDS 551 2104    .   +   0   name=YALI0C06512p CDS
...
Consensus_9_consensus_sequence_4    feature mRNA    9625    9891    .   +   .   name=mRNA
Consensus_9_consensus_sequence_4    feature CDS 9625    9891    .   +   0   name=YALI0A21351p CDS
##FASTA
>Consensus_10_consensus_sequence_1 <unknown description>
TGCCACCTCCAAATTAACTCTCGCTTATTTCTTGTACCTGTCATATCACGTGATGTAGCT
TCCCAATCAAGAGCGGATCCTGCCTGTTTGGCTGCGTGGGTTTGCGTCTTCTTTCCGTTT
GAAGCAGTGGTATTATTCCCCCATTGTGCCAAAAGTAATGCTGAAAAGATGCCCACGAAT
>Consensus_10_consensus_sequence_2 <unknown description>
CCGAAACCACAGCCATGATCAGAGTCACTCCTATTCCAACCCCGCCAACTGCCGTGGGGT
ACACGTTATATTGGGTAGGTGTGTATCCTTGAGACTTGAGCCAGGAGATCATGGAAGGTT
GGGTGCTGGCAATACAGGTGTTATTGTAGCAAAGAAAGATGAGAGAGAAGAAATAAATGT
GCCAGGTTTTCAAAGAGCGTTTCAAAACTTGCAAGAACGGCTCTTCTTTGTTACCGGTAT
CGCCGATTTGTGCTCTTCTCTCCTTGGCTATAGCAATGTCTTCCTTGGTGAAGTACCAGC
TGGTAGTTGTCTCCGGTGTGTTTGGGTTGACGAACATGGTGTAAAGTGCCACTGGAAATG
>Consensus_10_consensus_sequence_2 <unknown description>
CACTCTCAAGCATTTAGGAACTTGTCAAGAGGTTCAAAGGTTGGAACTTGCAACTGAACT
GATCGCAACATAATCACCGCTATTAAACCCTGATTACAAGTGCTTCTGATTGCATCCACA
GTTCATTTCCATGGGCTAGGCTATACGAAAATACAAGGATTAGAAACTATATACAATTGA
CTCTGCAATCTTTCCCGCTAAACGGTGGTGTGGTTATGACCTGGCTCGTGTTCATGGCCG
...

Please post a snippet of your GFF3 file.

Reply written 2.4 years ago by Alex Reynolds
Matt Shirley (Cambridge, MA) wrote, 2.4 years ago:

You should just split your file into two separate files, a GFF and FASTA. Then you can do:
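The snippet that followed "Then you can do:" is not reproduced here, so what follows is not Matt Shirley's code. It is only a minimal standard-library sketch of the same idea: split the combined file at the `##FASTA` directive, then slice each CDS out of its parent sequence. The function names `split_gff3` and `extract_features` are made up for illustration, and the sketch assumes tab-delimited GFF3 columns and sequences small enough to hold in memory.

```python
# Reverse-complement helper table for minus-strand features.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def split_gff3(text):
    """Split combined GFF3 text into (annotation_lines, {seqid: sequence})."""
    gff_lines, chunks = [], {}
    in_fasta, current = False, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "##FASTA":
            in_fasta = True
        elif not in_fasta:
            gff_lines.append(line)
        elif line.startswith(">"):
            # The sequence ID is the first word of the header line.
            # Note: a repeated ID overwrites the earlier record.
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return gff_lines, {seqid: "".join(parts) for seqid, parts in chunks.items()}


def extract_features(gff_lines, sequences, feature_type="CDS"):
    """Yield (name, sequence) for every feature of the requested type."""
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.split("\t")  # assumption: columns are tab-delimited
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        seq = sequences[seqid][start - 1:end]  # GFF3 is 1-based, end-inclusive
        if strand == "-":
            seq = seq.translate(COMPLEMENT)[::-1]
        yield f"{seqid}:{start}-{end}({strand})", seq


# Toy input (not the poster's data) to show the round trip.
demo = "\n".join([
    "##gff-version 3",
    "chr1\tfeature\tCDS\t2\t7\t.\t+\t0\tname=a",
    "chr1\tfeature\tCDS\t2\t7\t.\t-\t0\tname=b",
    "##FASTA",
    ">chr1 toy sequence",
    "AATGC",
    "CGTA",
])
gff_lines, sequences = split_gff3(demo)
records = list(extract_features(gff_lines, sequences))
for name, seq in records:
    print(f">{name}\n{seq}")
```

Writing `gff_lines` and `sequences` out to two files would give the separate GFF and FASTA files suggested above, which can then be fed to other tools.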


this approach seemed easiest -- thanks!

Reply written 2.4 years ago by Lina F

How would you translate it to the amino acid sequence?

Reply written 3 months ago by Ric

Biopython? It has a plethora of methods that essentially mimic the central dogma (transcription and translation).

Reply written 5 weeks ago by manaswwm
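For the translation step, Biopython's `Seq` objects provide a `translate()` method (for example `Seq(cds).translate(to_stop=True)`). To show what that does without needing Biopython installed, here is a dependency-free sketch. It assumes the standard genetic code (NCBI translation table 1) and a CDS that starts in frame (phase 0); the function name `translate` and its `to_stop` flag are chosen for illustration and merely echo Biopython's naming.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds, to_stop=True):
    """Translate an in-frame nucleotide CDS into a protein string.

    Unknown or ambiguous codons become 'X'; trailing bases that do not
    fill a whole codon are ignored.
    """
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*" and to_stop:
            break  # stop at the first stop codon, dropping the '*'
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCAAGTAA"))  # prints MAK
```

If the organism uses a non-standard code, or the CDS phase is not 0, the table or the starting offset would have to be adjusted accordingly.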
prasundutta87 wrote, 2.4 years ago:

Since you have the coordinates of your sequences of interest, you can use samtools faidx.

You first have to index your reference FASTA with samtools faidx; after that, you can use the same command to extract your sequences of interest. The usage example that samtools faidx prints on the command line is helpful.
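A small sketch of how the GFF3 coordinates could be turned into region arguments for that command: `samtools faidx` takes regions written as `seqid:start-end` with 1-based, inclusive coordinates, which matches GFF3, so no offset is needed. The helper name `gff_to_regions` and the file name `genome.fa` are placeholders, and the sketch assumes tab-delimited GFF3 lines.

```python
def gff_to_regions(gff_lines, feature_type="CDS"):
    """Return 'seqid:start-end' strings for features of the given type."""
    regions = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence, not annotation
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")  # assumption: tab-delimited
        if len(cols) >= 8 and cols[2] == feature_type:
            regions.append(f"{cols[0]}:{cols[3]}-{cols[4]}")
    return regions


# Lines modelled on the snippet in the question, with tabs restored.
demo = [
    "##gff-version 3",
    "Consensus_10_consensus_sequence_1\tfeature\tgene\t551\t2104\t.\t+\t.\tname=gene",
    "Consensus_10_consensus_sequence_1\tfeature\tCDS\t551\t2104\t.\t+\t0\tname=YALI0C06512p CDS",
    "Consensus_9_consensus_sequence_4\tfeature\tCDS\t9625\t9891\t.\t+\t0\tname=YALI0A21351p CDS",
]
regions = gff_to_regions(demo)

# 'genome.fa' is a hypothetical file holding the sequences from the ##FASTA part.
print("samtools faidx genome.fa")                          # builds genome.fa.fai
print("samtools faidx genome.fa " + " ".join(regions))     # extracts the regions
```

Used this way, the regions come back as forward-strand sequence, so minus-strand features would still need to be reverse-complemented afterwards.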


If the FASTA sequence is already in the attributes field, it may be a simple parsing step to excise that info.

Reply written 2.4 years ago by Alex Reynolds
bartosovic.marek wrote, 2.4 years ago:

bedtools getfasta can do that.

Use -s to force strandedness, so that features on the minus strand are reverse-complemented.

http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html

bedtools getfasta -fi yourgenome.fa -fo out.fa -bed yourgff.bed -s
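The command above is given a `.bed` file. As far as I know the `-bed` argument also accepts GFF directly, but if a BED file restricted to CDS features is wanted, a conversion could look like the sketch below. The thing to watch is the coordinate shift: GFF3 starts are 1-based while BED starts are 0-based, and the end value stays the same. The helper name `gff_to_bed6` is hypothetical, the name column is simply the first attribute with spaces replaced, and tab-delimited input is assumed.

```python
def gff_to_bed6(gff_lines, feature_type="CDS"):
    """Convert GFF3 feature lines to BED6 lines (chrom, start, end, name, score, strand)."""
    bed = []
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break  # the embedded sequences are not annotation lines
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")  # assumption: tab-delimited
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        seqid, start, end, strand = cols[0], int(cols[3]), int(cols[4]), cols[6]
        # Use the first attribute as the name; avoid spaces inside a BED field.
        name = cols[8].split(";")[0].replace(" ", "_")
        # GFF3: 1-based inclusive  ->  BED: 0-based start, exclusive end.
        bed.append("\t".join([seqid, str(start - 1), str(end), name, "0", strand]))
    return bed


demo = [
    "##gff-version 3",
    "Consensus_10_consensus_sequence_1\tfeature\tgene\t551\t2104\t.\t+\t.\tname=gene",
    "Consensus_10_consensus_sequence_1\tfeature\tCDS\t551\t2104\t.\t+\t0\tname=YALI0C06512p CDS",
]
bed_lines = gff_to_bed6(demo)
print("\n".join(bed_lines))
```

Keeping the strand in the sixth column matters here, because that is the column `-s` relies on.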
